Is there any plan to show what this hardware can do for Mixtral-8x7B-Instruct? Based on the leaderboards[0], it is a better model than Llama2-70B, and I’m sure the T/s would be crazy high.
I haven't used the llama2 models much in quite a while, because they just aren't very good compared to other options that exist at this point. The instruction-tuned variants of Mistral and Mixtral seem to have very little trouble responding in JSON when I ask for it. However, with LLMs that you run yourself, you can also enforce a grammar for the response if you want to, guaranteeing that it will respond with valid JSON (that matches your schema!) and no extraneous text.
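For example, llama.cpp supports GBNF grammars for exactly this. Here's a minimal sketch using the llama-cpp-python bindings; the model path and the toy grammar are placeholders, and a real schema would be more involved:

```python
# Minimal sketch of grammar-constrained decoding with llama-cpp-python.
from llama_cpp import Llama, LlamaGrammar

# A tiny GBNF grammar that only admits a flat JSON object with a
# single "sentiment" field (illustrative, not a full JSON grammar).
GRAMMAR = r'''
root ::= "{" ws "\"sentiment\":" ws string ws "}"
string ::= "\"" [a-z]+ "\""
ws ::= [ \t\n]*
'''

llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf")  # placeholder path
grammar = LlamaGrammar.from_string(GRAMMAR)

out = llm(
    "Classify the sentiment of: 'I love this product!'\nAnswer as JSON.",
    grammar=grammar,   # decoding can only emit strings the grammar accepts
    max_tokens=64,
)
print(out["choices"][0]["text"])  # e.g. {"sentiment": "positive"}
```

Because the sampler literally cannot pick tokens that would violate the grammar, you get well-formed output even from models that would otherwise ramble.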
If you fine-tuned a base model (like the one in the article) on various inputs and the expected JSON output for each input, it would probably do even better.
They released a base model. It is not instruction-tuned, so it won't really follow instructions unless you fine-tune it to do that.
"There are lots of Mistral fine-tunes. Why another one?
A very healthy ecosystem of Mistral fine-tunes already exists, but they’re typically optimized for direct use. We wanted something different — a model optimized to be the strongest base model for further fine-tunes to be built on."
Base models are just trying to autocomplete the input text. The most logical completion for an instruction is something approximately like what you asked for, but base models are raw: they have not been taught to follow instructions, so they generally do a poor job. They're especially bad at knowing when to stop, and they will often invent their own follow-up questions, answer those, and keep going with more questions and more answers.
When chat models are trained, they are first pre-trained (the "PT" in "GPT"), which creates a base model, and then they are "fine-tuned" (RLHF, aligned, whatever you want to call it).
A base model can be fine-tuned with an instruction dataset (like OpenOrca[0]) to learn how to follow instructions or how to chat. It can also be fine-tuned on a collection of arbitrary inputs and their expected outputs, and learn how to do that specific task.
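To make that concrete, here's a hedged sketch of what a task-specific fine-tuning set might look like, using a simple JSONL prompt/completion layout. The exact schema depends on your training stack, and all of the examples below are made up for illustration:

```python
# Build a tiny task-specific fine-tuning dataset: each record pairs an
# input with the exact output we want the model to learn to produce
# (here, structured extraction into JSON).
import json

examples = [
    {
        "prompt": "Extract the name and date: 'Meet Alice on 2024-01-05.'",
        "completion": json.dumps({"name": "Alice", "date": "2024-01-05"}),
    },
    {
        "prompt": "Extract the name and date: 'Bob's review is due March 3rd.'",
        "completion": json.dumps({"name": "Bob", "date": "March 3rd"}),
    },
]

# One JSON object per line is the common JSONL convention.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```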
OpenPipe appears to specialize in fine-tuning base models for specific applications. They wanted a better base model. If you want it instruction-tuned, I'm sure they would be happy to help with that, or you can wait for someone in the community to make one of those from their base model... but I believe the whole point of the article is that a small, specialized model can outperform a large, general model. Their goal does not seem to be to build a tiny, general, chat-tuned model that outperforms GPT-4 in everything. They want you to train the base model on a very specific task, with the expectation that it will outperform GPT-4 and be tremendously cheaper to run at the same time. Many LLM tasks are centered around summarization, extraction, or classification, which have nothing to do with chatting.
What you're describing is the behavior you get from any base model that has not been instruction-tuned. The article is clear that this model is not for "direct use". It needs tuning for a specific application.
"Power*" made me think of Microsoft, so I was almost expecting this to be Windows-specific. (PowerShell, PowerPoint, Power BI, Power Apps, Power Automate... I'm probably forgetting some.)
The thought is, the more a person has used a model, the better they are at evaluating whether it is truly better or worse than another. You can't know if a model is better than another with a sample size of one.
Your test isn't checking instruction-following, consistency, or logic, just one fact, which the model you chose may have gotten right by chance. That's fine if you only expect the model to fact-check and don't plan to have a conversation, but if you want more than that, it doesn't tell you much.
I'm hoping there are votes in there that reflect those qualities, and filtering by conversation length seems like the easiest way to improve the vote quality a bit.
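As a rough sketch of that filter, assuming a hypothetical vote dump where each record carries the conversation messages and a verdict (the file name and schema here are invented):

```python
# Keep only votes cast after a multi-turn exchange, on the theory that
# longer conversations exercise consistency and instruction-following.
import json

MIN_TURNS = 3  # user+assistant round trips required to trust a vote

def load_votes(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

votes = load_votes("arena_votes.jsonl")  # hypothetical file and schema
longer = [
    v for v in votes
    if len(v["conversation"]) >= MIN_TURNS * 2  # two messages per turn
]
print(f"kept {len(longer)} of {len(votes)} votes")
```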
Mixtral is missing in half of the benchmarks in that paper. Hardly conclusive. It’s also common knowledge that these benchmarks have a lot of issues[0]. A good litmus test, but not a substitute for actually seeing how the models do in the real world.
On the topic of “hardly conclusive” things, Gemini Pro literally told me just a few minutes ago[1] that the Avatar movies did not have humans in them. There was no funny business in the prompting. At least Mixtral knows that Avatar has humans in it. Most of Gemini Pro’s responses have been fine, but not exceptional.
"Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain."
So, the Arena could theoretically be automated and achieve similar outcomes. Or at least, it could quickly compute a predicted Elo for every model, which would be interesting to compare against the human-rated outcomes.
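For illustration, a minimal sketch of that idea: run a judge model over pairwise matchups and feed the verdicts through the standard Elo update. The judge call is stubbed out, and none of this reflects how the Arena actually computes its ratings:

```python
# Sketch: turning pairwise "LLM as judge" verdicts into Elo-style ratings.
import itertools
import random

K = 32  # standard Elo update step

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def judge(model_a: str, model_b: str, prompt: str) -> float:
    """Placeholder for a call to a strong judge LLM.
    Returns 1.0 if A's response wins, 0.0 if B's wins, 0.5 for a tie."""
    return random.choice([1.0, 0.5, 0.0])  # stub

models = ["mixtral-8x7b", "llama2-70b", "solar-10.7b"]
ratings = {m: 1000.0 for m in models}
prompts = ["Explain Elo ratings in one paragraph."]  # toy prompt set

for prompt in prompts:
    # Ordered pairs, so each matchup runs in both positions,
    # which also helps wash out position bias in the judge.
    for a, b in itertools.permutations(models, 2):
        score_a = judge(a, b, prompt)
        e_a = expected(ratings[a], ratings[b])
        ratings[a] += K * (score_a - e_a)
        ratings[b] += K * ((1 - score_a) - (1 - e_a))

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```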
My understanding was that GPT-4 evaluation appeared to specifically favour text that GPT-4 would generate itself (leading to some bias towards GPT-based fine-tunes), although I can't remember the details.
In the paper, GPT-4 apparently shows a small bias (10%) towards itself, while GPT-3.5 did not show any measurable bias towards itself.
Given the possibility of bias, it would make sense to have the judge “recuse” itself from comparisons involving its own output. Between GPT-4, Claude, and soon Gemini Ultra, there should be several strong LLMs to choose from.
I don’t think it would be a replacement for human rating, but it would be interesting to see.
I wish that the Arena included a few more "interesting" models, like the new Phi-2 model and the current TinyLlama model, which are trying to push the limits of small models. Solar-10.7B is another interesting model that seems to be missing, but I just learned about it yesterday, and it came out only about a week ago, so maybe it's too new. Solar supposedly outperforms Mixtral-8x7B with a fraction of the total parameters, although it seems optimized for single-turn conversation, so maybe it falls apart over multiple messages (I'm not sure).
[0]: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...