coder543's comments | Hacker News

Is there any plan to show what this hardware can do for Mixtral-8x7B-Instruct? Based on the leaderboards[0], it is a better model than Llama2-70B, and I’m sure the T/s would be crazy high.

[0]: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...


Yup, deploying Mixtral is a work in progress. Watch this space!


I haven't used the Llama 2 models much in quite a while, because they just aren't very good compared to other options that exist at this point. The instruction-tuned variants of Mistral and Mixtral seem to have very little trouble responding in JSON when I ask for it. However, with LLMs that you run yourself, you can also enforce a grammar for the response if you want to, guaranteeing that it will respond with valid JSON (that matches your schema!) and no extraneous text.

Something potentially helpful here: https://github.com/ggerganov/llama.cpp/discussions/2494
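
For example, here is a minimal sketch of grammar-constrained generation via llama-cpp-python. The JSON schema, field names, and model file are made up for illustration; this is a sketch of the idea, not a production setup:

    # Sketch: force valid JSON output with a GBNF grammar.
    # Schema and model path are illustrative assumptions.
    from llama_cpp import Llama, LlamaGrammar

    grammar = LlamaGrammar.from_string(r'''
    root   ::= "{" ws "\"city\"" ws ":" ws string ws "," ws "\"population\"" ws ":" ws number ws "}"
    string ::= "\"" [A-Za-z ]* "\""
    number ::= [0-9]+
    ws     ::= [ \t\n]*
    ''')

    llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf")
    out = llm("Give me the city and population of France's capital as JSON.",
              grammar=grammar, max_tokens=128)
    print(out["choices"][0]["text"])  # constrained to match the grammar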

If you fine-tuned a base model (like the one in the article) on various inputs and the expected JSON output for each input, it would probably do even better.


They released a base model. It is not instruction-tuned, so it won't really follow instructions unless you fine-tune it to do that.

"There are lots of Mistral fine-tunes. Why another one?

A very healthy ecosystem of Mistral fine-tunes already exists, but they’re typically optimized for direct use. We wanted something different — a model optimized to be the strongest base model for further fine-tunes to be built on."


Then how come the base model can somewhat follow instructions, just not very well?


Base models are just trying to autocomplete the input text. The most logical completion for an instruction is something approximately like what you asked for, but base models are raw. They have not been taught to follow instructions, so they generally do a poor job. They're especially bad at knowing when to stop, and they will often generate their own follow-up questions, answer those, and continue with more questions and more answers.

When chat models are trained, they are first pre-trained (the "PT" in "GPT"), which creates a base model, and then they are "fine-tuned" (RLHF, aligned, whatever you want to call it).

A base model can be fine-tuned with an instruction dataset (like OpenOrca[0]) to learn how to follow instructions or how to chat. It can also be fine-tuned on a collection of arbitrary inputs paired with their expected outputs, and learn how to do that specific task.

OpenPipe appears to specialize in fine-tuning base models for specific applications. They wanted a better base model. If you want it instruction-tuned, I'm sure they would be happy to help with that, or you can wait for someone in the community to make one of those from their base model... but I believe the whole point of the article is that a small, specialized model can outperform a large, general model. Their goal does not seem to be to build a tiny, general, chat-tuned model that outperforms GPT-4 in everything. They want you to train the base model on a very specific task, with the expectation that it will outperform GPT-4 and be tremendously cheaper to run at the same time. Many LLM tasks are centered around summarization, extraction, or classification, which have nothing to do with chatting.

[0]: https://huggingface.co/datasets/Open-Orca/OpenOrca
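
As a rough illustration of that instruction-tuning step, here is a minimal sketch using the trl library's SFTTrainer on OpenOrca. The prompt template and hyperparameters are my own assumptions, not a tested recipe:

    # Sketch: instruction-tune a base model on OpenOrca with trl.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import SFTTrainer

    model_name = "mistralai/Mistral-7B-v0.1"  # any base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Flatten each (instruction, response) pair into one training string.
    def to_text(row):
        return {"text": "### Instruction:\n" + row["question"] +
                        "\n\n### Response:\n" + row["response"]}

    dataset = load_dataset("Open-Orca/OpenOrca", split="train").map(to_text)

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=2048,
    )
    trainer.train()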


The article shows (fine tuned) Mistral 7B outperforming GPT-4, never mind GPT-3.5.


Based on my usage, this model is not even close to GPT-3.5. For one, it does not follow instructions properly, and it just runs on and on.


What you're describing is the behavior you get from any base model that has not been instruction-tuned. The article is clear that this model is not for "direct use". It needs tuning for a specific application.


How does one fine-tune it to follow instructions? I would have thought there are open-source training sets for these instruction-following fine-tunes?


"Power*" made me think of Microsoft, so I was almost expecting this to be Windows-specific. (PowerShell, PowerPoint, Power BI, Power Apps, Power Automate... I'm probably forgetting some.)


PowerToys are probably the original (going back to PowerToys for Windows 95).

Edit: https://socket3.wordpress.com/2016/10/22/using-windows-95-po...


PowerPoint existed in the late '80s, I think, although from what I understand, Microsoft acquired it rather than creating it.



Why filter out the votes made after only one or two prompts? A lot of times, a single response is all you need to see.

Do you really need more than this to know which one you’re going to pick? https://i.imgur.com/En37EJD.png

Avatar doesn’t have humans? Seriously?


The thought is, the more a person has used a model, the better they are at evaluating whether or not it is truly worse than another. You can't know if a model is better than another with a sample size of one.

Your test isn't checking instruction following, consistency, or logic; it checks a single fact, which the model you chose may have gotten right by chance. That's fine if you only expect the model to fact-check and you don't plan to have a conversation, but if you want more than that, it doesn't work very well.

I'm hoping there are votes in there which can reflect those qualities, and filtering by conversation length seems like the easiest way to improve the vote quality a bit.
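
Concretely, the filter could be as simple as this sketch (the vote structure is invented; I don't know the Arena's actual data format):

    # Toy sketch: keep only votes cast after longer conversations.
    def filter_votes(votes, min_turns=3):
        return [v for v in votes if v["num_turns"] >= min_turns]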


Mixtral ranks higher than Gemini Pro on the (subjective) Chatbot Arena Leaderboard: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

Where are you seeing that it is "further behind Gemini Pro than Gemini Pro is behind GPT 3.5"?


Presumably in the very article this HN submission is for (https://arxiv.org/pdf/2312.11444.pdf), table 1.


Mixtral is missing in half of the benchmarks in that paper. Hardly conclusive. It’s also common knowledge that these benchmarks have a lot of issues[0]. A good litmus test, but not a substitute for actually seeing how the models do in the real world.

On the topic of “hardly conclusive” things, Gemini Pro literally told me just a few minutes ago[1] that the Avatar movies did not have humans in them. There was no funny business in the prompting. At least Mixtral knows that Avatar has humans in it. Most of Gemini Pro’s responses have been fine, but not exceptional.

[0]: one random article talking about these issues: https://www.surgehq.ai//blog/hellaswag-or-hellabad-36-of-thi...

[1]: https://i.imgur.com/En37EJD.png


But what if you could make an SAT that is equivalent to evaluating years of performance at work?

https://huggingface.co/papers/2306.05685

This paper makes the argument that...

"Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain."

So, the Arena could theoretically be automated and achieve similar outcomes. Or at least, it could quickly determine a predicted Elo rating for every model, which would be interesting to compare against the human-rated outcomes.
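
As a sketch of what that automation might look like (the judge prompt is my own wording, not the paper's exact template, and the Elo update is the standard formula):

    # Hypothetical pairwise LLM judge plus a standard Elo update.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    JUDGE_PROMPT = (
        "You are an impartial judge. Compare the two responses to the "
        "question below and reply with exactly \"A\", \"B\", or \"tie\".\n\n"
        "[Question]\n{question}\n\n"
        "[Response A]\n{answer_a}\n\n"
        "[Response B]\n{answer_b}"
    )

    def judge(question, answer_a, answer_b):
        resp = client.chat.completions.create(
            model="gpt-4",
            temperature=0,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b)}],
        )
        return resp.choices[0].message.content.strip()

    def update_elo(r_a, r_b, winner, k=32):
        # Expected score from the rating gap, then nudge both ratings
        # toward the observed outcome.
        e_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
        s_a = {"A": 1.0, "B": 0.0, "tie": 0.5}[winner]
        return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))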


My understanding was that GPT-4 evaluation appeared to specifically favour text that GPT-4 would generate itself (leading to some bias towards GPT-based fine-tunes), although I can't remember the details.


GPT-4 apparently shows a small bias (10%) towards itself in the paper, and GPT-3.5 apparently did not show any measurable bias towards itself.

Given the possibility of bias, it would make sense to have the judge “recuse” itself from comparisons involving its own output. Between GPT-4, Claude, and soon Gemini Ultra, there should be several strong LLMs to choose from.
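
Concretely, the recusal rule could be a one-liner (the model names here are just illustrative):

    # Toy sketch: pick a judge that produced neither response, to
    # sidestep self-preference bias.
    JUDGES = ["gpt-4", "claude-2", "gemini-ultra"]

    def pick_judge(model_a, model_b):
        for judge_model in JUDGES:
            if judge_model not in (model_a, model_b):
                return judge_model
        raise ValueError("no impartial judge available")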

I don’t think it would be a replacement for human rating, but it would be interesting to see.


I wish that Arena included a few more "interesting" models like the new Phi-2 model and the current tinyllama model, which are trying to push the limits on small models. Solar-10.7B is another interesting model that seems to be missing, but I just learned about it yesterday, and it seems to have come out a week ago, so maybe it's too new. Solar supposedly outperforms Mixtral-8x7B with a fraction of the total parameters, although Solar seems optimized for single-turn conversation, so maybe it falls apart over multiple messages (I'm not sure).


Solar-10.7B is present in the battle arena, but there are probably not enough votes yet for it to appear in the rankings.


> like the new Phi-2 model

Phi-2 isn't fine-tuned for instruction following yet.


