Is there any plan to show what this hardware can do for Mixtral-8x7B-Instruct? Based on the leaderboards[0], it is a better model than Llama2-70B, and I’m sure the T/s would be crazy high.
I haven't used the llama2 models much in quite a while, because they just aren't very good compared to other options that exist at this point. The instruction-tuned variants of Mistral and Mixtral seem to have very little trouble responding in JSON when I ask for it. However, with LLMs that you run yourself, you can also enforce a grammar for the response if you want to, guaranteeing that it will respond with valid JSON (that matches your schema!) and no extraneous text.
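For example, llama.cpp supports GBNF grammars for exactly this. Here's a minimal sketch using the llama-cpp-python bindings; the model path and the toy grammar are placeholders, and a real schema would be more involved:

```python
# Minimal sketch of grammar-constrained decoding with llama-cpp-python.
from llama_cpp import Llama, LlamaGrammar

# A tiny GBNF grammar that only admits a flat JSON object with a
# single "sentiment" field (illustrative, not a full JSON grammar).
GRAMMAR = r'''
root ::= "{" ws "\"sentiment\":" ws string ws "}"
string ::= "\"" [a-z]+ "\""
ws ::= [ \t\n]*
'''

llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf")  # placeholder path
grammar = LlamaGrammar.from_string(GRAMMAR)

out = llm(
    "Classify the sentiment of: 'I love this product!'\nAnswer as JSON.",
    grammar=grammar,   # decoding can only emit strings the grammar accepts
    max_tokens=64,
)
print(out["choices"][0]["text"])  # e.g. {"sentiment": "positive"}
```

Because the sampler literally cannot pick tokens that would violate the grammar, you get well-formed output even from models that would otherwise ramble.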
If you fine-tuned a base model (like the one in the article) on various inputs and the expected JSON output for each input, it would probably do even better.
They released a base model. It is not instruction-tuned, so it won't really follow instructions unless you fine-tune it to do that.
"There are lots of Mistral fine-tunes. Why another one?
A very healthy ecosystem of Mistral fine-tunes already exists, but they’re typically optimized for direct use. We wanted something different — a model optimized to be the strongest base model for further fine-tunes to be built on."
Base models are just trying to autocomplete the input text. The most logical completion for an instruction is something approximately like what you asked for, but base models are raw: they have not been taught to follow instructions, so they generally do a poor job. They're especially bad at knowing when to stop, and they will often invent their own follow-up questions, answer those, and keep going with more questions and more answers.
When chat models are trained, they are first pre-trained (the "PT" in "GPT"), which creates a base model, and then they are "fine-tuned" (RLHF, aligned, whatever you want to call it).
A base model can be fine-tuned with an instruction dataset (like OpenOrca[0]) to learn how to follow instructions or how to chat. It can also be fine-tuned on a collection of arbitrary inputs and their expected outputs, and learn how to do that specific task.
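To make that concrete, here's a hedged sketch of what a task-specific fine-tuning set might look like, using a simple JSONL prompt/completion layout. The exact schema depends on your training stack, and all of the examples below are made up for illustration:

```python
# Build a tiny task-specific fine-tuning dataset: each record pairs an
# input with the exact output we want the model to learn to produce
# (here, structured extraction into JSON).
import json

examples = [
    {
        "prompt": "Extract the name and date: 'Meet Alice on 2024-01-05.'",
        "completion": json.dumps({"name": "Alice", "date": "2024-01-05"}),
    },
    {
        "prompt": "Extract the name and date: 'Bob's review is due March 3rd.'",
        "completion": json.dumps({"name": "Bob", "date": "March 3rd"}),
    },
]

# One JSON object per line is the common JSONL convention.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```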
OpenPipe appears to specialize in fine-tuning base models for specific applications. They wanted a better base model. If you want it instruction-tuned, I'm sure they would be happy to help with that, or you can wait for someone in the community to make one of those from their base model... but I believe the whole point of the article is that a small, specialized model can outperform a large, general model. Their goal does not seem to be to build a tiny, general, chat-tuned model that outperforms GPT-4 in everything. They want you to train the base model on a very specific task, with the expectation that it will outperform GPT-4 and be tremendously cheaper to run at the same time. Many LLM tasks are centered around summarization, extraction, or classification, which have nothing to do with chatting.
What you're describing is the behavior you get from any base model that has not been instruction-tuned. The article is clear that this model is not for "direct use". It needs tuning for a specific application.
"Power*" made me think of Microsoft, so I was almost expecting this to be Windows-specific. (PowerShell, PowerPoint, Power BI, Power Apps, Power Automate... I'm probably forgetting some.)
The thought is, the more a person has used a model, the better they are at evaluating whether it is truly better or worse than another. You can't know if a model is better than another with a sample size of one.
Your test isn't checking instruction-following, consistency, or logic, just one fact, which the model you chose may have gotten right by chance. That's fine if you only expect the model to fact-check and don't plan to have a conversation, but if you want more than that, it doesn't tell you much.
I'm hoping there are votes in there that reflect those qualities, and filtering by conversation length seems like the easiest way to improve the vote quality a bit.
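As a rough sketch of that filter, assuming a hypothetical vote dump where each record carries the conversation messages and a verdict (the file name and schema here are invented):

```python
# Keep only votes cast after a multi-turn exchange, on the theory that
# longer conversations exercise consistency and instruction-following.
import json

MIN_TURNS = 3  # user+assistant round trips required to trust a vote

def load_votes(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

votes = load_votes("arena_votes.jsonl")  # hypothetical file and schema
longer = [
    v for v in votes
    if len(v["conversation"]) >= MIN_TURNS * 2  # two messages per turn
]
print(f"kept {len(longer)} of {len(votes)} votes")
```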
Mixtral is missing in half of the benchmarks in that paper. Hardly conclusive. It’s also common knowledge that these benchmarks have a lot of issues[0]. A good litmus test, but not a substitute for actually seeing how the models do in the real world.
On the topic of “hardly conclusive” things, Gemini Pro literally told me just a few minutes ago[1] that the Avatar movies did not have humans in them. There was no funny business in the prompting. At least Mixtral knows that Avatar has humans in it. Most of Gemini Pro’s responses have been fine, but not exceptional.
"Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain."
So, the Arena could theoretically be automated and achieve similar outcomes. Or at least, it could quickly compute a predicted Elo for every model, which would be interesting to compare against the human-rated outcomes.
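For illustration, a minimal sketch of that idea: run a judge model over pairwise matchups and feed the verdicts through the standard Elo update. The judge call is stubbed out, and none of this reflects how the Arena actually computes its ratings:

```python
# Sketch: turning pairwise "LLM as judge" verdicts into Elo-style ratings.
import itertools
import random

K = 32  # standard Elo update step

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def judge(model_a: str, model_b: str, prompt: str) -> float:
    """Placeholder for a call to a strong judge LLM.
    Returns 1.0 if A's response wins, 0.0 if B's wins, 0.5 for a tie."""
    return random.choice([1.0, 0.5, 0.0])  # stub

models = ["mixtral-8x7b", "llama2-70b", "solar-10.7b"]
ratings = {m: 1000.0 for m in models}
prompts = ["Explain Elo ratings in one paragraph."]  # toy prompt set

for prompt in prompts:
    # Ordered pairs, so each matchup runs in both positions,
    # which also helps wash out position bias in the judge.
    for a, b in itertools.permutations(models, 2):
        score_a = judge(a, b, prompt)
        e_a = expected(ratings[a], ratings[b])
        ratings[a] += K * (score_a - e_a)
        ratings[b] += K * ((1 - score_a) - (1 - e_a))

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```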
My understanding was that GPT-4 evaluation appeared to specifically favour text that GPT-4 would generate itself (leading to some bias towards GPT-based fine-tunes), although I can't remember the details.
In the paper, GPT-4 apparently shows a small bias (10%) towards itself, while GPT-3.5 did not show any measurable bias towards itself.
Given the possibility of bias, it would make sense to have the judge “recuse” itself from comparisons involving its own output. Between GPT-4, Claude, and soon Gemini Ultra, there should be several strong LLMs to choose from.
I don’t think it would be a replacement for human rating, but it would be interesting to see.
I wish that the Arena included a few more "interesting" models, like the new Phi-2 model and the current TinyLlama model, which are trying to push the limits of small models. Solar-10.7B is another interesting model that seems to be missing, but I just learned about it yesterday, and it came out only about a week ago, so maybe it's too new. Solar supposedly outperforms Mixtral-8x7B with a fraction of the total parameters, although it seems optimized for single-turn conversation, so maybe it falls apart over multiple messages (I'm not sure).
[0]: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...