An In-depth Look at Gemini's Language Abilities

unstuck3958 · on Dec 19, 2023

It's incredible how accurate the Chatbot Arena Leaderboard [0] is at predicting model performance compared to benchmarks (which can and are being gamed, see all the 7B models on HF leaderboard)

[0]: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

paxys · on Dec 19, 2023

It's because it isn't "predicting" anything, but rather aggregating user feedback. That is of course going to be closest to judging the subjective "best" model that pleases most people.

It's like saying how can evaluating 5 years of performance at work be better at predicting someone's competency than their SAT scores.

coder543 · on Dec 19, 2023

But, what if you could make an SAT that is equivalent to evaluating years of performance at work?

https://huggingface.co/papers/2306.05685

This paper makes the argument that...

"Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain."

So, the Arena could theoretically be automated and achieve similar outcomes. Or at least, it could quickly determine a predicted-ELO for every model, which would be interesting to compare against the human-rated outcomes.

_kuvn · on Dec 19, 2023

My understanding was that GPT4 evaluation appeared to specifically favour text that GPT4 would generate itself (leading to some bias towards gpt-based fine-tunes), although I can't remember the details

coder543 · on Dec 19, 2023

GPT-4 apparently shows a small bias (10%) towards itself in the paper, and GPT-3.5 apparently did not show any measurable bias towards itself.

Given the possibility of bias, it would make sense to have the judge “recuse” itself from comparisons involving its own output. Between GPT-4, Claude, and soon Gemini Ultra, there should be several strong LLMs to choose from.

I don’t think it would be a replacement for human rating, but it would be interesting to see.

coder543 · on Dec 19, 2023

I wish that Arena included a few more "interesting" models like the new Phi-2 model and the current tinyllama model, which are trying to push the limits on small models. Solar-10.7B is another interesting model that seems to be missing, but I just learned about it yesterday, and it seems to have come out a week ago, so maybe it's too new. Solar supposedly outperforms Mixtral-8x7B with a fraction of the total parameters, although Solar seems optimized for single-turn conversation, so maybe it falls apart over multiple messages (I'm not sure).

GaggiX · on Dec 19, 2023

Solar-10.7B is present in the battle arena but there are probably not enough votes for the ranking.

unstuck3958 · on Dec 20, 2023

> like the new Phi-2 model

Phi-2 isn't fine tuned for instruction following yet.

s-macke · on Dec 19, 2023

In fact, the performance differences between the models are so significant that even a micro benchmark demonstrates their capabilities.

For example, consider my analysis [0] based on observing the progression of Large Language Models (LLMs) in a single text adventure.

[0] https://github.com/s-macke/AdventureAI#evaluation-of-other-m...

nabakin · on Dec 19, 2023

It's much more accurate than the Open LLM Leaderboard, that's for sure. Human evaluation has always been the gold standard. I just wish we could filter by the votes which were made after only one or two prompts and I hope they don't include the non-blind votes in the results.

GaggiX · on Dec 19, 2023

These are the rules of the battle arena:

-Ask any question to two anonymous models (e.g., ChatGPT, Claude, Llama) and vote for the better one!

-You can continue chatting until you identify a winner.

-Vote won’t be counted if model identity is revealed during conversation.

nabakin · on Dec 19, 2023

Perfect ty!

thomasahle · on Dec 19, 2023

It's not completely blind/anonymous, since you can just ask "What's your name" and the model will identify itself.

Edit: I missed the third rule. I wonder how smart their detection is.

GaggiX · on Dec 19, 2023

That's why the third rule exists.

coder543 · on Dec 19, 2023

Why filter out the votes made after only one or two prompts? A lot of times, a single response is all you need to see.

Do you really need more than this to know which one you’re going to pick? https://i.imgur.com/En37EJD.png

Avatar doesn’t have humans? Seriously?

nabakin · on Dec 19, 2023

The thought is, the more a person has used a model, the better they are at evaluating whether or not it is truly worse than another. You can't know if a model is better than another with a sample size of one.

Your test isn't checking for instructions, consistency, logic, just one fact which the model you chose may have gotten right by chance. It's fine assuming you only expect the model to fact check and you don't plan to have a conversation, but if you want more than that, it doesn't work very well.

I'm hoping there are votes in there which can reflect those qualities and filtering by conversation length seems like the easiest way to improve the vote quality a bit.

londons_explore · on Dec 19, 2023

The chart on the bottom left corner of that page shows quite how far ahead the various GPT-4 models are compared to everyone else...

_giorgio_ · on Dec 19, 2023

I've used the Arena a lot, and the differences between models are very clear 90% of the times.

I only make technical (pytorch) questions though.

3abiton · on Dec 19, 2023

Thanks for the reference I was searching for a benchmark that can quantify the typical user experience, as most synthetic ones are completly ineffective. At what sample size the ranking become significant? Or is it baked in the metrics (ELO)?

bitshiftfaced · on Dec 19, 2023

Elo converges on stable scores fairly quickly, depending on the K-factor. I wouldn't think it would be much of an issue at all for something like this, since you can ensure you're testing against every other member (avoiding "Elo islands"). But obviously the more trials the better.

The Glicko rating system is very similar to Elo, but it also models the variance of a given rating. It can directly tell you a "rating deviation."

moffkalast · on Dec 19, 2023

It's astounding that Mixtral Instruct ties with 3.5-turbo while being ~10x smaller.

AdrienBrault · on Dec 19, 2023

3.5-turbo might be 20B, not 10x larger

https://www.reddit.com/r/LocalLLaMA/comments/17jrj82/new_mic...

orbital-decay · on Dec 19, 2023

Let's see... the linked arXiv article has been withdrawn by the author with the following comment:

> Contains inappropriately sourced conjecture of OpenAI's ChatGPT parameter count from this http URL, a citation which was omitted. The authors do not have direct knowledge or verification of this information, and relied solely on this article, which may lead to public confusion

The URL in question: https://www.forbes.com/sites/forbestechcouncil/2023/02/17/is...

This article was written by Aleks Farseev, the CEO of SoMonitor.ai, who makes the claim with no source or explanation:

> ChatGPT is not just smaller (20 billion vs. 175 billion parameters) and therefore faster than GPT-3

moffkalast · on Dec 19, 2023

Hmm right, the ~300B figure may have been for the non-turbo 3.5

dannyw · on Dec 19, 2023

Are you sure it’s 10x smaller? I’d be surprised if OpenAI hasn’t been massively distilling their models.

paxys · on Dec 19, 2023

Has anyone (outside of Google) gotten to play with Gemini Ultra yet? Been hearing a lot about Pro, but I'd be interested in seeing whether Ultra is really close to as capable as they claim.

Also very interesting that Mixtral 8x7B ranks in the same neighborhood as Gemini Pro/GPT 3.5 Turbo/Claude 2.1 while being fully open source and Apache 2.0 licensed.

mkmk · on Dec 19, 2023

Mixtral is a mystery to me. How in the world is that team on par with/beating GOOGLE, who presumably have all the resources in the world to throw at this?

n2d4 · on Dec 19, 2023

Mixtral is on-par with Gemini Pro, not Gemini Ultra (and even there it is further behind Gemini Pro than Gemini Pro is behind GPT 3.5). But to directly answer your question, they are quite well-funded, having raised over $700mil to date. I definitely wouldn't count them out.

coder543 · on Dec 19, 2023

Mixtral ranks higher than Gemini Pro on the (subjective) Chatbot Arena Leaderboard: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

Where are you seeing that it is "further behind Gemini Pro than Gemini Pro is behind GPT 3.5"?

jsnell · on Dec 19, 2023

Presumably in the very article this HN submission is for (https://arxiv.org/pdf/2312.11444.pdf), table 1.

coder543 · on Dec 19, 2023

Mixtral is missing in half of the benchmarks in that paper. Hardly conclusive. It’s also common knowledge that these benchmarks have a lot of issues[0]. A good litmus test, but not a substitute for actually seeing how the models do in the real world.

On the topic of “hardly conclusive” things, Gemini Pro literally told me just a few minutes ago[1] that the Avatar movies did not have humans in them. There was no funny business in the prompting. At least Mixtral knows that Avatar has humans in it. Most of Gemini Pro’s responses have been fine, but not exceptional.

[0]: one random article talking about these issues: https://www.surgehq.ai//blog/hellaswag-or-hellabad-36-of-thi...

[1]: https://i.imgur.com/En37EJD.png

euazOn · on Dec 19, 2023

Gemini Ultra is not out yet. With the same logic, you could compare an unreleased Mistral model with Gemini Ultra.

n2d4 · on Dec 19, 2023

Right. I'm just pointing out that comparing one model with a distilled version of another and then making broad statements about the companies behind them isn't really useful.

Surely you could make a comparison of two unreleased models, but it wouldn't be interesting because you don't have any real data (and benchmarks don't really mean anything).

jstummbillig · on Dec 19, 2023

Debating the usefulness of hn commentary is a somewhat philosophical issue, but I think it's entirely fair to draw parallels between what is, not what might be.

Gemini Ultra is self-evidently not ready for production. What the issues are? Who knows, but in a game that as of right now is mostly about reducing the amount of brute force required, something as "simple" as not being efficient enough is actually not something to gloss over. If your engines entire stick is having the greatest graphics but you can't make it run at acceptable fps, well, then it's not actually a usable product.

A LLM that is not actually released could very well be in a comparably dire state and fixing it while also delivering on the promised performance might be entirely non-trivial.

sp332 · on Dec 19, 2023

Mistral “Medium” is available (in beta, via API) and should give better results than the “Small” mixtral model.

carterschonwald · on Dec 19, 2023

My understanding, however fuzzy, is that all the safety/politeness tuning results in models that are at times less likely to give accurate responses. That said, I suspect that either way both types of models largely give similar answers for soft questions aside from those politeness and safety things

cfiggers · on Dec 19, 2023

There's a survivorship bias going on here. You've never heard of the thousands of teams out there that are Mistral's size but AREN'T getting results that compete on the global stage, but they do exist. But you've heard of Google, whether they're getting it right or not.

paxys · on Dec 19, 2023

"Thousands of teams" is a vast exaggeration. A tiny handful of companies out there have received funding to the tune of a billion dollars for model training like Mixtral. All of them have researchers with loaded resumes, and most are producing stuff of value. The thousands of other startups in the ecosystem are then taking these APIs and adding trivial abstractions on top.

pixl97 · on Dec 19, 2023

I wonder how much GPU compute time is effectively being discarded by these companies as the results turn out to be garbage?

moffkalast · on Dec 19, 2023

Mistral.AI was founded by three people from Deepmind, they're beating Google because Google no longer has them.

guyomes · on Dec 19, 2023

Slight correction, Mistral.AI was funded by two people from Meta (Guillaume Lample, Timothée Lacroix) and one from Deepmind (Arthur Mensch).

For new technologies, what matters most might be the universities where people are from, rather than the companies. The founders of Google graduated from Stanford. The founders of Mistra AI graduated from École Polytechnique and École Normale Supérieure, that are renowned in France, notably for their scientific formations.

paxys · on Dec 19, 2023

Same as OpenAI, Anthropic, Cohere, Adept and hundreds of other small-mid sized AI startups. When the dust settles and the space gets more mature the exodus from Google Brain/Deepmind over the last few years will be considered this generation's Fairchild moment.

bitshiftfaced · on Dec 19, 2023

I wonder if Google sees MoE as a sort of local maxima, and so they tried a different path hoping it might outperform it.

niklasrde · on Dec 19, 2023

Looks like Microsoft people have had access to it to benchmark promptbase: https://github.com/microsoft/promptbase

n2d4 · on Dec 19, 2023

Those benchmarks are from the Gemini Ultra announcement post

niklasrde · on Dec 19, 2023

Oops, sorry, yes you're right, I think. I was skimming the lists of prompts incorrectly.

SeanAnderson · on Dec 19, 2023

The hosts of the All In podcast have used it, but they're billionaires. They think highly of Ultra. Early on they just talk about the paper that's released, but then they drop that they've used Ultra.

https://www.youtube.com/watch?v=IeKUcpU5-Xk&t=3667s

paxys · on Dec 19, 2023

Started watching the video wondering when it would turn into a rant about "wokeness" and "cancel culture", and it happened about 30 seconds in. Glad to see these guys haven't changed.

helsinkiandrew · on Dec 19, 2023

One thing this doesn't cover is speed/latency of response. Some of the things I'm doing on ChatGPT could probably work on a 90% capable GPT3.5 and other factors become more of an issue.

If I was already using GCP and they reduced their price (>10%) and offered tight integration with rest of GCP services it would still be appealing.

GaggiX · on Dec 19, 2023

Gemini Pro API charges you by char instead of by token, I guess it's good news for Chinese/Japanese users and other speakers of less common languages, for example the OpenAI tokenizer needs two tokens to encode a single Georgian letter.

m3kw9 · on Dec 19, 2023

Starling 7b alpha is close to gpt3.5 but it asking list 5 capitals sorted by population size and it will run on like forever generating 100 capitals and explainations and gibrish. Not sure if it’s my prompt, but it sure isn’t close to the intelligence of gpt3.5 when asked the same.

Starling:

```

USER list 5 largest capitals sorted by largest population first

AI Here are some countries and their capitals in alphabetical order Australia: Canberra Belgium: Brussels Canada: Ottawa Finland: Helsinki Italy: Rome France: Paris Germany: Berlin Greece: Athens Ireland: Dublin …(lists 50 others)

Gpt3.5

list 5 largest capitals sorted by largest population first

ChatGPT Certainly! To list the 5 largest capital cities by population, I'll provide you with the information as of my knowledge cutoff date in 2022. Please note that population figures may change over time, and it's a good idea to verify the latest data from reliable sources if needed. Here are the 5 largest capital cities sorted by population, starting with the largest:

Beijing, China: Beijing is the capital of China and has a population of over 21 million people as of 2022. New Delhi, India: New Delhi is the capital of India and is part of the larger National Capital Territory of Delhi. It has a population of approximately 16.8 million people as of 2022. Tokyo, Japan: Tokyo is the capital of Japan, and while the city itself is not officially designated as the capital, it serves as the de facto capital. The Tokyo metropolitan area has a population of around 14 million people. Moscow, Russia: Moscow is the capital of Russia and has a population of over 12 million people. Dhaka, Bangladesh: Dhaka is the capital of Bangladesh and has a population of approximately 8.9 million people. ```

vczf · on Dec 19, 2023

Sampling methods also affect this. Have you tried min_p sampling? https://github.com/ggerganov/llama.cpp/pull/3841#issuecommen...

dang · on Dec 19, 2023

Submitters: "Please use the original title, unless it is misleading or linkbait; don't editorialize." - https://news.ycombinator.com/newsguidelines.html

(Submitted title was "Gemini Pro achieves accuracy slightly inferior to GPT 3.5 Turbo".)

If you want to say what you think is important about an article, that's fine, but do it by adding a comment to the thread. Then your view will be on a level playing field with everyone else's: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...

jiggawatts · on Dec 20, 2023

Does anyone else have the sinking feeling that GPT 4 is as good as things will get for quite a while?

Someone described LLMs as “blurry JPEGs of the Internet”.

In that sense, maybe GPT 4 is as smart as the hive mind of the Internet gets, and newer models just take sharper pictures but of the same subject. Perhaps GPT 4 trained on one of the best subsets available and everything else is going to be worse or the same…

It’s curious that Sam Altman has publicly stated that OpenAI isn’t working on GPT 5. Why not? Is it because they know it’s a pointless exercise with the current training approaches?

lsy · on Dec 19, 2023

I don't think "accuracy" is going to be the defining feature of which chatbot succeeds. People just aren't using them for tasks where a 3-5 point difference makes the grade, because the difference between 67 and 100 is more important than the difference between 64 and 67. If you can integrate a relatively speedy bot somewhere people can use it conveniently that'll get more usage than a slightly more factual response you have to tab out to.

we_love_idf · on Dec 19, 2023

I don't understand why people keep falling for Google's ad campaign. Google have its lead in AI playing video games and board games. It is cool, entertaining and all that jazz. But OpenAI and MS are the real leaders in real AI.

myko · on Dec 19, 2023

https://arxiv.org/abs/1706.03762 https://research.google/pubs/attention-is-all-you-need/

let's not forget where this breakthrough came from, i wouldn't count Google out

czl · on Dec 20, 2023

I hope you are correct yet Xerox and Kodak paid for plenty of breakthroughs, did they not? Google faces a similar "innovator's dilemma" challenge: successful companies often struggle to embrace new technologies or business models that could disrupt their existing, successful products.

nicklecompte · on Dec 19, 2023

This is a very ignorant comment. AlphaFold is far more useful than ChatGPT and remains a more impressive piece of technology, even if it doesn't help you write boilerplate code.

sidibe · on Dec 19, 2023

Even if you don't think Google doesn't have a talent or product chops to be leaders in AI, Google can do things cheaper than others because of their infrastructure. When they do release something useful they'll probably be able to offer it free and force it on people on the most visited pages/most used browser. Surprised how many people think having a years head start means OpenAI and Microsoft are going to always be ahead

visarga · on Dec 19, 2023

I have been waiting for Google to release a decent translation model, but it seems they will only offer a cheap one. DeepL has been besting Google Translate for years. Why? I think Google has aversion to making large models public.

vitorgrs · on Dec 23, 2023

I don't even think Gemini Ultra will be free. They said it will be offered on "Bard Advanced", so not standard Bard... Which makes me thinking that it might be paid.

jacquesm · on Dec 19, 2023

> they'll probably be able to offer it free

Why do you way they will 'probably' do that? Do you have any information to back that up or is this your speculation?

charlie0 · on Dec 19, 2023

It's because no one can see Google doing this without gimping it to prevent it from cannibalizing ads that gives OpenAI staying power.

jimsimmons · on Dec 19, 2023

The Gemini white paper reports higher scores on HumanEval and other tasks.

So one of Google lied, this eval has bugs, they borked the deployment is true

we_love_idf · on Dec 19, 2023

Most likely Google has lied. AI playing video games and board games don't translate to real world applications. Many people fail to see that.

zktruth · on Dec 19, 2023

In what respect is generating text a better predictor of real world applicability than the ability to achieve goals in a complex simulated environment containing other agents?

visarga · on Dec 19, 2023

It's not one or the other. We need both supervised pre-training and reinforcement learning. The first part represents past human experiences encoded as language. They can bring a model to human level on most tasks, but not make it smarter.

The second approach, with RL, is based on immediate feedback and could make a model smarter than us. Just think of AlphaZero or AlphaTensor. But this requires deploying a wide search over possible solutions and using a mechanism to rank or filter the bad ideas out (code execution, running a simulation or a game, optimizing some metric)

So models need both past experience and new experience to advance. They can use organic text initially, but later need to develop their own training examples. The feedback they get will be on topic, both with the human user and with the model mistakes. That's very valuable. Feedback learning is what could make LLMs finally graduate from mediocre results.

DeepMind is saying they are using both, and feedback learning is dialed up.

yayr · on Dec 19, 2023

The context in simulated environments of games is far less complex than the real world. Also the available interactions far less. It would be different if the agent would be exposed to the real world and use multisensory data to predict the next "token", i.e. thought or action.

peyton · on Dec 19, 2023

People pay money for it.