Hacker News | coder543's comments

Well, to start with, there is no regular 3B Gemma. There are 2B and 7B Gemma models. I would guess this model is adding an extra 1B parameters to the 2B model to handle visual understanding.

The 2B model is not very smart to begin with, so… I would expect this one to not be very smart either if you only use it for text, but I wouldn’t expect it to be much worse. It could potentially be useful/interesting for simple visual understanding prompts.


Have you tried groq.com? Because I don't think gpt-4o is "incredibly" fast. I've been frustrated at how slow gpt-4-turbo has been lately, and gpt-4o just seems to be "acceptably" fast now, which is a big improvement, but still, not groq-level.


That is Falcon 1, not Falcon 2.

Falcon 1 is entirely obsolete at this point, based on every benchmark I've seen.


Human preference data from side-by-side, anonymous comparisons of models: https://leaderboard.lmsys.org/

Llama3 8B significantly outperforms ChatGPT-3.5, and Llama3 70B is significantly better than that. These are Elo ratings, so it would not be accurate to say X is 10% better than Y just because its score is 10% higher.
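
For the curious, here is a rough sketch of how an Elo gap translates into an expected head-to-head preference rate, using the standard Elo expected-score formula (nothing LMSYS-specific, and the exact ratings below are only illustrative):

    # Expected probability that model A is preferred over model B,
    # given their Elo ratings (standard Elo expected-score formula).
    def expected_win_rate(elo_a: float, elo_b: float) -> float:
        return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))

    # A 100-point Elo gap is roughly a 64% preference rate,
    # which is why "10% higher score" does not mean "10% better".
    print(expected_win_rate(1200, 1100))  # ~0.64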

Obviously Falcon 2 is too new to be on the leaderboard yet.

Honestly, I don't think anybody should be using ChatGPT-3.5 as a chatbot at this point. Google and Meta both offer free chatbots that are significantly better than ChatGPT-3.5, among other options.


Human preference does not always favor the model that is best at reasoning, code, accuracy, or whatever. In particular, there's a recent article suggesting that Llama 3's friendly and direct chattiness contributes to its good standing on the leaderboard.

https://lmsys.org/blog/2024-05-08-llama3/


Sure, that’s why I called it out as human preference data. But I still think the leaderboard is one of the best ways to compare models that we currently have.

If you know of better benchmark-based leaderboards where the data hasn’t polluted the training datasets, I’d love to see them, but just giving up on everything isn’t a good option.

The leaderboard is a good starting point to find models worth testing, which can then be painstakingly tested for a particular use case.


Oh, I didn't mean that. I think it's the best benchmark; it's just not necessarily representative of the ordering in any domain apart from generic human preference. So while Llama3 is high up there, we should not conclude, for example, that it is better at reasoning than all the models below it (especially true for the 8B model).


I find that kind of surprising; the lack of “customer service voice” is one of the main reasons I prefer the Mistral models over OpenAI’s, even if the latter are somewhat better at complex/specific tasks.


> Honestly, I don't think anybody should be using ChatGPT-3.5 as a chatbot at this point. Google and Meta both offer free chatbots that are significantly better than ChatGPT-3.5, among other options.

Yet I guarantee you that ChatGPT-3.5 has 95% of the "direct to consumer" market share.

Unless you're a technical user, you haven't even heard about any alternative, let alone used them.

Now, onto the ranking: I recognized in my original comment that those comparisons exist; my point was that they're not highlighted properly in any new model's launch announcement.

I haven't used Llama, only ChatGPT and the multiple versions of Claude 2 and 3. How am I supposed to know if this Falcon 2 thing is even worth looking at beyond the first paragraph if I have to compare it to a specific model that I haven't used before?


> Unless you're a technical user, you haven't even heard about any alternative, let alone used them.

> How am I supposed to know if this Falcon 2 thing is even worth looking at beyond the first paragraph if I have to compare it to a specific model that I haven't used before?

You're not. These press releases are for the "technical users" that have heard of and used all of these alternatives.

They are not offering a Falcon 2 chat service you can use today. They aren't even offering a chat-tuned Falcon 2 model: the Falcon 2 model in question is a base model, not a chat model.

Unless someone is very technical, Falcon 2 is not relevant to them in any way at this point. This is a forum of technical people, which is why it's getting some attention, but I suspect it's still not going to be relevant to most people here.


Keep in mind that this is a comparison of base models, not chat-tuned models, since Falcon-11B does not have a chat-tuned model at this time. The chat tuning that Meta did seems better than the chat tuning on Gemma.

Regardless, the Gemma 1.1 chat models have been fairly good in my experience, even if I think the Llama3 8B chat model is definitely better.

CodeGemma 1.1 7B is especially underrated compared to the other relevant coding models I’ve tested. The CodeGemma 7B base model is one of the best models I’ve tested for code completion, and the chat model is one of the best models I’ve tested for writing code. Some other models seem to game the benchmarks better, but in real-world use they don’t hold up as well as CodeGemma for me. I look forward to seeing how CodeLlama3 does, but it doesn't exist yet.


The model type is a good point. It's hard to track all the variables in this very fast paced field.

Thank you for sharing your CodeGemma experience. I haven't found an Emacs setup I'm satisfied with that uses a local LLM, but it will surely happen one day. Surely.


For me, CodeGemma is super slow. I'd say 3-4 times slower than Llama3. I am also looking forward to CodeLlama3, but I have a feeling Meta can't improve on Llama3. Was there anything official from Meta?


CodeGemma has fewer parameters than Llama3, so it absolutely should not be slower. That sounds like a configuration issue.

Meta originally released Llama2 and CodeLlama, and CodeLlama vastly improved on Llama2 for coding tasks. Llama3-8B is okay at coding, but I think CodeGemma-1.1-7b-it is significantly better than Llama3-8B-Instruct, and possibly a little better than Llama3-70B-Instruct, so there is plenty of room for Meta to improve Llama3 in that regard.

> Was there anything official from Meta?

https://ai.meta.com/blog/meta-llama-3/

"The text-based models we are releasing today are the first in the Llama 3 collection of models."

Just a hint that they will be releasing more models in the same family, and CodeLlama3 seems like a given to me.


I suppose it could be a quantization issue, but both are done by lmstudio-community. Llama3 does have a different architecture and a bigger tokenizer, which might explain it.


You should try ollama and see what happens. On the same hardware, with the same q8_0 quantization on both models, I'm seeing 77 tokens/s with Llama3-8B and 72 tokens/s with CodeGemma-7B, which is a very surprising result to me, but they are still very similar in performance.
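
If anyone wants to reproduce that kind of comparison, here's a minimal sketch against a local ollama server's HTTP API (it assumes ollama is running on the default port and that both models have already been pulled; the model tags are illustrative):

    import requests

    def tokens_per_second(model: str, prompt: str) -> float:
        # Non-streaming generate call; the final JSON includes
        # eval_count (tokens generated) and eval_duration (nanoseconds).
        r = requests.post("http://localhost:11434/api/generate",
                          json={"model": model, "prompt": prompt, "stream": False})
        data = r.json()
        return data["eval_count"] / (data["eval_duration"] / 1e9)

    prompt = "Write a Python function that reverses a linked list."
    for model in ("llama3:8b-instruct-q8_0", "codegemma:7b-instruct-q8_0"):  # illustrative tags
        print(model, f"{tokens_per_second(model, prompt):.1f} tokens/s")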


You're right, ollama does perform the same on both models. Thanks.


This blog post I saw recently might be relevant: https://refact.ai/blog/2024/fine-tuning-on-htmlx-making-web-...


Since the images in the article are from infrared cameras, blue-shifting the light might just land the view from those IR images into the visible spectrum for the observer! Just need to tune the speed correctly.
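
Back-of-the-envelope, assuming the cameras see thermal IR around 10 µm and we want it shifted to ~550 nm green (both wavelengths are just illustrative assumptions), the relativistic Doppler formula gives the required approach speed:

    # Relativistic Doppler shift for a head-on approach:
    #   lambda_obs = lambda_src * sqrt((1 - beta) / (1 + beta))
    # Solving for beta = v/c given source and desired observed wavelengths:
    def beta_needed(lambda_src_m: float, lambda_obs_m: float) -> float:
        r = (lambda_obs_m / lambda_src_m) ** 2
        return (1 - r) / (1 + r)

    # Assumed: 10 um thermal IR blue-shifted to 550 nm visible green.
    beta = beta_needed(10e-6, 550e-9)
    print(f"v ~ {beta:.4f} c")  # roughly 0.994 c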


> As far as I know, the most RAM you can get in a MacBook Pro (which is what you said you're shopping for) is 48G

Huh? They have options up to 128GB…

https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...


An unquantized Qwen1.5-110B model would require some ~220GB of RAM, so 100+GB would not be "enough" for that, unless we put a big emphasis on the "+".

I consider "heavily" quantized to be anything below 4-bit quantization. At 4-bit, you could run a 110B model on around 55GB to 60GB of memory. Right now, Llama-3-70B-Instruct is the highest ranked model you can download[0], and you should be able to fit the 6-bit quantization into 64GB of RAM. Historically, 4-bit quantization represents very little quality loss compared to the full 16-bit models for LLMs, but I have heard rumors that Llama 3 might be so well trained that the quality loss starts to occur earlier, so 6-bit quantization seems like a safe bet for good quality.

If you had 128GB of RAM, you still couldn't run the unquantized 70B model, but you could run the 8-bit quantization in a little over 70GB of RAM. Which could feel unsatisfying, since you would have so much unused RAM sitting around, and Apple charges a shocking amount of money for RAM.
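
As a rough rule of thumb (weights only, ignoring KV cache, activations, and format overhead, so real usage runs somewhat higher), the memory needed is just parameter count times bits per weight:

    # Rough weight-only memory estimate for an LLM at a given quantization level.
    def approx_weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    for bits in (16, 8, 6, 4):
        print(f"110B @ {bits}-bit: ~{approx_weight_memory_gb(110, bits):.0f} GB")
        print(f" 70B @ {bits}-bit: ~{approx_weight_memory_gb(70, bits):.0f} GB")
    # 110B: ~220 / 110 / 82 / 55 GB; 70B: ~140 / 70 / 52 / 35 GB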

[0]: https://leaderboard.lmsys.org/


However, if you want to use the LLM in your workflow, instead of just experimenting with it on its own, you also need RAM to run everything else comfortably.

96GB RAM might be a good compromise for now. 64GB is cutting it close, 128GB leaves more breathing room but is expensive.


Yep, I agree with that.


Phi 3 Q4 spazzes out on some inputs (emits a stream of garbage), while the FP16 version doesn't (at least for the cases I could find). Maybe they just botched the quantization (I have good results with other Q4 models), but it is an interesting data point.


Phi 3 in particular had some issues with the end-of-text token not being handled correctly at launch, as I recall, but I could be remembering incorrectly.


Firstly, I'll say that it's always exciting to see more weight-available models.

However, I don't particularly like that benchmark table. I saw the HumanEval score for Llama 3 70B and immediately said "nope, that's not right". It claims Llama 3 70B scored only 45.7. Llama 3 70B Instruct[0] scored 81.7, not even in the same ballpark.

It turns out that the Qwen team didn't benchmark the chat/instruct versions of the model on virtually any of the benchmarks. Why did they only do those benchmarks for the base models?

It makes it very hard to draw any useful conclusions from this release, since most people would be using the chat-tuned model for the things those base model benchmarks are measuring.

My previous experience with Qwen releases is that the models also have a habit of randomly switching to Chinese for a few words. I wonder if this model is better at responding to English questions with an English response? Maybe we need a benchmark for how well an LLM sticks to responding in the same language as the question, across a range of different languages.

[0]: https://scontent-atl3-1.xx.fbcdn.net/v/t39.2365-6/438037375_...


I'd recommend those looking for local coding models to go for code-specific tunes. See the EvalPlus leaderboard (HumanEval+ and MBPP+): https://evalplus.github.io/leaderboard.html

For those looking for less contamination, the LiveCodeBench leaderboard is also good: https://livecodebench.github.io/leaderboard.html

I did my own testing on the 110B demo and didn't notice any cross-lingual issues (which I've seen with the smaller and past Qwen models). In my testing, while the 110B is significantly better than the 72B, it doesn't punch above its weight (and doesn't perform close to Llama 3 70B Instruct). https://docs.google.com/spreadsheets/d/e/2PACX-1vRxvmb6227Au...


HumanEval is generally a very poor benchmark IMO, and I hate that it's become the default "code" benchmark in any model release. I find it more useful to just look at MMLU as a ballpark of model ability and then vibe-check it myself on code.

source: I'm hacking on a high-performance coding copilot (https://double.bot/) and play with a lot of different models for coding. Also adding Qwen 110B now so I can vibe-check it. :)


Didn't Microsoft use HumanEval as the basis for developing Phi? If so I'd say it works well enough! (At least Phi 3, haven't tested the others much.)

Though their training set is proprietary, it can be leaked by talking with Phi 1_5 about pretty much anything. It just randomly starts outputting the proprietary training data.


HumanEval was developed for Codex, I believe:

https://arxiv.org/abs/2107.03374


I agree HumanEval isn't great, but I've found that it is better than not having anything. Maybe we'll get better benchmarks someday.

What would make "Double" higher performance than any other hosted system?


No, this is different: it is for the base model. That is why I explained in my tweet that we only claim to be comparable in base-model quality; for the instruct model, there is much room to improve, especially on HumanEval.

I admit that the code-switching is a serious problem of ours, because it really affects the user experience for English users, but we find that it is hard for a multilingual model to get rid of this behavior. We'll try to fix it in Qwen2.


> My previous experience with Qwen releases is that the models also have a habit of randomly switching to Chinese for a few words. I wonder if this model is better at responding to English questions with an English response? Maybe we need a benchmark for how well an LLM sticks to responding in the same language as the question, across a range of different languages.

This is trivially resolved with a properly configured sampler/grammar. These LLMs output a probability distribution of likely next tokens, not single tokens. If you're not willing to write your own code, you can get around this issue with llama.cpp, for example, using `--grammar "root ::= [^一-鿿ぁ-ゟァ-ヿ가-힣]*"` which will exclude CJK from sampled output.
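
For anyone who would rather do this from Python than the CLI, something like the following should work (a sketch, assuming a recent llama-cpp-python build and a local GGUF file; the model path is illustrative):

    from llama_cpp import Llama, LlamaGrammar

    # GBNF grammar allowing any characters except the CJK ranges,
    # mirroring the --grammar flag from the CLI example above.
    no_cjk = LlamaGrammar.from_string(r"root ::= [^一-鿿ぁ-ゟァ-ヿ가-힣]*")

    llm = Llama(model_path="qwen1.5-110b-chat-q4_k_m.gguf")  # illustrative path
    out = llm("Answer in English: what is the capital of France?",
              grammar=no_cjk, max_tokens=64)
    print(out["choices"][0]["text"])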


That's funny you mention switching to another language. I recently asked ChatGPT "translate this: <random German sentence>" and it translated the sentence into French, while I was speaking with it in English.


I see the science fiction meme of AI giving sassy, technically correct but useless answers is grounded in truth.


By ChatGPT, do you mean ChatGPT-3.5 or ChatGPT-4? No one should be using ChatGPT-3.5 in an interactive chat session at this point, and I wish OpenAI would recognize that their free ChatGPT-3.5 service seems like it is more harmful to ChatGPT-4 and OpenAI's reputation than it is helpful, just due to how unimpressive ChatGPT-3.5 is compared to the rest of the industry. You're much better off using Google's free Gemini or Meta's Llama-3-powered chat site or just about anything else at this point, if you're unwilling to pay for ChatGPT-4.

I am skeptical that ChatGPT-4 would have done what you described, based on my own extensive experience with it.

