Exactly. Falcon-180b had a lot of hype at first but the community soon realized it was nearly worthless. Easily outperformed by smaller LLMs in the general case.
Now they are back and claiming their falcon-11b LLM outperforms Llama 3 8b. I already see a number of issues with this:
- falcon-11b is like 40% larger than Llama 3 8b, so how can you compare them when they aren't in the same size class?
- their claim seems to be based on automated benchmarks, when it has long been clear that automated benchmarks alone are not enough to support that claim
- some of its automated benchmark scores are wildly lower than Llama 3 8b's. It only beats Llama 3 8b on one benchmark, and just barely. I can make an LLM that does the best anyone has ever seen on one benchmark, but that doesn't mean my LLM is good. Far from it
- clickbait headline with knowingly premature claims because there has been zero human evaluation testing
- they claim their LLM is better than Llama 3 but completely ignore Llama 3 70b
Honestly, it annoys me how much attention tiiuae get when they haven't produced anything useful and continue this misleading clickbait.
Seems to be the case with all their models - really huge in size, no actual performance gains for the effort.
Their RefinedWeb dataset is heavily censored, so maybe that has something to do with it. It's very morally conservative - total exclusion of pornography and other topics.
So I'd not be surprised if part of the issue is that they're just filtering out too much content and adding more of the same instead.
What? The Falcon-7B base model is one of the few small models that'll happily write a whole coherent fanfic all the way to the end without getting stuck in a loop right before the explicit content.
True, the model is bigger, but required less tokens than Llama 3 to train. The issue is when there's no open datasets, it's hard to really compare and replicate. Is it because of the model's architecture? Dataset quality? Model size? A mixture of those? Something else?
> True, the model is bigger, but required less tokens than Llama 3 to train.
That…doesn't matter to users. Users care what it can do, and what it requires for them to use it, not what it took for you to make it.
Sure, if it has better performance relative to training set size, that's interesting from a scientific perspective and for learning how to train models, maybe, if it scales the same as other models in that regard. But ultimately, for use, until you get to a model that does better absolutely, or does better relative to models with the same resource demands, you aren't offering an advantage.
I understand, I'm just glad for the possible implications for future models: less expensive to make => less expensive to iterate. MoE are cheaper to train. My favorite right now is Wizard 8x22b, so as a random user, I don't really care about this model. Will probably never run it as-is. But makes me hope for a Falcon-MoE.
Also, the fact that it's less dense than Llama 3 means there may be more room for LoRA fine-tuning, and at a lesser cost than required for Llama 3, while sacrificing way less of its smarts. That may be my use.
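If it ever comes to that, a minimal LoRA pass with Hugging Face's peft library would look roughly like the sketch below. The target module name is an assumption based on Falcon's fused attention projection, so verify it against the loaded model before relying on it.

    # Hedged sketch: LoRA fine-tuning setup for Falcon-11B with peft.
    # "query_key_value" is assumed from the Falcon attention layer naming;
    # check the model's actual module names before trusting it.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-11B")
    lora_config = LoraConfig(
        r=16,                                # low-rank adapter dimension
        lora_alpha=32,                       # scaling applied to the adapters
        target_modules=["query_key_value"],  # assumed fused attention projection
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the adapter weights are trainable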
But... that modified Apache 2 license says the following:
"The Acceptable Use Policy may be updated from time to time. You should monitor the web address at which the Acceptable Use Policy is hosted to ensure that your use of the Work or any Derivative Work complies with the updated Acceptable Use Policy."
So no matter what you think of their current AUP they reserve the right to update it to anything they like in the future, and you'll have to abide by the new one!
Great example of why I don't like the trend of calling licenses like this "open source" when they aren't compatible with the OSI definition.
Yes and No. If the full text of the license is "You must contact us for written approval before doing anything with this product" then yes that can be enforced.
But a document like this, which basically has a bunch of words followed by a "Just kidding" line, is not enforceable, because it contradicts the previous language. A judge would throw the whole thing out because it doesn't meet the standard of a contract.
Which matters because a trained LLM very likely isn't protected by copyright.
While for a copyrighted work the default is that you can’t use it unless you have a valid license, for an LLM the default is that you can use it unless you have signed a contract restricting what you can do in return for some consideration.
I don’t think these contracts are designed to be enforced because an attempt to enforce it would reveal it to be hot air, they are just there to scare people into compliance.
Better to download LLMs from unofficial sources to avoid attempted contract shenanigans.
A large company I work for declined to buy licenses from a supplier that had this clause. From my understanding it's not really legal - or at worst it's a gray area - but the language is just too risky if you're not looking to be the test case in court.
Exactly. It’s not about whether it’s enforceable, it’s about the fact that a sketchy license indicates that they might not always operate in good faith.
Unfortunately, it isn't worded so that it only affects new usage. If you needed to check once before your initial use, that would be shady but unquestionably legal, since the terms of the contract are clear when you enter into the agreement.
This clause allows them to arbitrarily change the contract with you at will, with no notice. That _shouldn't_ be enforceable but AFAIK that kind of contract has never been tested. It is _likely_ unenforceable though.
Why shouldn't that be enforceable? "Permission to use this software may be withdrawn at any time" seems like it would be an enforceable clause; why would this be less so?
I suppose a better question would be "how drastic can the change be to the license?" because by adding that term, you're basically superseding every other term on the license. How do licenses deal with contradictory terms if that's even possible?
The license is explicit that it can be updated unilaterally. Nobody can adopt this software and claim not to know that's a possibility. There are attorneys specializing in open source licenses who comment on HN regularly, and maybe they'll surprise us, but, as a non-lawyer nerding out on this stuff I put all my chips down on "this is legally fine".
The two reasons I think it's not that black and white are:
1. This brings up the question of being able to agree to a contract before the contract is written, which makes no sense.
2. If it's legal, then why don't all companies do it? Instead, companies like Google regularly put out updated terms of service which you have to agree to before continuing to use their service. Oftentimes you don't realize it because it's just another checkbox or button to click before signing in.
Imagine if a landlord sued a tenant after a year for a new roof on the apartment because the contract stated that "the tenant will be responsible for all repairs", but the tenant pointed out that the contract also said "the house is in perfect and new condition".
Both things cannot be true, so the judge throws it out.
Same thing here: you can't grant someone a license to use something and then immediately say "you can't use this without checking with us first". It's contradictory.
> So no matter what you think of their current AUP they reserve the right to update it to anything they like in the future, and you'll have to abide by the new one!
I'm so curious whether this would actually hold up in court. Does anyone know if there's any case law / precedent around this?
Of course, projects change their licences all the time, why wouldn't it be legal? There's a long history of startups who started with open source/open core gradually closing off or commercialising the licence. This isn't anything new at all.
This is why it's good to read licenses before adopting the tech, especially if it's at all core to your business/project.
No. Projects sometimes stop offering the previous license and start using a different one for new work.
But if your project is Apache-2, you cannot take away someone's license after the fact. You can only stop giving away new Apache-2 licenses from that point on.
The difference here is that the license itself has mystery terms that can change at any time. That is very much not done all the time.
Releasing a new version with a new license is not the same as retroactively changing the license terms for copies already distributed of existing versions.
Projects change license for new code going forward. The old code remains available under the previous license (and sometimes new). Here, they are able to change the conditions for existing weights.
Not the first time they did some license shenanigans (happened with Falcon 1). I applaud their efforts but it seems they are still trying to figure out if/how to monetize.
I doubt the Emiratis have much interest in monetisation. The value they’re probably looking for is in LLMs as a media asset.
Just like Al Jazeera is valuable for Qataris and sports are valuable for the Saudis. These assets create goodwill domestically and with international stakeholders. Sometimes they make money, but that’s secondary.
If people spend a few hours a day talking to LLMs there’s some media value there.
They may also fear that Western models would be censored or licensed in a way harmful to UAE security and cultural objectives. Imagine if Llama 4’s license prevented military use without approval by some American agency.
When it is backed by the UAE the muscle you have to contend with is not simply legal muscle, it also includes armed muscle of questionable moral fibre (see support for the RSF).
they're not going to send their army after you, but that's not what this means.
friendly middle-eastern countries negotiate all kinds of concessions from western governments in exchange for allowing military operations to stage in their country, and "we want you to enforce our IP laws" is an easy one for western governments to grant.
And even then you aren't retroactively making it more open. You are just now offering an additional, more open license as well as the existing one.
You haven't taken the original license away, you just provided a better default option.
The same weirdly enough goes in reverse as well. You can provide a more restrictive license retroactively even if the rights holders don't consent as long as the existing license is compatible with the new, more restrictive license. i.e. you can promote a work from Apache-2.0 to GPL-3.0-or-later as the former is fully compatible with the latter. However you can't stop existing users from using it as Apache-2.0, you can only stop offering it yourself with that license (but anyone who has an existing Apache-2.0 copy or who is an original rights holder can freely distribute it).
Keep in mind that this is a comparison of base models, not chat tuned models, since Falcon-11B does not have a chat tuned model at this time. The chat tuning that Meta did seems better than the chat tuning on Gemma.
Regardless, the Gemma 1.1 chat models have been fairly good in my experience, even if I think the Llama3 8B chat model is definitely better.
CodeGemma 1.1 7B is especially underrated based on my testing against other relevant coding models. The CodeGemma 7B base model is one of the best models I've tested for code completion, and the chat model is one of the best models I've tested for writing code. Some other models seem to game the benchmarks better, but in real-world use they don't hold up as well as CodeGemma for me. I look forward to seeing how CodeLlama3 does, but it doesn't exist yet.
The model type is a good point. It's hard to track all the variables in this very fast paced field.
Thank you for sharing your CodeGemma experience. I haven't found an Emacs setup I'm satisfied with for using a local LLM, but it will surely happen one day. Surely.
for me, CodeGemma is super slow. I'd say 3-4 times slower than llama3.
I am also looking forward to CodeLlama3, but I have a feeling Meta can't improve much on Llama3 with it. Was there anything official from Meta?
CodeGemma has fewer parameters than Llama3, so it absolutely should not be slower. That sounds like a configuration issue.
Meta originally released Llama2 and CodeLlama, and CodeLlama vastly improved on Llama2 for coding tasks. Llama3-8B is okay at coding, but I think CodeGemma-1.1-7b-it is significantly better than Llama3-8B-Instruct, and possibly a little better than Llama3-70B-Instruct, so there is plenty of room for Meta to improve Llama3 in that regard.
I suppose it could be a quantization issue, but both are done by lmstudio-community. Llama3 does have a different architecture and a bigger tokenizer, which might explain it.
You should try ollama and see what happens. On the same hardware, with the same q8_0 quantization on both models, I'm seeing 77 tokens/s with Llama3-8B and 72 tokens/s with CodeGemma-7B, which is a very surprising result to me, but they are still very similar in performance.
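If anyone wants to reproduce that comparison, ollama's local REST API reports eval counts and durations that can be turned into tokens/s. A rough sketch (the model tags are assumptions; use whatever "ollama list" shows on your machine):

    # Rough sketch: measure generation speed via ollama's local HTTP API.
    import requests

    def tokens_per_second(model: str, prompt: str) -> float:
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
        ).json()
        # eval_count = generated tokens, eval_duration = nanoseconds spent generating
        return resp["eval_count"] / (resp["eval_duration"] / 1e9)

    for tag in ["llama3:8b", "codegemma:7b"]:  # assumed tags, adjust to your pulls
        speed = tokens_per_second(tag, "Write a binary search in Python.")
        print(tag, round(speed, 1), "tok/s")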
Anecdotal, I know, but in my experience Gemma is absolutely worthless and Llama 3 8b is exceptionally good for its size. The idea that Gemma is ahead of Llama 3 is bizarre to me. Surely there’s some contamination or something if Gemma is showing up ahead in some benchmarks‽
Adding more anecdata, but this has been exactly my experience as well. I haven't dug into details about the benchmarks, but just trying to use the things for basic question asking, Llama 3 is so much better that it's like comparing my Milwaukee drill to my son's Fisher-Price plastic toy drill.
Sigh, I thought this was going to be about Spectrum Holobyte’s Falcon AT. From MyAbandonware.com:
> Essentially Falcon 2 but somehow marketed differently, Falcon AT is the second release in Spectrum Holobyte's revolutionary hard-core flight sim Falcon series. Despite popular belief that Falcon 3.0 was THE dawn of modern flight sims, Falcon AT actually is already a huge leap over Falcon, sporting sharp EGA graphics, and a lot of realistic options and greatly expanded campaigns. The game is still the simulation of modern air combat, complete with excellent tutorials, varied missions, and accurate flight dynamics that Falcon fans have come to know and love. Among its host of innovations is the amazingly playable multiplayer options -- including hotseat and over the modem. Largely forgotten now, Falcon AT serves to explain the otherwise inexplicable gap between Falcon and Falcon 3.0.
There seems to be a trend of people naming new things (perhaps unintentionally) after classic computer games. We just had a post here on a system called Loom which apparently isn't the classic adventure game. I'm half expecting someone to come up with an LLM or piece of networking software and name it Zork.
It doesn't help any that currently "F-16 Strike Eagle II reverse engineering" <https://news.ycombinator.com/item?id=40347662> is also on the front page, "priming" one to think similarly
I welcome open models, although the Falcon model is not super open, as noted here. I will say that the original Falcon did not perform as well as its benchmark stats indicated -- it was pushed out as a significant leap forward, and I didn't find it outperformed competitive open models at release.
The PR stating an 11B model outperforms 7B and 8B models 'in the same class' feels like it might be stretching a bit. We'll see -- I'll definitely give this a go for local inference. But, my gut is that finetuned llama 3 8B is probably best in class...this week.
> I will say that the original Falcon did not perform as well as its benchmark stats indicated
Yeah, I saw that as well. I believe it was undertrained relative to its parameter count because they really just wanted to have a 40B parameter model (like pre-Chinchilla-optimal).
It's hard to know if there's any special sauce here, but the internet so far has decided "meh" on these models. I think it's an interesting choice to put it out as tech competitive. Stats say this one was trained on 5T tokens. For reference, Llama 3 so far was reported at 15T.
There is no way you get back what you lose in training tokens by adding 3B parameters.
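As a back-of-the-envelope check, using the common ~6 * params * tokens FLOPs approximation for training compute (a rough heuristic; 5T and 15T are just the publicly reported token counts):

    # Rough heuristic: training compute ~ 6 * parameters * tokens (FLOPs).
    def train_flops(params: float, tokens: float) -> float:
        return 6 * params * tokens

    falcon2_11b = train_flops(11e9, 5e12)   # ~3.3e23 FLOPs
    llama3_8b   = train_flops(8e9, 15e12)   # ~7.2e23 FLOPs
    print(f"Llama 3 8B used roughly {llama3_8b / falcon2_11b:.1f}x the training compute")  # ~2.2x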
If I were in charge of UAE PR and this project, I'd
a) buy a lot more H100s and get the training budget up
b) compete on a regional / messaging / national freedom angle
c) fully open license it
I guess I'm saying I'd copy Zuck's plan, with oil money instead of social money and play to my base.
Overstating capabilities doesn't give you a lot of benefit out of a local market, unfortunately.
These reminders that AI will not only be wielded by democracies with (at least partial attempts at) ethical oversight, but also by the worst of the worst autocrats, are truly chilling.
MBZ (note MBZ is not MBS; Saudi Arabia and the UAE are two different countries!) is one of the most popular leaders in the world and his people are among the wealthiest. His country is one of the few developed countries in the world where the economy is still growing steadily, and one of the safest countries in the world outside of East Asia, in spite of having one of the world's most liberal immigration policies. Much more a contender for the best of the best autocrats than the worst of the worst.
I want to understand something: the model was trained mostly on a public dataset(?), with hardware from AWS, using well-known algorithms and techniques. How is it different from other models that anyone with the money can train?
My skeptic/hater(?) mentality sees this as only a "flex" and an effort to try to be seen as relevant. Is there more to this kind of effort that I'm not seeing?
A lot of models are in this category. Sovereignty (whether national or corporate) has some value. And the threat of competition is a good thing for everyone. I'm glad people are working on these even if the end result in most cases isn't anything particularly interesting.
I know it's hard to objectively rank LLMs, but those are really ridiculous ways to keep track of performance.
If my reference for performance is (like the vast majority of users) ChatGPT-3.5, I first have to know how Llama 3 compares to that in order to understand how these new models compare to what I'm using at the moment.
Now, if I look for the performance of Llama 3 compared to ChatGPT-3.5, I don't find it on the official launch page https://ai.meta.com/blog/meta-llama-3/ where it is compared to Gemma 7B it, Mistral 7B Instruct, Gemini Pro 1.5 and Claude 3 Sonnet.
Let's look at the Llama 2 performance on its launch announcement: https://llama.meta.com/llama2/ No GPT-3.5 turbo again.
I get that there are multiple aspects and that there's probably not one overall "performance" metric across all tasks, and I get that you can probably find a comparative between two specific models relatively easily, but there absolutely needs to be a standard by which those performances are communicated. The number of hoops to jump through is ridiculous.
Llama3 8B significantly outperforms ChatGPT-3.5, and Llama3 70B is significantly better than that. These are Elo ratings, so it would not be accurate to say X is 10% better than Y because the score is 10% higher.
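For intuition, the standard way to read an Elo gap is as an expected preference rate rather than a percentage improvement. A quick sketch (the 100-point gap is a made-up example, not an actual leaderboard figure):

    # Elo: probability that model A is preferred over model B in a pairwise vote.
    def elo_win_prob(rating_a: float, rating_b: float) -> float:
        return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

    print(round(elo_win_prob(1200, 1100), 2))  # 0.64: a 100-point gap ~= 64% preference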
Obviously Falcon 2 is too new to be on the leaderboard yet.
Honestly, I don't think anybody should be using ChatGPT-3.5 as a chatbot at this point. Google and Meta both offer free chatbots that are significantly better than ChatGPT-3.5, among other options.
Human preference does not always favor the model that is best at reasoning, code, accuracy, or whatever. In particular, there's a recent article suggesting that Llama 3's friendly and direct chattiness contributes to its good standing in the leaderboard.
Sure, that’s why I called it out as human preference data. But I still think the leaderboard is one of the best ways to compare models that we currently have.
If you know of better benchmark-based leaderboards where the data hasn’t polluted the training datasets, I’d love to see them, but just giving up on everything isn’t a good option.
The leaderboard is a good starting point to find models worth testing, which can then be painstakingly tested for a particular use case.
Oh I didn't mean that. I think it's the best benchmark, just it's not necessarily representative of ordering in any domain apart from generic human preference. So while Llama3 is high up there, we should not conclude for example that it is better at reasoning than all models below it (especially true for the 8B model).
I find that kind of surprising; the lack of “customer service voice” is one of the main reasons I prefer the Mistral models over Open AI’s, even if the latter are somewhat better at complex/specific tasks.
> Honestly, I don't think anybody should be using ChatGPT-3.5 as a chatbot at this point. Google and Meta both offer free chatbots that are significantly better than ChatGPT-3.5, among other options.
Yet I guarantee you that ChatGPT-3.5 has 95% of the "direct to consumer" marketshare.
Unless you're a technical user, you haven't even heard about any alternative, let alone used them.
Now onto the ranking, I perfectly recognized in my original comment that those comparisons exist, just that they're not highlighted properly in any launch announcement of any new model.
I haven't used Llama, only ChatGPT and the multiple versions of Claude 2 and 3. How am I supposed to know if this Falcon 2 thing is even worth looking at beyond the first paragraph if I have to compare it to a specific model that I haven't used before?
> Unless you're a technical user, you haven't even heard about any alternative, let alone used them.
> How am I supposed to know if this Falcon 2 thing is even worth looking at beyond the first paragraph if I have to compare it to a specific model that I haven't used before?
You're not. These press releases are for the "technical users" that have heard of and used all of these alternatives.
They are not offering a Falcon 2 chat service you can use today. They aren't even offering a chat-tuned Falcon 2 model. The Falcon 2 model in question is a base model, not a chat model.
Unless someone is very technical, Falcon 2 is not relevant to them in any way at this point. This is a forum of technical people, which is why it's getting some attention, but I suspect it's still not going to be relevant to most people here.
"When tested against several prominent AI models in its class among pre-trained models, Falcon 2 11B surpasses the performance of Meta’s newly launched Llama 3 with 8 billion parameters (8B), and performs on par with Google’s Gemma 7B at first place, with a difference of only 0.01 average performance (Falcon 2 11B: 64.28 vs Gemma 7B: 64.29) according to the evaluation from Hugging Face."
So Falcon 2 with 11B params outperforms Llama 3 8B? With more parameters, that isn't a fair comparison. The strongest open source model seems to be Llama 3 70B, so why claim to outperform Llama 3 when you didn't outperform the best model?
I guess that explains why OpenAI are rushing to make their models free despite having paying users. They don't want to lose market share to local LLMs just yet.
The time is coming when you'll be able to select from hundreds of models, downloaded on demand if not already present, and run inference totally locally and offline, even on your phone. I'd guess we'll be there by 2030 at the latest.
I am not so up to date with the hardware landscape, but I don't think smart people would fail to notice the need.
Cloud models will always have an edge over local models. Maybe in 2030 your iPhone will run GPT-4 locally, but cloud GPT-9 will solve all your kids' homework, do 95% of your job, and manage your household.
Transformers still haven't hit their scaling limits with respect to the number of tokens trained on and the number of layers. Model size limits are now purely financial ($100M per training run still seems excessive).
Falcon2-11B was trained on 1024 A100 40GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=8, PP=1, DP=128) combined with ZeRO and Flash-Attention 2.
https://huggingface.co/tiiuae/falcon-11B
https://huggingface.co/meta-llama/Meta-Llama-3-8B
https://mistral.ai/news/announcing-mistral-7b/
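As a quick sanity check, the parallelism degrees from the falcon-11B model card above do multiply out to the reported GPU count (just arithmetic on the published numbers):

    # 3D parallelism from the falcon-11B model card: TP=8, PP=1, DP=128.
    tensor_parallel   = 8    # each layer's weights split across 8 GPUs
    pipeline_parallel = 1    # no pipeline stages
    data_parallel     = 128  # 128 replicas seeing different data shards
    print(tensor_parallel * pipeline_parallel * data_parallel)  # 1024 GPUs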