Falcon 2 (tii.ae)
225 points by tosh on May 13, 2024 | 108 comments


Their benchmark results seem roughly on par with Mistral 7B and Llama 3 8B, which hardly seems that great given the increase in model size.

https://huggingface.co/tiiuae/falcon-11B

https://huggingface.co/meta-llama/Meta-Llama-3-8B

https://mistral.ai/news/announcing-mistral-7b/


Exactly. Falcon-180b had a lot of hype at first but the community soon realized it was nearly worthless. Easily outperformed by smaller LLMs in the general case.

Now they are back and claiming their falcon-11b LLM outperforms Llama 3 8b. I already see a number of issues with this:

- falcon-11b is like 40% larger than Llama 3 8b so how can you compare them when they aren't in the same size class

- their claim seems to be based on automated benchmarks when it has long been clear that automated benchmarks are not enough to make that claim

- some of their automated benchmarks are wildly lower than Llama 3 8b's scores. It only beats Llama 3 8b on one benchmark, and just barely. I can make an LLM that does the best anyone has ever seen on one benchmark, but that doesn't mean my LLM is good. Far from it

- clickbait headline with knowingly premature claims because there has been zero human evaluation testing

- they claim their LLM is better than Llama 3 but completely ignore Llama 3 70b

Honestly, it annoys me how much attention tiiuae get when they haven't produced anything useful and continue this misleading clickbait.


Seems to be the case with all their models - really huge in size, no actual performance gains for the effort.

Their refined web dataset is heavily censored so maybe that has something to do with it. It’s very morally conservative - total exclusion of pornography and other topics.

So I’d not be surprised if some of the issues are they are just filtering out too much content and adding more of the same instead.


What? The Falcon-7B base model is one of the few small models that'll happily write a whole coherent fanfic all the way to the end without getting stuck in a loop right before the explicit content.

Ignore instruct tunes.


Maybe those aren't the right metrics to compare.

True, the model is bigger, but it required fewer tokens than Llama 3 to train. The issue is that without open datasets it's hard to really compare and replicate. Is it because of the model's architecture? Dataset quality? Model size? A mixture of those? Something else?


> True, the model is bigger, but it required fewer tokens than Llama 3 to train.

That…doesn’t matter to users. Users care about what it can do, and what it requires for them to use it, not what it took for you to make it.

Sure, if it has better performance relative to training set size, that’s interesting from a scientific perspective and for learning how to train models, maybe, if it scales the same as other models in that regard. But ultimately, for use, until you get to a model that does better absolutely, or does better relative to models with the same resource demands, you aren’t offering an advantage.


I understand, I'm just glad for the possible implications for future models: less expensive to make => less expensive to iterate. MoEs are cheaper to train. My favorite right now is Wizard 8x22b, so as a random user, I don't really care about this model. Will probably never run it as-is. But it makes me hope for a Falcon-MoE.

Also, the fact that it's less dense than Llama 3 means there may be more room for LoRA fine-tuning, at a lower cost than Llama 3 requires while sacrificing far less of its smarts. That may be my use case.


The license is not good: https://falconllm-staging.tii.ae/falcon-2-terms-and-conditio...

It's a modified Apache 2 license with extra clauses that include a requirement to abide by their acceptable use policy, hosted here: https://falconllm-staging.tii.ae/falcon-2-acceptable-use-pol...

But... that modified Apache 2 license says the following:

"The Acceptable Use Policy may be updated from time to time. You should monitor the web address at which the Acceptable Use Policy is hosted to ensure that your use of the Work or any Derivative Work complies with the updated Acceptable Use Policy."

So no matter what you think of their current AUP they reserve the right to update it to anything they like in the future, and you'll have to abide by the new one!

Great example of why I don't like the trend of calling licenses like this "open source" when they aren't compatible with the OSI definition.


So basically you can never use this for anything non-trivial because they can deny your use-case at any time without even notifying you.


Are licenses that can be altered after the fact even legal? Feels like more companies would use them if they were...


Yes and No. If the full text of the license is "You must contact us for written approval before doing anything with this product" then yes that can be enforced.

But a document like this, which basically has a bunch of words followed by a "Just kidding" line, is not enforceable, because that clause contradicts the previous language. A judge would throw the whole thing out because it doesn't meet the standard of a contract.


Which matters because a trained LLM is very likely not protected by copyright.

While for a copyrighted work the default is that you can’t use it unless you have a valid license, for an LLM the default is that you can use it unless you have signed a contract restricting what you can do in return for some consideration.

I don’t think these contracts are designed to be enforced because an attempt to enforce it would reveal it to be hot air, they are just there to scare people into compliance.

Better to download LLMs from unofficial sources to avoid attempted contract shenanigans.


Do you inspect the files manually or with a tool if you download from unofficial sources as a verification?


A large company I work for declined to buy licenses from a supplier that had this clause. From my understanding it's not really legal - or at worst it's a gray area - but the language is just too risky if you're not looking to be the test case in court.


Exactly. It’s not about whether it’s enforceable, it’s about the fact that a sketchy license indicates that they might not always operate in good faith.


Why wouldn't they be? The starting presumption for a new work is that you have no permission to use it at all.


It unfortunately isn't worded so that it only affects new usage. If you needed to check once before your initial use, that would be shady but unquestionably legal, as the terms of the contract are clear when you are entering into the agreement.

This clause allows them to arbitrarily change the contract with you at will, with no notice. That _shouldn't_ be enforceable but AFAIK that kind of contract has never been tested. It is _likely_ unenforceable though.


Why shouldn't that be enforceable? "Permission to use this software may be withdrawn at any time" seems like it would be an enforceable clause; why would this be less so?


How about this: you use my service once now, and in the future I might decide that the cost of that usage should be double?


Then you stop using the service.


I suppose a better question would be "how drastic can the change to the license be?", because by adding that term you're basically superseding every other term in the license. How do licenses deal with contradictory terms, if that's even possible?


The license is explicit that it can be updated unilaterally. Nobody can adopt this software and claim not to know that's a possibility. There are attorneys specializing in open source licenses who comment on HN regularly, and maybe they'll surprise us, but, as a non-lawyer nerding out on this stuff I put all my chips down on "this is legally fine".


The two reasons I think it's not that black and white are:

1. This brings up the question of being able to agree to a contract before the contract is written which makes no sense.

2. If it's legal, then why don't all companies do it? Instead, companies like Google regularly put out updated terms of service which you have to agree to before continuing to use their service. Oftentimes you don't realize it because it's just another checkbox or button to click before signing in.


There are all sorts of contracts that can be unilaterally terminated.


Judges don't enforce bad contracts.

Imagine if a landlord sued a tenant after a year for the cost of a new roof because the contract stated that "the tenant will be responsible for all repairs", but the tenant pointed out that the contract also said "the house is in perfect and new condition".

Both things cannot be true, so the judge throws it out.

Same thing here: you can't grant someone a license to use something and then immediately say, "You can't use this without checking with us first." It's contradictory.


How will they know you're using it?


Some models may have watermarks

Though it’s easy to work around if you know it’s there


> So no matter what you think of their current AUP they reserve the right to update it to anything they like in the future, and you'll have to abide by the new one!

I'm so curious if this would actually hold up in court. Does anyone know if there's any case law / precedent around this?


Of course, projects change their licences all the time; why wouldn't it be legal? There's a long history of startups that started with open source/open core gradually closing off or commercialising the licence. This isn't anything new at all.

This is why it's good to read licenses before adopting the tech, especially if it's at all core to your business/project.


No. Projects sometimes stop offering the previous license and start using a different one for new work.

But if your project is Apache-2, you cannot take away someone's license after the fact. You can only stop giving away new Apache-2 licenses from that point on.

The difference here is that the license itself has mystery terms that can change at any time. That is very much not done all the time.


Releasing a new version with a new license is not the same as retroactively changing the license terms for copies already distributed of existing versions.


Projects change licenses for new code going forward. The old code remains available under the previous license (and sometimes the new one too). Here, they are able to change the conditions for existing weights.


Not the first time they did some license shenanigans (happened with Falcon 1). I applaud their efforts but it seems they are still trying to figure out if/how to monetize.


I doubt the Emiratis have much interest in monetisation. The value they’re probably looking for is in LLMs as a media asset.

Just like Al Jazeera is valuable for Qataris and sports are valuable for the Saudis. These assets create goodwill domestically and with international stakeholders. Sometimes they make money, but that’s secondary.

If people spend a few hours a day talking to LLMs there’s some media value there.

They may also fear that Western models would be censored or licensed in a way harmful to UAE security and cultural objectives. Imagine if Llama 4’s license prevented military use without approval by some American agency.


They’ve got Saudi oil money behind them no?

Falcon always struck me more as a regional prestige project rather than “how to monetise”


The Saudis are on a clock. It's got decades on it, but the person who witnesses the end of their oil has already been born.

It may be a prestige project today but make no mistake there's a long game behind it.


Emirati oil money


The 40B model appears to be pure Apache though


That is Falcon 1, not Falcon 2.

Falcon 1 is entirely obsolete at this point, based on every benchmark I've seen.


Well, that really sucks.

Thanks But No Thanks.


It's probably unenforceable


When it is backed by the UAE the muscle you have to contend with is not simply legal muscle, it also includes armed muscle of questionable moral fibre (see support for the RSF).


Do you have any examples of the UAE using its military force to make foreign companies abide by a contract?


they're not going to send their army after you, but that's not what this means.

friendly middle-eastern countries negotiate all kinds of concessions from western governments in exchange for allowing military operations to stage in their country, and "we want you to enforce our IP laws" is an easy one for western governments to grant.


Maybe, but would you risk trying?


How is changing a software licence unenforceable?


scolson said it best:

You can retroactively make a license more open, but you cannot retroactively make it more closed.

https://news.ycombinator.com/item?id=10672751


And even then you aren't retroactively making it more open. You are just now offering an additional, more open license as well as the existing one.

You haven't taken the original license away, you just provided a better default option.

The same weirdly enough goes in reverse as well. You can provide a more restrictive license retroactively even if the rights holders don't consent as long as the existing license is compatible with the new, more restrictive license. i.e. you can promote a work from Apache-2.0 to GPL-3.0-or-later as the former is fully compatible with the latter. However you can't stop existing users from using it as Apache-2.0, you can only stop offering it yourself with that license (but anyone who has an existing Apache-2.0 copy or who is an original rights holder can freely distribute it).


Didn’t Oracle try with OpenSolaris?


> Great example of why I don't like the trend of calling licenses like this "open source" when they aren't compatible with the OSI definition.

Open source was always a way to weasel about terms; that's why it's open source and not free.


More the other way around.

Just because it's free doesn't mean you can change anything or get the source.

Some claim something is open source when it's merely free of charge.


> Just because it's free doesn't mean you can change anything or get the source.

Free as in speech, not beer.

> Some claim something is open source but it's just for free.

Gratis, not free.


> New Falcon 2 11B Outperforms Meta’s Llama 3 8B, and Performs on par with leading Google Gemma 7B Model

I was strongly under the impression that Llama 3 8B outperformed Gemma 7B on almost all metrics.


Keep in mind that this is a comparison of base models, not chat tuned models, since Falcon-11B does not have a chat tuned model at this time. The chat tuning that Meta did seems better than the chat tuning on Gemma.

Regardless, the Gemma 1.1 chat models have been fairly good in my experience, even if I think the Llama3 8B chat model is definitely better.

CodeGemma 1.1 7B is especially underrated based on my testing against other relevant coding models. The CodeGemma 7B base model is one of the best models I’ve tested for code completion, and the chat model is one of the best I’ve tested for writing code. Some other models seem to game the benchmarks better, but in real-world use they don’t hold up as well as CodeGemma for me. I look forward to seeing how CodeLlama3 does, but it doesn't exist yet.
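
To make "code completion" concrete: base/completion models like CodeGemma are usually driven with fill-in-the-middle (FIM) prompting rather than chat turns. A minimal sketch with Hugging Face transformers is below; the google/codegemma-7b checkpoint name and the FIM control tokens are written from memory of the model card, so treat them as assumptions and verify before relying on this.

  from transformers import AutoModelForCausalLM, AutoTokenizer

  # Assumed checkpoint name and FIM tokens; verify against the CodeGemma model card.
  model_id = "google/codegemma-7b"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

  # Fill-in-the-middle: the model generates the code that belongs between
  # the prefix and the suffix (here, presumably something like `total = sum(xs)`).
  prompt = (
      "<|fim_prefix|>def mean(xs):\n    "
      "<|fim_suffix|>\n    return total / len(xs)<|fim_middle|>"
  )
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
  output = model.generate(**inputs, max_new_tokens=32)
  completion = output[0][inputs["input_ids"].shape[1]:]
  print(tokenizer.decode(completion, skip_special_tokens=True))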


The model type is a good point. It's hard to track all the variables in this very fast paced field.

Thank you for sharing your CodeGemma experience. I haven't found an Emacs setup I'm satisfied with using a local LLM, but it will surely happen one day. Surely.


For me, CodeGemma is super slow. I'd say 3-4 times slower than Llama3. I am also looking forward to CodeLlama3, but I have a feeling Meta can't improve on Llama3. Was there anything official from Meta?


CodeGemma has fewer parameters than Llama3, so it absolutely should not be slower. That sounds like a configuration issue.

Meta originally released Llama2 and CodeLlama, and CodeLlama vastly improved on Llama2 for coding tasks. Llama3-8B is okay at coding, but I think CodeGemma-1.1-7b-it is significantly better than Llama3-8B-Instruct, and possibly a little better than Llama3-70B-Instruct, so there is plenty of room for Meta to improve Llama3 in that regard.

> Was there anything official from Meta?

https://ai.meta.com/blog/meta-llama-3/

"The text-based models we are releasing today are the first in the Llama 3 collection of models."

Just a hint that they will be releasing more models in the same family, and CodeLlama3 seems like a given to me.


I suppose it could be a quantization issue, but both are done by lmstudio-community. Llama3 does have a different architecture and a bigger tokenizer, which might explain it.


You should try ollama and see what happens. On the same hardware, with the same q8_0 quantization on both models, I'm seeing 77 tokens/s with Llama3-8B and 72 tokens/s with CodeGemma-7B, which is a very surprising result to me, but they are still very similar in performance.
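
For reference, this is roughly how I measure it: a minimal sketch against a local ollama server, assuming the /api/generate endpoint still returns eval_count and eval_duration (nanoseconds) when streaming is off, and with hypothetical model tags you'd need to pull first.

  import requests

  def tokens_per_second(model: str, prompt: str) -> float:
      """Ask a local ollama server for a completion and compute generation speed.

      Assumes the /api/generate response includes eval_count and
      eval_duration (in nanoseconds) when streaming is disabled."""
      resp = requests.post(
          "http://localhost:11434/api/generate",
          json={"model": model, "prompt": prompt, "stream": False},
          timeout=600,
      )
      resp.raise_for_status()
      data = resp.json()
      return data["eval_count"] / (data["eval_duration"] / 1e9)

  # Hypothetical model tags; pull them first (e.g. `ollama pull llama3:8b-instruct-q8_0`).
  for tag in ["llama3:8b-instruct-q8_0", "codegemma:7b-code-q8_0"]:
      print(tag, round(tokens_per_second(tag, "Write a haiku about benchmarks."), 1), "tok/s")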


You're right, ollama does perform the same on both models. Thanks.


Anecdotal, I know, but in my experience Gemma is absolutely worthless and Llama 3 8b is exceptionally good for its size. The idea that Gemma is ahead of Llama 3 is bizarre to me. Surely there’s some contamination or something if Gemma is showing up ahead in some benchmarks‽


Adding more anecdata, but this has been exactly my experience as well. I haven't dug into details about the benchmarks, but just trying to use the things for basic question asking, Llama 3 is so much better it's like comparing my Milwaukee drill to my son's Fisher-Price plastic toy drill.


Yeah, Llama3 is also smoking Mistral/Mixtral. It’s my new model of choice.


I found that curious too.

I don’t stay up on the benchmarks much these days though; I’ve fully dedicated myself to b-ball.

I’m actually a bit better than Lebron btw, who is nowhere near as good as my 3 year old daughter. I occasionally beat her. At basketball.


Sigh, I thought this was going to be about Spectrum Holobyte’s Falcon AT. From MyAbandonware.com:

> Essentially Falcon 2 but somehow marketed differently, Falcon AT is the second release in Spectrum Holobyte's revolutionary hard-core flight sim Falcon series. Despite popular belief that Falcon 3.0 was THE dawn of modern flight sims, Falcon AT actually is already a huge leap over Falcon, sporting sharp EGA graphics, and a lot of realistic options and greatly expanded campaigns. The game is still the simulation of modern air combat, complete with excellent tutorials, varied missions, and accurate flight dynamics that Falcon fans have come to know and love. Among its host of innovations is the amazingly playable multiplayer options -- including hotseat and over the modem. Largely forgotten now, Falcon AT serves to explain the otherwise inexplicable gap between Falcon and Falcon 3.0.


There seems to be a trend of people naming new things (perhaps unintentionally) after classic computer games. We just had a post here on a system called Loom which apparently isn't the classic adventure game. I'm half expecting someone to come up with an LLM or piece of networking software and name it Zork.


It doesn't help any that currently "F-16 Strike Eagle II reverse engineering" <https://news.ycombinator.com/item?id=40347662> is also on the front page, "priming" one to think similarly


"only AI Model with Vision-to-Language Capabilities"

What do they mean by this? Isn't this roughly what GPT-4 Vision and LLaVA do?


At first I thought they were playing some semantic game.

Something like LLaVA being a language-to-vision model, but I can't steelman the idea in a way that makes sense.

Maybe they're just lying?


And all Claude models…


And Gemini.


I welcome open models, although the Falcon model is not super open, as noted here. I will say that the original Falcon did not perform as well as its benchmark stats indicated -- it was pushed out as a significant leap forward, and I didn't find it outperformed competitive open models at release.

The PR stating an 11B model outperforms 7B and 8B models 'in the same class' feels like it might be stretching a bit. We'll see -- I'll definitely give this a go for local inference. But, my gut is that finetuned llama 3 8B is probably best in class...this week.


> I will say that the original Falcon did not perform as well as its benchmark stats indicated

Yeah, I saw that as well. I believe it was undertrained in terms of tokens relative to parameters because they really just wanted to have a 40B-parameter model (pre-Chinchilla-optimal thinking).


It's hard to know if there's any special sauce here, but the internet so far has decided "meh" on these models. I think it's an interesting choice to put it out as tech competitive. Stats say this one was trained on 5T tokens. For reference, Llama 3 so far was reported at 15T.

There is no way you get back what you lost in training by adding 3B parameters.
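
To put those two figures side by side, a trivial back-of-the-envelope ratio (using only the token counts quoted above; rough orders of magnitude, not a benchmark):

  # Tokens-per-parameter ratios from the training-token figures quoted above.
  models = {
      "Falcon 2 11B": (5e12, 11e9),   # ~5T tokens, 11B parameters
      "Llama 3 8B": (15e12, 8e9),     # ~15T tokens (reported), 8B parameters
  }
  for name, (tokens, params) in models.items():
      print(f"{name}: ~{tokens / params:,.0f} tokens per parameter")
  # Falcon 2 11B: ~455 tokens per parameter
  # Llama 3 8B: ~1,875 tokens per parameter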

If I were in charge of UAE PR and this project, I'd

a) buy a lot more H100s and get the training budget up

b) compete on a regional / messaging / national freedom angle

c) fully open license it

I guess I'm saying I'd copy Zuck's plan, with oil money instead of social money and play to my base.

Overstating capabilities doesn't give you a lot of benefit outside of a local market, unfortunately.


These reminders that AI will not only be wielded by democracies with (at least partial attempts at) ethical oversight, but also by the worst of the worst autocrats, are truly chilling.


>but also by the worst of the worst autocrats

MBZ (note MBZ is not MBS; Saudi Arabia and the UAE are two different countries!) is one of the most popular leaders in the world, and his people are among the wealthiest. His country is one of the few developed countries in the world where the economy is still growing steadily, and one of the safest countries in the world outside of East Asia, in spite of having one of the world's most liberal immigration policies. Much more a contender for the best of the best autocrats than the worst of the worst.


I want to understand something: the model was trained on mostly a public dataset(?), with hardware from AWS, using well-known algorithms and techniques. How is it different from other models that anyone that has the money can train?

My skeptic/hater(?) mentality sees this as only a "flex" and an effort to try to be seen as relevant. Is there more to this kind of effort that I'm not seeing?


A lot of models are in this category. Sovereignty (whether national or corporate) has some value. And the threat of competition is a good thing for everyone. I'm glad people are working on these even if the end result in most cases isn't anything particularly interesting.


For a moment, I thought this might be related to the classic flight sim:

https://en.wikipedia.org/wiki/Falcon_4.0


Also, SpaceX has the Falcon 1 and Falcon 9 rockets, as well as the proposed but never developed Falcon 5.


Absurdly biased article, c'mon UAE, be more subtle! “Beats Llama 3” is a dubiously helpful summary, and this is just baffling:

  and is only AI Model with Vision-to-Language Capabilities


> Outperforming Meta’s New Llama 3

I know it's hard to objectively rank LLMs, but those are really ridiculous ways to keep track of performance.

If my reference for performance is (like for the vast majority of users) ChatGPT-3.5, I first have to know how Llama 3 compares to that in order to understand how these new models compare to what I'm using at the moment.

Now, if I look for the performance of Llama 3 compared to ChatGPT-3.5, I don't find it on the official launch page https://ai.meta.com/blog/meta-llama-3/ where it is compared to Gemma 7B it, Mistral 7B Instruct, Gemini Pro 1.5 and Claude 3 Sonnet.

How does Gemma 7B perform? Well you can only find out how it compares to Llama 2 on the official launch page https://blog.google/technology/developers/gemma-open-models/.

Let's look at the Llama 2 performance on its launch announcement: https://llama.meta.com/llama2/ No GPT-3.5 turbo again.

I get that there are multiple aspects and that there's probably not one overall "performance" metric across all tasks, and I get that you can probably find a comparative between two specific models relatively easily, but there absolutely needs to be a standard by which those performances are communicated. The number of hoops to jump through is ridiculous.


Human preference data from side by side, anonymous comparisons of models: https://leaderboard.lmsys.org/

Llama3 8B significantly outperforms ChatGPT-3.5, and Llama3 70B is significantly better than that. These are Elo ratings, so it would not be accurate to say X is 10% better than Y just because its score is 10% higher.
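
For intuition on why the absolute numbers don't compare like percentages, here is a minimal sketch of the standard Elo expected-score formula (the ratings below are made up for illustration, not the leaderboard's actual values):

  def elo_win_probability(rating_a: float, rating_b: float) -> float:
      """Expected probability that A is preferred over B under the standard
      Elo logistic model (base 10, scale 400)."""
      return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

  # Illustrative ratings only: a 100-point gap corresponds to roughly a 64%
  # head-to-head preference rate regardless of the absolute scores, which is
  # why a 10% higher rating does not mean "10% better".
  print(elo_win_probability(1150, 1050))  # ~0.64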

Obviously Falcon 2 is too new to be on the leaderboard yet.

Honestly, I don't think anybody should be using ChatGPT-3.5 as a chatbot at this point. Google and Meta both offer free chatbots that are significantly better than ChatGPT-3.5, among other options.


Human preference does not always favor the model that is best at reasoning/code/accuracy whatever. In particular there's a recent article suggesting that Llama 3's friendly and direct chattiness contributes to it having a good standing in the leaderboard.

https://lmsys.org/blog/2024-05-08-llama3/


Sure, that’s why I called it out as human preference data. But I still think the leaderboard is one of the best ways to compare models that we currently have.

If you know of better benchmark-based leaderboards where the data hasn’t polluted the training datasets, I’d love to see them, but just giving up on everything isn’t a good option.

The leaderboard is a good starting point to find models worth testing, which can then be painstakingly tested for a particular use case.


Oh, I didn't mean that. I think it's the best benchmark, just that it's not necessarily representative of the ordering in any domain apart from generic human preference. So while Llama3 is high up there, we should not conclude, for example, that it is better at reasoning than all models below it (especially true for the 8B model).


I find that kind of surprising; the lack of “customer service voice” is one of the main reasons I prefer the Mistral models over Open AI’s, even if the latter are somewhat better at complex/specific tasks.


> Honestly, I don't think anybody should be using ChatGPT-3.5 as a chatbot at this point. Google and Meta both offer free chatbots that are significantly better than ChatGPT-3.5, among other options.

Yet I guarantee you that ChatGPT-3.5 has 95% of the "direct to consumer" marketshare.

Unless you're a technical user, you haven't even heard about any alternative, let alone used them.

Now onto the ranking, I perfectly recognized in my original comment that those comparisons exist, just that they're not highlighted properly in any launch announcement of any new model.

I haven't used Llama, only ChatGPT and the multiple versions of Claude 2 and 3. How am I supposed to know if this Falcon 2 thing is even worth looking at beyond the first paragraph if I have to compare it to a specific model that I haven't used before?


> Unless you're a technical user, you haven't even heard about any alternative, let alone used them.

> How am I supposed to know if this Falcon 2 thing is even worth looking at beyond the first paragraph if I have to compare it to a specific model that I haven't used before?

You're not. These press releases are for the "technical users" that have heard of and used all of these alternatives.

They are not offering a Falcon 2 chat service you can use today. They aren't even offering a chat-tuned Falcon 2 model. The Falcon 2 model in question is a base model, not a chat model.

Unless someone is very technical, Falcon 2 is not relevant to them in any way at this point. This is a forum of technical people, which is why it's getting some attention, but I suspect it's still not going to be relevant to most people here.


There's https://chat.lmsys.org/?leaderboard

Not a __full__ list, but big enough to have some reference.


Yeah that's my point. Say in the title that it ranks #X on the leaderboards, not that it's "better" than some cherry-picked model.



"When tested against several prominent AI models in its class among pre-trained models, Falcon 2 11B surpasses the performance of Meta’s newly launched Llama 3 with 8 billion parameters (8B), and performs on par with Google’s Gemma 7B at first place, with a difference of only 0.01 average performance (Falcon 2 11B: 64.28 vs Gemma 7B: 64.29) according to the evaluation from Hugging Face."

From: https://falconllm.tii.ae/falcon-2.html


So Falcon 2 with 11B params outperforms Llama 3 8B? With more parameters, that isn’t a fair comparison. The strongest open-source model seems to be Llama 3 70B; why claim to outperform Llama 3 when you didn’t outperform the best model?


"With the release of Falcon 2 11B, we've introduced the first model in the Falcon 2 series."

https://lifearchitect.ai/models-table/


First headline in bold: "Next-Gen Falcon 2 Series launches [...] and is only AI Model with Vision-to-Language Capabilities" ...


At the speed things are moving, it feels like we'll get a GPT-4-level "small" model really soon.


I guess that explains why OpenAI are rushing to make their models free despite having paying users. They don't want to lose market share to local LLMs just yet.


The time is coming when you'll be able to select from hundreds of models, downloaded on demand if not already present, running inference fully locally and offline, even on your phone. I'd guess we'll be there by 2030 at the latest.

I am not so up to date on the hardware landscape, but I don't think the smart people working on it will fail to notice the need.


Cloud models will always have an edge over local models. Maybe in 2030 your iPhone will run GPT-4 locally, but cloud GPT-9 will solve all your kids' homework, do 95% of your job, and manage your household.


I'm pretty sure models are at their end game already. Now the law of diminishing returns is at play.


Transformers still haven't hit their scaling limits with respect to the number of tokens trained on and the number of layers. Model size limits are now purely financial ($100M per training run still seems excessive).


That is true. And those financial limits might change, however.


It would be interesting to hear more about the compute used to build this.


From model card:

Falcon2-11B was trained on 1024 A100 40GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=8, PP=1, DP=128) combined with ZeRO and Flash-Attention 2.

Doesn't say how long though.
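
As a quick sanity check on that parallelism layout (nothing beyond the arithmetic implied by the model card):

  # Tensor x pipeline x data parallelism should multiply out to the GPU count.
  tp, pp, dp = 8, 1, 128
  assert tp * pp * dp == 1024
  print(f"{tp} (TP) x {pp} (PP) x {dp} (DP) = {tp * pp * dp} A100s")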


It does say how long on Huggingface:

> The model training took roughly two months.


I am a bit disappointed that it isn't about a new, small rocket from SpaceX with two first-stage engines.


11b model outperforms 8b model, news at 11



