
Their benchmark results seem roughly on par with Mistral 7B and Llama 3 8B, which hardly seems that great given the increase in model size.

https://huggingface.co/tiiuae/falcon-11B

https://huggingface.co/meta-llama/Meta-Llama-3-8B

https://mistral.ai/news/announcing-mistral-7b/



Exactly. Falcon-180b had a lot of hype at first but the community soon realized it was nearly worthless. Easily outperformed by smaller LLMs in the general case.

Now they are back and claiming their falcon-11b LLM outperforms Llama 3 8b. I already see a number of issues with this:

- falcon-11b is roughly 40% larger than Llama 3 8b, so how can you compare them when they aren't in the same size class?

- their claim seems to be based on automated benchmarks when it has long been clear that automated benchmarks are not enough to make that claim

- some of their automated benchmark scores are wildly lower than Llama 3 8b's. It only beats Llama 3 8b on one benchmark, and just barely. I can make an LLM that does the best anyone has ever seen on one benchmark, but that doesn't mean my LLM is good. Far from it

- clickbait headline with knowingly premature claims because there has been zero human evaluation testing

- they claim their LLM is better than Llama 3 but completely ignore Llama 3 70b

Honestly, it annoys me how much attention tiiuae gets when they haven't produced anything useful and keep putting out this misleading clickbait.


Seems to be the case with all their models - really huge in size, no actual performance gains for the effort.

Their RefinedWeb dataset is heavily censored, so maybe that has something to do with it. It's very morally conservative - pornography and other topics are excluded entirely.

So I wouldn't be surprised if part of the issue is that they're filtering out too much content and adding more of the same instead.


What? The Falcon-7B base model is one of the few small models that'll happily write a whole coherent fanfic all the way to the end without getting stuck in a loop right before the explicit content.

Ignore instruct tunes.
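
For anyone who wants to check that themselves, here's a minimal sketch of running the base model with Hugging Face transformers; the prompt and sampling settings are placeholders, not recommendations.

    # Minimal sketch: free-form continuation with the Falcon-7B *base* model,
    # not an instruct tune. Prompt and sampling settings are arbitrary.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "tiiuae/falcon-7b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    prompt = "Chapter 1\n\nThe rain had not let up for three days when"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # A base model just continues the text; sampling with a mild repetition
    # penalty helps it keep going instead of collapsing into a loop.
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))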


Maybe those aren't the right metrics to compare.

True, the model is bigger, but it required fewer tokens than Llama 3 to train. The issue is that without open datasets it's hard to really compare and replicate. Is it because of the model's architecture? Dataset quality? Model size? A mixture of those? Something else?
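
Rough numbers, using the usual C ≈ 6·N·D approximation and publicly reported token counts (treat both figures as assumptions rather than official specs):

    # Back-of-envelope pretraining compute with the common C ≈ 6 * N * D
    # approximation (N = parameters, D = training tokens). Token counts are
    # assumptions based on public reporting (~5.5T for falcon-11B, ~15T for
    # Llama 3 8B), not official specs.
    models = {
        "falcon-11b": {"params": 11e9, "tokens": 5.5e12},
        "llama-3-8b": {"params": 8e9, "tokens": 15e12},
    }

    for name, m in models.items():
        flops = 6 * m["params"] * m["tokens"]
        tokens_per_param = m["tokens"] / m["params"]
        print(f"{name}: ~{flops:.1e} FLOPs, ~{tokens_per_param:.0f} tokens/param")

On those assumptions falcon-11b took roughly half the pretraining compute of Llama 3 8b, which is exactly the kind of thing that's hard to attribute without open datasets.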


> True, the model is bigger, but it required fewer tokens than Llama 3 to train.

That… doesn't matter to users. Users care about what it can do and what it takes for them to run it, not what it took for you to make it.

Sure, better performance relative to training set size is interesting from a scientific perspective and for learning how to train models, assuming it scales the same way as other models in that regard. But ultimately, for actual use, until you have a model that does better in absolute terms, or better relative to models with the same resource demands, you aren't offering an advantage.


I understand, I'm just glad for the possible implications for future models: less expensive to make => less expensive to iterate. MoEs are cheaper to train. My favorite right now is Wizard 8x22b, so as a random user I don't really care about this model and will probably never run it as-is. But it makes me hope for a Falcon-MoE.

Also, the fact that it's less dense than Llama 3 means there may be more room for LoRA fine-tuning, at a lower cost than Llama 3 would require and while sacrificing far less of its smarts. That may be my use case.
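
If anyone wants to go that route, a minimal LoRA setup with peft looks roughly like this; the target module name is a guess based on Falcon's usual fused attention projection, so inspect the checkpoint's actual module names first.

    # Minimal LoRA setup sketch with Hugging Face peft. target_modules is an
    # assumption (Falcon checkpoints usually expose a fused "query_key_value"
    # projection); print(model) and verify before relying on it.
    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained(
        "tiiuae/falcon-11B",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["query_key_value"],  # assumption: check the real module names
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the adapter weights are trainable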



