> I'd be much more excited to see what level of coding and reasoning performance can be wrung out of a much smaller LLM + agent
Well, I think you are seeing that already? It's not like these models don't exist or nobody tried to make them good; it's just that the results are not super great.
And why would they be? Why would the good models (that are barely okay at coding) be big, if it was currently possible to build good models, that are small?
Of course, new ideas will be found and this dynamic may drastically change in the future, but there is no reason to assume that people who work on small models find great optimizations that frontier model makers, who are very interested in efficient models, have not considered already.
Sure, but that's the point ... today's locally runnable models are a long way behind SOTA capability, so it'd be nice to see more research and experimentation in that direction. Maybe a zoo of highly specialized small models + agents for S/W development - one for planning, one for coding, etc?
If I understand transformers properly, this is unlikely to work. The whole point of “Large” Language Models is that you primarily make them better by making them larger, and when you do so, they get better at both general and specific tasks (so there isn’t a way to sacrifice generality but keep specific skills when training a small model).
I know a lot of people want this (Apple really really wants this and is pouring money into it) but just because we want something doesn’t mean it will happen, especially if it goes against the main idea behind the current AI wave.
I’d love to be wrong about this, but I’m pretty sure this is at least mostly right.
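For a rough sense of why "bigger" is the main lever, the published scaling-law fits have this shape. The toy sketch below uses approximately the Chinchilla (Hoffmann et al. 2022) constants; treat the absolute numbers as illustrative, the curve's shape is the point.

```python
# Rough Chinchilla-style loss fit: L(N, D) = E + A/N^a + B/D^b
# Constants are approximately the published fit; only the trend matters here.
E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**a + B / n_tokens**b

for n in (1e9, 7e9, 70e9, 700e9):
    print(f"{n/1e9:>4.0f}B params, 2T tokens -> predicted loss {predicted_loss(n, 2e12):.3f}")
# Loss keeps falling as N grows even with the data held fixed, which is the
# "you primarily make them better by making them larger" point above.
```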
I think this is a description of how things are today, but not an inherent property of how the models are built. Over the last year or so the trend seems to be moving from “more data” to “better data”. And I think in most narrow domains (which, to be clear, general coding agent is not!) it’s possible to train a smaller, specialized model reaching the performance of a much larger generic model.
Actually there are ways you might get on device models to perform well. It is all about finding ways to have a smaller number of weights work efficiently.
One way is reusing weights across multiple decoder layers. This works and is used in many on-device models.
It is likely that we can get pretty high performance with this method. You can also combine it with low-parameter ways of creating overlapped behaviour on the same weights; people have done LoRA on top of shared weights.
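To make the weight-sharing idea concrete, here is a minimal PyTorch-style sketch (all sizes and names are made up for illustration): one transformer block reused at every depth, with a small per-depth LoRA adapter so the shared weights can behave differently at each position in the stack.

```python
import torch
import torch.nn as nn

class LoRA(nn.Module):
    """Low-rank adapter: adds a cheap B(A(x)) correction on top of shared weights."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.A = nn.Linear(dim, rank, bias=False)
        self.B = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.B.weight)  # starts as a no-op

    def forward(self, x):
        return self.B(self.A(x))

class WeightSharedStack(nn.Module):
    """One transformer block reused at every depth, plus a distinct LoRA
    adapter per depth so the shared weights behave differently each time."""
    def __init__(self, dim=512, n_heads=8, depth=12, rank=8):
        super().__init__()
        # stand-in for a decoder block (causal masking omitted for brevity)
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)
        self.adapters = nn.ModuleList([LoRA(dim, rank) for _ in range(depth)])

    def forward(self, x):
        for adapter in self.adapters:
            x = self.shared_block(x)  # same weights at every depth
            x = x + adapter(x)        # per-depth "overloaded" behaviour
        return x

x = torch.randn(2, 16, 512)          # (batch, seq_len, dim)
print(WeightSharedStack()(x).shape)  # torch.Size([2, 16, 512])
```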
Personally I think there are a lot of potential ways that you can cause the same weights to exhibit "overloaded" behaviour in multiple places in the same decoder stack.
Edit: I believe this method is used a bit for models targeted at phones. I don't think we have seen significant work from people targeting, say, a 3090/4090 or a similar amount of inference compute.
The issue isn't even 'quality' per se (for many tasks a small model would do fine); it's that for "agentic" workflows it _quickly_ runs out of context. Even 32GB of VRAM is really very limiting.
And by agentic I mean something even as simple as 'book a table from my emails', which involves looking at 5k+ tokens of emails, 5k tokens of search results, then confirming with the user, etc. It's just not feasible on most hardware right now - even if the model weights are 1-2GB, you'll burn through the rest on context very quickly.
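Some back-of-the-envelope arithmetic on why the context, not the weights, eats the VRAM (all the model dimensions below are illustrative assumptions, not measurements of any particular model):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size: K and V tensors for every layer, for every cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Hypothetical small model: ~1.5B params quantized to ~1.5 GB of weights,
# 28 layers, 16 KV heads of dim 128, fp16 cache, no GQA or cache quantization.
for tokens in (8_000, 32_000, 128_000):
    print(f"{tokens:>7} tokens -> ~{kv_cache_gb(28, 16, 128, tokens):.1f} GB of KV cache")
# 8k ~1.8 GB, 32k ~7.3 GB, 128k ~29 GB: the agentic context, not the weights,
# is what fills a 32 GB card.
```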
Yeah - the whole business model of companies like OpenAI and Anthropic, at least at the moment, seems to be that the models are so big that you need to run them in the cloud with metered access. Maybe that could change to a sale or annual-licence business model in the future if running locally became possible.
I think scale helps for general tasks where the breadth of capability may be needed, but it's not so clear that this is needed for narrow verticals, especially something like coding (knowing how to fix car engines, or distinguish 100 breeds of dog, is not of much use!).
> the whole business model of companies like OpenAI and Anthropic, at least at the moment, seems to be that the models are so big that you need to run them in the cloud with metered access.
That's not a business model choice, though. That's a reality of running SOTA models.
If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves. It would cut their datacenter spend dramatically.
> If OpenAI or Anthropic could squeeze the same output out of smaller GPUs and servers they'd be doing it for themselves.
First, they do this; that's why they release models at different price points. It's also why GPT-5 tries auto-routing requests to the most cost-effective model.
Second, be careful when considering the incentives of these companies. They all act as if they're in an existential race to deliver 'the' best model; that winner-take-all framing is what justifies their collective trillion-dollar-ish valuations. In that race, delivering 97% of the performance at 10% of the cost is a distraction.
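On the first point, the routing idea is in spirit something like the toy sketch below; the model names and heuristics are entirely hypothetical placeholders, since the real routers are learned classifiers rather than keyword checks.

```python
def route(request: str) -> str:
    """Toy cost-aware router: cheap model by default, big model when the
    request looks hard. The model names are placeholders, not real model IDs."""
    hard_signals = ("prove", "refactor", "multi-step", "debug", "architecture")
    long_input = len(request.split()) > 400
    if long_input or any(s in request.lower() for s in hard_signals):
        return "big-expensive-model"
    return "small-cheap-model"

print(route("What's the capital of France?"))          # small-cheap-model
print(route("Debug this race condition in my queue"))  # big-expensive-model
```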
No, I don't think it's a business model thing; I'm saying it may be a technical limitation of LLMs themselves. Like, that there's no way to "order a la carte" from the training process: you either get the buffet or nothing, no matter how hungry you feel.
> today's locally runnable models are a long way behind SOTA capability
SOTA models are larger than what can be run locally, though.
Obviously we'd all like to see smaller models perform better, but there's no reason to believe that there's a hidden secret to making small, locally-runnable models perform at the same level as Claude and OpenAI SOTA models. If there was, Anthropic and OpenAI would be doing it.
There's research happening and progress being made at every model size.
The point is still valid. If the big companies could save money running multiple small specialised models on cheap hardware, they wouldn't be spending billions on the highest spec GPUs.
You want more research on small language models? You're confused. There is already WAY more research done on small language models (SLMs) than on big ones. Why? Because it's easy. It only takes a moderate workstation to train an SLM. So every curious Masters student and motivated undergrad is doing this. Lots of PhD research is done on SLMs because the hardware to train big models is stupidly expensive, even for many well-funded research labs. If you read arXiv papers (not just the flashy ones published by companies with PR budgets), most of the research is done on 7B parameter models. Heck, some NeurIPS papers (extremely competitive and prestigious) from _this year_ are being done on 1.5B parameter models.
Lack of research is not the problem. It's fundamental limitations of the technology. I'm not gonna say "there's only so much smarts you can cram into a 7B parameter model" - because we don't know that yet for sure. But we do know, without a sliver of a doubt, that it's VASTLY EASIER to cram smarts into a 70B parameter model than a 7B param model.
It's not clear if the ultimate SLMs will come from teams with less computing resources directly building them, or from teams with more resources performing ablation studies etc on larger models to see what can be removed.
I wouldn't care to guess what the limit is, but Karpathy was suggesting in his Dwarkesh interview that maybe AGI could be a 1B parameter model if reasoning is separated (to the extent possible) from knowledge, which can be external.
I'm really more interested in coding models specifically rather than general-purpose ones, where it does seem that a HUGE part of the training data for a frontier model has no applicability.
That’s backwards. New research and ideas are proven on small models. Lots and lots of ideas are tested that way. Good ideas get scaled up to show they still work on medium sized models. The very best ideas make their way into the code for the next huge training runs, which can cost tens or hundreds of millions of dollars.
Not to nitpick words, but ablation is the practice of stripping out features of an algorithm or technique to see which parts matter and how much. This is standard (good) practice on any innovation, regardless of size.
Distillation is taking power / capability / knowledge from a big model and trying to preserve it in something smaller. This also happens all the time, and we see very clearly that small models aren’t as clever as big ones. Small models distilled from big ones might be somewhat smarter than small models trained on their own. But not much. Mostly people like distillation because it’s easier than carefully optimizing the training for a small model. And you’ll never break new ground on absolute capabilities this way.
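For concreteness, the core of vanilla logit distillation looks roughly like this (a minimal sketch; real pipelines add sequence-level and intermediate-layer losses, and the shapes here are toy values):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of (a) KL between softened teacher and student distributions
    and (b) ordinary cross-entropy on the ground-truth next token."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# toy shapes: a batch of 4 positions over a 100-token vocabulary
student_logits = torch.randn(4, 100, requires_grad=True)
teacher_logits = torch.randn(4, 100)   # from the frozen big model
labels = torch.randint(0, 100, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```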
> Not to nitpick words, but ablation is the practice of stripping out features of an algorithm ...
Ablation generally refers to removing parts of a system to see how it performs without them. In the context of an LLM it can refer to training data as well as the model itself. I'm not saying it'd be the most cost-effective method, but one could certainly try to create a small coding model by starting with a large one that performs well, and seeing what can be stripped out of the training data (obviously a lot!) without impacting the performance.
ML researchers will sometimes vary the size of the training data set to see what happens. It’s not common - except in scaling law research. But it’s never called “ablation”.
I have spent the last 2.5 years living like a monk to maintain an app across all paid LLM providers and llama.cpp.
I wish this was true.
It isn't.
"In algorithms, we have space vs time tradeoffs, therefore a small LLM can get there with more time" is the same sort of "not even wrong" we all smile about us HNers doing when we try applying SWE-thought to subjects that aren't CS.
What you're suggesting amounts to "monkeys on typewriters will eventually write the complete works of Shakespeare". Neither in practice nor in theory is this a technical claim, or something observable, or something that has ever been stood up even as a one-off misleading demo.
If "not even wrong" is more wrong than wrong, then is 'not even right" more right than right.
To answer you directly: given more time, a smaller SOTA reasoning model with a table of facts can rederive relationships that a bigger model encoded implicitly.
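A toy, non-LLM version of that trade: chain the stored facts at query time instead of memorizing every derived relation up front. The fact table and relation names below are made up purely for illustration.

```python
FACTS = {("alice", "parent_of", "bob"),
         ("bob", "parent_of", "carol"),
         ("carol", "parent_of", "dave")}

# Derived relations are not stored; they cost extra hops (compute) at query time.
HOPS = {"parent_of": 1, "grandparent_of": 2, "great_grandparent_of": 3}

def derive(subject: str, relation: str) -> set[str]:
    """Chain 'parent_of' edges at query time instead of memorizing every
    derived relation up front: fewer stored facts, more steps per query."""
    frontier = {subject}
    for _ in range(HOPS[relation]):
        frontier = {o for s, r, o in FACTS if r == "parent_of" and s in frontier}
    return frontier

print(derive("alice", "great_grandparent_of"))  # {'dave'}: more steps, same answer
```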
> In LLMs, we will have bigger weights vs test-time compute tradeoffs. A smaller model can get "there" but it will take longer.
Assuming both are SOTA, a smaller model can't produce the same results as a larger model by giving it infinite time. Larger models inherently have more room for training more information into the model.
No amount of test-retry cycling can overcome those limits. The smaller models will just go in circles.
I even get the larger hosted models stuck chasing their own tail and going in circles all the time.
It's true that to train more information into the model you need more trainable parameters, but when people ask for small models, they usually mean models that run at acceptable speeds on their hardware. Techniques like mixture-of-experts allow increasing the number of trainable parameters without requiring more FLOPs, so they're large in one sense but small in another.
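A toy illustration of that "large in parameters, small in FLOPs" property (sizes are arbitrary): each token is routed to only k of the experts, so the parameter count grows with the number of experts while the per-token compute does not.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Toy mixture-of-experts FFN: many parameter sets, but each token only
    pays the FLOPs of the k experts it is routed to."""
    def __init__(self, dim=256, hidden=1024, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                    # x: (tokens, dim)
        weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):           # only k experts run per token
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
n_params = sum(p.numel() for p in moe.parameters())
print(f"{n_params/1e6:.1f}M parameters, but each token touches only 2 of 8 experts")
print(moe(torch.randn(16, 256)).shape)       # torch.Size([16, 256])
```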
And you don't necessarily need to train all information into the model, you can also use tool calls to inject it into the context. A small model that can make lots of tool calls and process the resulting large context could obtain the same answer that a larger model would pull directly out of its weights.
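A minimal sketch of that loop; the model, the tool, and the message format here are all hypothetical stand-ins rather than any particular vendor's API:

```python
import json

def lookup_docs(query: str) -> str:
    """Hypothetical tool: fetch facts the small model was never trained on."""
    return json.dumps({"query": query, "result": "…retrieved text…"})

TOOLS = {"lookup_docs": lookup_docs}

def run_agent(small_model, user_message, max_steps=10):
    """Loop: let the model either call a tool or answer. The knowledge lives
    in the tool results pushed into context, not in the model's weights."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = small_model(messages)          # returns a dict (assumed shape)
        if reply.get("tool_call"):
            name, args = reply["tool_call"]["name"], reply["tool_call"]["args"]
            messages.append({"role": "tool", "name": name, "content": TOOLS[name](**args)})
        else:
            return reply["content"]            # final answer
    return "gave up after max_steps"

def fake_small_model(messages):
    """Stand-in model: first asks for a lookup, then answers from the result."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "lookup_docs",
                              "args": {"query": "restaurant from emails"}}}
    return {"content": "Booked based on: " + messages[-1]["content"]}

print(run_agent(fake_small_model, "book a table from my emails"))
```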
Almost all training data is on the internet. As long as the small model has enough agentic browsing ability, given enough time it will retrieve the data from the internet.
This doesn't work like that. An analogy would be giving a 5-year-old a task that requires an 18-year-old's understanding of the world. It doesn't matter whether you give that child 5 minutes or 10 hours, they won't be capable of solving it.
I think the question of what can be achieved with a small model comes down to what needs knowledge vs what needs experience. A small model can use tools like RAG if it is just missing knowledge, but it seems hard to avoid training/parameters where experience is needed - knowing how to perceive then act.
There is obviously also some amount (maybe a lot) of core knowledge and capability needed even to be able to ask the right questions and utilize the answers.
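A bare-bones illustration of the "missing knowledge" half, with pure-Python keyword scoring standing in for a real embedding index; the documents and prompt format are made up, and the point is the prompt assembly rather than the scoring:

```python
from collections import Counter
import math

DOCS = [
    "The payments service retries failed webhooks three times with backoff.",
    "Dogs come in hundreds of breeds; the poodle is one of them.",
    "Car engines need their oil changed roughly every 10,000 km.",
]

def score(query: str, doc: str) -> float:
    """Toy relevance score: cosine similarity over raw word counts."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[w] * d[w] for w in q)
    return dot / (math.sqrt(sum(v * v for v in q.values())) *
                  math.sqrt(sum(v * v for v in d.values())) or 1.0)

def build_prompt(question: str, k: int = 1) -> str:
    """Retrieve the k most relevant snippets and inject them into the prompt."""
    top = sorted(DOCS, key=lambda doc: score(question, doc), reverse=True)[:k]
    return f"Use only this context:\n" + "\n".join(top) + f"\n\nQuestion: {question}"

print(build_prompt("How many times does the payments service retry webhooks?"))
```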
Small models handle simple, low-context tasks correctly most of the time. But for more complex tasks they fail, due to insufficient training capacity and too few parameters to integrate the necessary relationships.
> Why would the good models (that are barely okay at coding) be big, if it was currently possible to build good models, that are small?
Because nobody has tried yet using recent developments.
> but there is no reason to assume that people who work on small models find great optimizations that frontier model makers, who are very interested in efficient models, have not considered already.
Sure there is: they can iterate faster on small model architectures, try more tweaks, and train more models. Maybe the larger companies "considered it", but a) they are more risk-averse due to the cost of training their large models, and b) that doesn't mean their conclusions about a particular consideration are right; empirical data decides in the end.