I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
As the comments on reddit said, those numbers don’t make sense.
> I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
That was my first thought as well, but from a quick search it looks like llama.cpp has a default batch size that's quite high (256 or 512, I don't remember exactly, which I find surprising for something that's mostly used by local users), so that shouldn't be the issue.
> As the comments on reddit said, those numbers don’t make sense.
Sure, but that default batch size would only matter if the person in question was actually generating and measuring parallel requests, not just measuring the straight line performance of sequential requests... and I have no confidence they were.
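To make the distinction concrete, here is a rough sketch of the two ways you might measure it. Batching only helps if requests actually arrive concurrently; hammering the server sequentially just measures single-stream latency. The endpoint path and port are assumptions about a stock llama-server exposing its OpenAI-compatible API locally, so adjust them to your setup.

```python
# Sketch: sequential vs. concurrent requests against a local llama-server.
# Assumes its OpenAI-compatible endpoint at http://localhost:8080/v1/completions.
# Sequential requests never give the server anything to batch, no matter what
# its configured batch size is; concurrent requests do.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8080/v1/completions"  # assumption: default llama-server port
PAYLOAD = {"prompt": "Hello", "max_tokens": 64}

def one_request(_):
    return requests.post(URL, json=PAYLOAD, timeout=300).json()

def timed(label, fn):
    start = time.time()
    fn()
    print(f"{label}: {time.time() - start:.1f}s")

# Sequential: measures straight-line, single-stream latency only.
timed("sequential x8", lambda: [one_request(i) for i in range(8)])

# Concurrent: gives the server a chance to batch, measuring aggregate throughput.
with ThreadPoolExecutor(max_workers=8) as pool:
    timed("concurrent x8", lambda: list(pool.map(one_request, range(8))))
```

If the concurrent run isn't dramatically faster in wall-clock time per request, the server isn't really batching, whatever its defaults say.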
> By default, this downloads the main DeepSeek R1 model (which is large). If you’re interested in a specific distilled variant (e.g., 1.5B, 7B, 14B), just specify its tag
No… it downloads the 7B model by default. If you think that is large, then you better hold on to your seat when you try to download the 671B model.
>then you better hold on to your seat when you try to download the 671B model.
I ended up downloading it in case it ever gets removed off the internet for whatever reason. Who knows, if VRAM becomes much cheaper in 10 years I might be able to run it locally without spending a fortune on GPUs!
The most cost-effective way is arguably to run it off of any 1TB SSD (~$55) attached to whatever computer you already have.
I was able to get 1 token every 6 or 7 seconds (approximately 10 words per minute) on a 400GB quant of the model, while using an SSD that benchmarks at a measly 3GB/s or so. The bottleneck is entirely the speed of the SSD at that level, so an SSD that is twice as fast should make the model run about twice as fast.
Of course, each message you send would have approximately a 1 business day turnaround time… so it might not be the most practical.
With a RAID0 array of two PCIe 5.0 SSDs (~14GB/s each, 28GB/s total), you could potentially get things up to an almost tolerable speed. Maybe 1 to 2 tokens per second.
It’s just such an enormous model that your next best option is like $6000 of hardware, as another comment mentioned, and that is probably going to be significantly slower than the two M2 Ultra Mac Studios featured in the current post. It’s a sliding scale of cost versus performance.
This model has about half as many active parameters as Llama3-70B, since it has 37B active parameters, so it’s actually pretty easy to run computationally… but the catch is that you have to be able to access any 37B of those 671B parameters at any time, so you have to find somewhere fast to store the entire model.
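For anyone who wants to check the arithmetic behind those numbers: when the model streams off an SSD, each generated token has to read roughly the active parameters' worth of bytes from disk, so tokens per second is approximately SSD bandwidth divided by the size of the active weights. The byte counts below are assumptions for illustration (37B active parameters at the ~4.8 bits per weight implied by a 400GB quant of 671B parameters).

```python
# Back-of-envelope: tokens/sec when the model weights stream off an SSD.
# Assumption: every token must read roughly the active parameters from disk
# (in practice caching of shared experts makes it slightly better than this).
def tokens_per_second(ssd_gb_per_s, active_params=37e9, bytes_per_param=400e9 / 671e9):
    bytes_per_token = active_params * bytes_per_param  # ~22 GB for a 400GB quant
    return ssd_gb_per_s * 1e9 / bytes_per_token

print(f"{tokens_per_second(3):.2f} tok/s")   # ~0.14, i.e. one token every ~7 s
print(f"{tokens_per_second(28):.2f} tok/s")  # ~1.3 with a RAID0 of two PCIe 5.0 SSDs
```

Those two outputs line up with the measured 1 token every 6 to 7 seconds on a ~3GB/s SSD, and with the 1 to 2 tokens per second estimate for a 28GB/s RAID0 array.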
It’s not even a question. I still have my Pebble Time Round in a box. No other watch has come close. That thing was probably under half the thickness and half the weight of my Apple Watch Ultra and yet it had 3x the battery life!
The physical controls were incredible… I actually controlled my music from that watch, because you could do it without looking. Who has ever controlled their music from an Apple Watch more than once a month? It’s so clunky and cumbersome, and you have to intently stare at what you’re doing — you might as well just reach for your phone.
I can’t believe we’re a decade on, and the smartwatches that exist today are a joke compared to the PTR. The only practical improvement from my Apple Watch Ultra is the fitness tracking / health tracking sensor suite… which is great, but I miss having a good smartwatch, not just a fitness tracker.
I understand you were trying to make “up and to the right” = “best”, but the inverted x-axis really confused me at first. Not a huge fan.
Also, I wonder how you’re calculating costs, because while a 3:1 ratio kind of sort of makes sense for traditional LLMs… it doesn’t really work for “reasoning” models that implicitly use several hundred to several thousand additional output tokens for their reasoning step. It’s almost like a “fixed” overhead, regardless of the input or output size around that reasoning step. (Fixed is in quotes, because some reasoning chains are longer than others.)
I would also argue that token-heavy use cases are dominated by large input/output ratios of like 100:1 or 1000:1 tokens. Token-light use cases are your typical chatbot where the user and model are exchanging roughly equal numbers of tokens… and probably not that many per message.
It’s hard to come up with an optimal formula… one would almost need to offer a dynamic chart where the user can enter their own ratio of input:output, and choose a number for the reasoning token overhead. (Or, select from several predefined options like “chatbot”, “summarization”, “coding assistant”, where those would pre-select some reasonable defaults.)
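To make that concrete, here is a tiny sketch of the kind of calculator I mean. The prices and token counts are placeholders rather than numbers from the chart; the point is just that the reasoning overhead sits on the output side of the bill regardless of which input:output ratio you assume.

```python
# Hypothetical blended-cost calculator; all prices and token counts are placeholders.
def cost_per_request(input_tokens, output_tokens, reasoning_tokens,
                     price_in_per_m, price_out_per_m):
    # Reasoning tokens are typically billed at the output-token rate.
    return (input_tokens * price_in_per_m
            + (output_tokens + reasoning_tokens) * price_out_per_m) / 1e6

presets = {                      # (input, output, reasoning) token counts, made up
    "chatbot":          (300,    300,  1000),
    "summarization":    (20000,  200,  1000),
    "coding assistant": (5000,  1500,  2000),
}
for name, (inp, out, think) in presets.items():
    print(name, round(cost_per_request(inp, out, think, 1.0, 4.0), 5))
```

A dynamic chart would just expose those preset tuples (plus the two prices) as user-editable inputs.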
i mean the sheet is public https://docs.google.com/spreadsheets/d/1x9bQVlm7YJ33HVb3AGb9... go fiddle with it yourself, but you'll soon see most models have approximately the same input:output token cost ratio (roughly 4), and changing the input:output ratio assumption doesn't affect in the slightest what the overall macro chart trends say, because i'm plotting over several OoMs here and your criticisms have an impact of <1 OoM (an input:output token cost ratio of ~4, with variance even lower than that).
actually the 100:1 ratio starts to trend back toward parity now because of the reasoning tokens, so the truth is somewhere between 3:1 and 100:1.
Cursor’s composer agent is so slick. I tried Zed out for a few minutes this evening, and in addition to what you said, I was surprised that there wasn’t an option to bring my own tab completion server.
Zed felt very nice, definitely reminded me of the good old days of Sublime. With literally the two features mentioned above, I think I would switch… but that’s easy to say before it actually happens.
“One”? Wired up how? There is a huge difference between the best and worst. They aren’t fungible. Which one? How long ago? Did it even support FIM (fill in middle), or was it blindly guessing from the left side? Did the plugin even gather appropriate context from related files, or was it only looking at the current file?
If you try Copilot or Cursor today, you can experience what “the best” looks like, which gives you a benchmark to measure smaller, dumber models and plugins against. No, Copilot and Cursor are not available for emacs, as far as I know… but if you want to understand whether a technology is useful, you don’t start with the worst version and judge from that. (Not saying emacs itself is the worst… just that without more context, my assumption is that whatever plugin you encountered was probably using a bottom-tier model, and I doubt the plugin itself was helping that model do its best.)
There are some local code completion models that I think are perfectly fine, but I don’t know where you will draw the line on how good is good enough. If you can prove to yourself that the best models are good enough, then you can try out different local models and see if one of those works for you.
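For what it’s worth, the FIM question is about how the prompt is constructed, not just which model sits behind it. As a hedged illustration (the sentinel tokens below follow the StarCoder convention, which is an assumption; other code models use different sentinels), a left-only prompt versus a fill-in-the-middle prompt look roughly like this:

```python
# Sketch of the difference between left-only completion and FIM prompting.
# Sentinel tokens follow the StarCoder convention; other code models
# (Code Llama, DeepSeek-Coder, ...) define their own.
prefix = "def mean(xs):\n    total = "
suffix = "\n    return total / len(xs)\n"

# Left-only: the model can only guess from what comes before the cursor.
left_only_prompt = prefix

# FIM: the model also sees the code after the cursor, so it can complete
# `sum(xs)` knowing the function divides by len(xs) afterwards.
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

print(left_only_prompt)
print(fim_prompt)
```

A plugin that only ever sends the left-only variant, with no context from related files, will make even a strong model look bad.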
I hacked up a slim alternative localpilot.js layer that uses llama-server instead of the copilot API, so copilot.el can be used with local LLMs, but I find the copilot.el overlays kinda buggy...
It'd probably be better to instead write a llamapilot.el for local LLMs from scratch for emacs.
Emacs has had multiple LLM integration packages available for quite a while (relative to the rise of LLMs). `gptel` supports multiple providers, including Anthropic, OpenAI, Ollama, etc.
SSE was first built into a web browser back in 2006. By 2011, it was supported in all major browsers except IE. SSE is really just an enhanced, more efficient version of long polling, which I believe was possible much earlier.
Websocket support was added by all major browsers (including IE) between 2010 and 2012.
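To illustrate why SSE is basically long polling with a nicer wire format: it’s just one long-lived HTTP response with content type `text/event-stream`, where each message is a `data:` line followed by a blank line. A minimal sketch using only the Python standard library (the port and payloads are arbitrary):

```python
# Minimal SSE endpoint sketch using only the standard library.
from http.server import BaseHTTPRequestHandler, HTTPServer
import time

class SSEHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # SSE is just a long-lived HTTP response with a special content type.
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.send_header("Cache-Control", "no-cache")
        self.end_headers()
        for i in range(5):
            # Each event is a "data: ..." line terminated by a blank line.
            self.wfile.write(f"data: tick {i}\n\n".encode())
            self.wfile.flush()
            time.sleep(1)

HTTPServer(("localhost", 8000), SSEHandler).serve_forever()
```

On the browser side, `new EventSource(url)` handles parsing and automatic reconnection, which is most of what SSE adds over hand-rolled long polling.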