My quickie: MoE model heavily optimized for coding agents, complex reasoning, and tool use. 358B/32B active. vLLM/SGLang only supported on the main branch of these engines, not the stable releases. Supports tool calling in OpenAI-style format. Multilingual English/Chinese primary. Context window: 200k. Claims Claude 3.5 Sonnet/GPT-5 level performance. 716GB in FP16, probably ca 220GB for Q4_K_M.
My most important takeaway is that, in theory, I could get a "relatively" cheap Mac Studio and run this locally, and get usable coding assistance without being dependent on any of the large LLM providers. Maybe utilizing Kimik2 in addition. I like that open-weight models are nipping at the feet of the proprietary models.
I bought a second‑hand Mac Studio Ultra M1 with 128 GB of RAM, intending to run an LLM locally for coding. Unfortunately, it's just way too slow.
For instance, an 4‑bit quantized model of GLM 4.6 runs very slowly on my Mac. It's not only about tokens per second speed but also input processing, tokenization, and prompt loading; it takes so much time that it's testing my patience. People often mention about the TPS numbers, but they neglect to mention the input loading times.
At 4 bits that model won't fit into 128GB so you're spilling over into swap which kills performance. I've gotten great results out of glm-4.5-air which is 4.5 distilled down to 110B params which can fit nicely at 8 bits or maybe 6 if you want a little more ram left over.
GPT-oss-120B was also completely failing for me, until someone on reddit pointed out that you need to pass back in the reasoning tokens when generating a response. One way to do this is described here:
Once I did that it started functioning extremely well, and it's the main model I use for my homemade agents.
Many LLM libraries/services/frontends don't pass these reasoning tokens back to the model correctly, which is why people complain about this model so much. It also highlights the importance of rolling these things yourself and understanding what's going on under the hood, because there's so many broken implementations floating around.
I've been running the 'frontier' open-weight LLMs (mainly deepseek r1/v3) at home, and I find that they're best for asynchronous interactions. Give it a prompt and come back in 30-45 minutes to read the response. I've been running on a dual-socket 36-core Xeon with 768GB of RAM and it typically gets 1-2 tokens/sec. Great for research questions or coding prompts, not great for text auto-complete while programming.
Let's say 1.5tok/sec, and that your rig pulls 500 W. That's 10.8 tok/Wh, and assuming you pay, say 15c/kWh means you're paying in the vicinity of $13.8/mtok of output. Looking at R1 output costs on OpenRouter, it's costing about 5-7x as much as what you can pay for third party inference (which also produce tokens ~30x faster).
It's not really an apples-to-apples comparison - I enjoy playing around with LLMs, running different models, etc, and I place a relatively high premium on privacy. The computer itself was $2k about two years ago (and my employer reimbursed me for it), and 99% of my usage is for research questions which have relatively high output per input token. Using one for a coding assistant seems like it can run through a very high number of tokens with relatively few of them actually being used for anything. If I wanted a real-time coding assistant, I would probably be using something that fit in the 24GB of VRAM and would have very different cost/performance tradeoffs.
For what it is worth, I do the same thing you do with local models: I have a few scripts that build prompts from my directions and the contents of one or more local source files. I start a local run and get some exercise, then return later for the results.
I own my computer, it is energy efficient Apple Silicon, and it is fun and feels good to do practical work in a local environment and be able to switch to commercial APIs for more capable models and much faster inference when I am in a hurry or need better models.
Off topic, but: I cringe when I see social media posts of people running many simultaneous agentic coding systems and spending a fortune in money and environmental energy costs. Maybe I just have ancient memories from using assembler language 50 years ago to get maximum value from hardware but I still believe in getting maximum utilization from hardware and wanting to be at least the ‘majority partner’ in AI agentic enhanced coding sessions: save tokens by thinking more on my own and being more precise in what I ask for.
- For polishing Whisper speech to text output, so I can dictate things to my computer and get coherent sentences, or for shaping the dictation to specific format eg. "generate ffmpeg to convert mp4 video to flac with fade in and out, input file is myvideo.mp4 output is myaudio flac with pascal case" -> Whisper -> "generate ff mpeg to convert mp4 video to flak with fade in and out input file is my video mp4 output is my audio flak with pascal case" -> Local LLM -> "ffmpeg ..."
- Doing classification / selection type of work eg. classifying business leads based on the profile
Basically the win for local llm is that the running cost (in my case, second hand M1 Ultra) is so low, I can run large quantity of calls that don't need frontier models.
My comment was not very clear. I specifically meant Claude Code/Codex like workflows where the agent generates/run code interactively with user feedback. My impression is that consumer grade hardware is still too slow for these things to work.
You are right, consumer grade hardware is mostly too slow... although it's a relative thing right. For instance you can get Mac Studio Mx Ultra with 512GB RAM, run GLM-4.5-Air and have a bit of patience. It could work
I was able to run a batch job that lasted ~2 weeks of inference time on my m4 max by running it over night against a large dataset I wanted to mine. It cost me pennies in electricity and writing a simple python script as a scheduler.
This generally isn't true. Cloud vendors have to make back the cost of electricity and the cost of the GPUs. If you already bought the Mac for other purposes, also using it for LLM generation means your marginal cost is just the electricity.
Also, vendors need to make a profit! So tack a little extra on as well.
However, you're right that it will be much slower. Even just an 8xH100 can do 100+ tps for GLM-4.7 at FP8; no Mac can get anywhere close to that decode speed. And for long prompts (which are compute constrained) the difference will be even more stark.
A question on the 100+ tps - is this for short prompts? For large contexts that generate a chunk of tokens at context sizes at 120k+, I was seeing 30-50 - and that's with 95% KV cache hit rate. Am wondering if I'm simply doing something wrong here...
Depends on how well the speculator predicts your prompts, assuming you're using speculative decoding — weird prompts are slower, but e.g. TypeScript code diffs should be very fast. For SGLang, you also want to use a larger chunked prefill size and larger max batch sizes for CUDA graphs than the defaults IME.
Yes they conveniently forget about disclosing prompt processing time. There is an affordable answer to this, will be open sourcing the design and sw soon.
Anything except a 3bit quant of GLM 4.6 will exceed those 128 GB of RAM you mentioned, so of course it's slow for you. If you want good speeds, you'll at least need to store the entire thing in memory.
So Harmony? Or something older? Since Z.ai also claim the thinking mode does tool calling and reasoning interwoven, would make sense it was straight up OpenAI's Harmony.
> in theory, I could get a "relatively" cheap Mac Studio and run this locally
In practice, it'll be incredible slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.
> In practice, it'll be incredible slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.
Yes, as someone who spent several thousand $ on a multi-GPU setup, the only reason to run local codegen inference right now is privacy or deep integration with the model itself.
It’s decidedly more cost efficient to use frontier model APIs. Frontier models trained to work with their tightly-coupled harnesses are worlds ahead of quantized models with generic harnesses.
$10k gets you a Mac Studio with 512GB of RAM, which definitely can run GLM-4.7 with normal, production-grade levels of quantization (in contrast to the extreme quantization that some people talk about).
The point in this thread is that it would likely be too slow due to prompt processing. (M5 Ultra might fix this with the GPU's new neural accelerators.)
> $10k gets you a Mac Studio with 512GB of RAM, which definitely can run GLM-4.7 with normal, production-grade levels of quantization (in contrast to the extreme quantization that some people talk about).
Please do give that a try and report back the prefill and decode speed. Unfortunately, I think again that what I wrote earlier will apply:
> In practice, it'll be incredible slow and you'll quickly regret spending that much money on it
I'd rather place that 10K on a RTX Pro 6000 if I was choosing between them.
No, but the models you will be able to run, will run fast and many of them are Good Enough(tm) for quite a lot of tasks already. I mostly use GPT-OSS-120B and glm-4.5-air currently, both easily fit and run incredibly fast, and the runners haven't even yet been fully optimized for Blackwell so time will tell how fast it can go.
No… that’s not how this works. 96GB sounds impressive on paper, but this model is far, far larger than that.
If you are running a REAP model (eliminating experts), then you are not running GLM-4.7 at that point — you’re running some other model which has poorly defined characteristics. If you are running GLM-4.7, you have to have all of the experts accessible. You don’t get to pick and choose.
If you have enough system RAM, you can offload some layers (not experts) to the GPU and keep the rest in system RAM, but the performance is asymptotically close to CPU-only. If you offload more than a handful of layers, then the GPU is mostly sitting around waiting for work. At which point, are you really running it “on” the RTX Pro 6000?
If you want to use RTX Pro 6000s to run GLM-4.7, then you really need 3 or 4 of them, which is a lot more than $10k.
And I don’t consider running a 1-bit superquant to be a valid thing here either. Much better off running a smaller model at that point. Quantization is often better than a smaller model, but only up to a point which that is beyond.
You don't need a REAP-processed model to offload on a per-expert basis. All MoE models are inherently sparse, so you're only operating on a subset of activated layers when the prompt is being processed. It's more of a PCI bottleneck than a CPU one.
> And I don’t consider running a 1-bit superquant to be a valid thing here either.
Yes, you can offload random experts to the GPU, but it will still be activating experts that are on the CPU, completely tanking performance. It won't suddenly make things fast. One of these GPUs is not enough for this model.
You're better off prioritizing the offload of the KV cache and attention layers to the GPU than trying to offload a specific expert or two, but the performance loss I was talking about earlier still means you're not offloading enough for a 96GB GPU to make things how they need to be. You need multiple, or you need a Mac Studio.
If someone buys one of these $8000 GPUs to run GLM-4.7, they're going to be immensely disappointed. This is my point.
$10k is > 4 years of a $200/mo sub to models which are currently far better, continue to get upgraded frequently, and have improved tremendously in the last year alone.
This almost feels like a retro computing kind of hobby than anything aimed at genuine productivity.
I don't think the calculation is that simple. With your own hardware, there literally is no limits of runtime, or what models you use, or what tooling you use, or availability, all of those things are up to you.
Maybe I'm old school, but I prefer those benefits over some cost/benefit analysis across 4 years which by the time we're 20% through it, everything has changed.
But I also use this hardware for training my own models, not just inference and not just LLMs, I'd agree with you if we were talking about just LLM inference.
Because Apple has not adjusted their pricing yet for the new ram pricing reality. The moment they do, its not going to be a $10k system anymore but in the $15k+...
The amount of wafers going to AI is insane and will influence not just memory prices. Do not forget, the only reason why Apple is currently immunity to this, is because they tend to make long term contracts but the moment those expire ... then will push the costs down consumers.
No, it's not Harmony; Z.ai has their own format, which they modified slightly for this release (by removing the required newlines from their previous format). You can see their tool call parsing code here: https://github.com/sgl-project/sglang/blob/34013d9d5a591e3c0...
Man, really? Why, just why? If it's similar, why not just the same? It's like they're purposefully adding more work for the ecosystem to support their special model instead of just trying to add more value to the ecosystem.
The parser is a small part of running an LLM, and Zai's format is superior to Harmony: it avoids having the model escape JSON in most cases by using XML, so e.g. long code edits are more in-domain compared to pretraining data (where code is typically not nested in JSON and isn't JSON-escaped). FWIW almost everyone has their own format.
Also, Harmony is a mess. The common API specs adopted by the open-source community don't have developer roles, so including one is just bloat for the Responses API no one outside of OpenAI adopted. And why are there two types of hidden CoT reasoning? Harmony tool definition syntax invents a novel programming language that the model has never seen in training, so you need even more post-training to get it to work (Zai just uses JSON Schema). Etc etc. It's just bad.
Re: removing newlines from their old format, it's slightly annoying, but it does give a slight speed boost, since it removes one token per call and one token per argument. Not a huge difference, but not nothing, especially with parallel tool calls.
Sometimes worse is better, I don't really care what the specific format is, just that providers/model releasers would use more of the same, because compatibility sucks when everyone has their very own format. Conveniently for them, it gets harder to compare models when everyone has different formats too.
Whenever reasoning/thinking is involved, 20t/s is way too slow for most non-async tasks, yeah.
Translation, classification, whatever. If the response is 300 tokens for the reasoning and 50 tokens for the final reply, you're sitting and waiting 17,5 seconds for processing one item. In practice, you're also forgetting about prefill, prompt processing, tokenization and such. Please do share all relevant numbers :)
The model output also IMO look significantly more beautiful than GLM-4.6; no doubt in part helped by ample distillation data from the closed-source models. Still, not complaining, I'd much prefer a cheap and open-source model vs. a more-expensive closed-source one.
RAM requirements stay the same. You need all 358B parameters loaded in memory, as which experts activate depends on each token dynamically. The benefit is compute: only ~32B params participate per forward pass, so you get much faster tok/s than a dense 358B would give you.
The benefit is also RAM bandwidth. That probably adds to the confusion, but it matters a lot for decode. But yes, RAM capacity requirements stay the same.
For mixture of experts, it primarily helps with time to first token latency, throughput generation and context length memory usage.
You still have to have enough RAM/VRAM to load the full parameters, but it scales much better for memory consumed from input context than a dense model of comparable size.
Great answers here, in that, for MoE, there's compute saving but no memory savings even tho the network is super-sparse. Turns out, there is a paper on the topic of predicting in advance the experts to be used in the next few layers, "Accelerating Mixture-of-Experts language model inference via plug-and-play lookahead gate on a single GPU".
As to its efficacy, I'd love to know...
It doesn't reduce the amount of RAM you need at all. It does reduce the amount of VRAM/HBM you need, however, since having all parameters/experts in one pass loaded on your GPU substantially increases token processing and generation speed, even if you have to load different experts for the next pass.
Technically you don't even need to have enough RAM to load the entire model, as some inference engines allow you to offload some layers to disk. Though even with top of the line SSDs, this won't be ideal unless you can accept very low single-digit token generation rates.
This model is much stronger than 3.5 sonnet, 3.5 sonnet scored 49% on swe-bench verified vs. 72% here. This model is about 4 points ahead of sonnet4, but behind sonnet 4.5 by 4 points.
If I were to guess, we will see a convergence on measurable/perceptible coding ability sometime early next year without substantially updated benchmarks.
I tested the previous one GLM-4.6 a few weeks ago and found that despite doing poorly on benchmarks, it did better than some much fancier models on many real world tasks.
Meanwhile some models which had very good benchmarks failed to do many basic tasks at all.
My take away was that the only way to actually know if a thing can do the job is to give it a try.
This is true assuming there will be updates consistently. One of the advantages of the proprietary models is that the are updated often EKG and the cutoff date moves into the future
This is important because libraries change, introduce new functionality, deprecate methods and rename things all the time, e.g. Polars.
commentators here are oddly obsessed with local serving imo, it's essentially never practical. it is okay to have to rent a GPU, but open weights are definitely good and important.
There can be quality differences across vendors for the same model due to things like quantization or configuration differences in their backend. By running locally you ensure you have consistency in addition to availability and privacy
i am not saying the desire to be uncoupled from token vendors is unreasonable, but you can rent cloud GPUs and run these models there. running on your own hardware is what seems a little fantastical at least for a reasonable TPS
I don't understand what is going on with people willing to give up their computing sovereignty. You should be able to own and run your own computation, permissionlessly as much as your electricity bill and reasonable usage goes. If you can't do it today, you should aim for it tomorrow.
Stop giving infinite power to these rent-seeking ghouls! Be grateful that open models / open source and semi-affordable personal computing still exists, and support it.
Pertinent example: imagine if two Strix Halo machines (2x128 GB) can run this model locally over fast ethernet. Wouldn't that be cool, compared to trying to get 256 GB of Nvidia-based VRAM in the cloud / on a subscription / whatever terms Nv wants?
I think you and I have a different definition of "obsessed." Would you label anyone interested in repairing their own car as obsessed with DIY?
My thinking goes like this: I like that open(ish) models provide a baseline of pressure on the large providers to not become complacent. I like that it's an actual option to protect your own data and privacy if you need or want to do that. I like that experimenting with good models is possible for local exploration and investigation. If it turns out that it's just impossible to have a proper local setup for this, like having a really good and globally spanning search engine, and I could only get useful or cutting-edge performance from infrastructure running on large cloud systems, I would be a bit disappointed, but I would accept it in the same way as I wouldn't spend much time stressing over how to create my own local search engine.
My most important takeaway is that, in theory, I could get a "relatively" cheap Mac Studio and run this locally, and get usable coding assistance without being dependent on any of the large LLM providers. Maybe utilizing Kimik2 in addition. I like that open-weight models are nipping at the feet of the proprietary models.