Hacker News
DeepSeek R1 671B running on 2 M2 Ultras faster than reading speed (twitter.com/awnihannun)
96 points by thyrox 10 months ago | 29 comments


Someone also got the full Q8 R1 running at 6-8 tok/s on a $6K PC with no GPU: 2x EPYC CPUs and 768GB of DDR5 RAM [1].

It will be interesting to compare value/performance between the next-gen M4 Ultra (or Extreme?) Macs and NVIDIA's new DIGITS [2] when they're released.

[1] https://x.com/carrigmat/status/1884244369907278106

[2] https://www.nvidia.com/en-us/project-digits/
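
As a rough sanity check on that 6-8 tok/s figure, here is a back-of-envelope sketch (assuming DDR5-4800 memory, the 24 channels of a dual-socket EPYC board, and ~37GB of active-expert weights read per token at Q8; these are assumptions, not confirmed specs of that build):

    # Memory-bandwidth-bound estimate for CPU inference on the dual-EPYC box.
    channels = 24                  # 12 DDR5 channels per socket, 2 sockets
    chan_bw_gb_s = 38.4            # DDR5-4800: 4800 MT/s * 8 bytes (assumed speed grade)
    peak_bw_gb_s = channels * chan_bw_gb_s        # ~920 GB/s theoretical peak

    active_gb_per_token = 37       # 37B active params at 8 bits (Q8) ~= 37 GB per token

    print(f"ideal upper bound: {peak_bw_gb_s / active_gb_per_token:.0f} tok/s")
    # ~25 tok/s in theory; the observed 6-8 tok/s is plausible once NUMA and
    # compute overheads are factored in.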



DIGITS will be $3k and have 128GB of unified memory, so don't we already know that it wouldn't compare well with this rig? 128GB won't be enough to fit the model in memory.

As for Apple, we'll see.


They can be linked, e.g. 2x DIGITS can run 405B models [1]. We won't know what value/performance we can get until they start shipping in May.

[1] https://nvidianews.nvidia.com/news/nvidia-puts-grace-blackwe...


What's the memory bandwidth like compared to the above EPYC setup that the tweeter claims has "24 channels of DDR5 RAM"?


Wow!

6 to 8 tokens per second.

And less than a tenth of the cost of a GPU setup.


Nice! Xeon 6 using AMX-BF16/INT8 instructions should be something like 5 times faster than that...


Check out the power draw metrics. Going by the CPU+GPU power consumption, it seems like it averaged 22W for about a minute. Unless I'm missing something, the inference for this example consumed at most 0.0004 kWh.

That's almost nothing. If these models are capable/functional enough for most day-to-day uses, then useful LLM-based GenAI is already at the "too cheap to meter" stage.
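
A quick sanity check of that number (just the arithmetic, using the ~22W average and roughly one minute of generation quoted above):

    # Energy used by the quoted inference run.
    avg_power_w = 22          # average CPU+GPU draw from the metrics
    duration_s = 60           # about a minute of generation

    energy_kwh = avg_power_w * duration_s / 3_600_000   # watt-seconds -> kWh
    print(f"{energy_kwh:.4f} kWh")                       # ~0.0004 kWh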


So it seems like this was actually 7 M2 Ultras, not 2, so 0.0028 kWh?


I am amazed mlx-lm/mlx.distributed works that well on prosumer hardware.

I don't think they specified what they were using for networking, but it was probably Thunderbolt/USB4 networking which can reach 40Gbps.


Please note that it’s using pretty aggressive quantization (around 4 bits per weight)


It's not that aggressive a quantization, considering the full model was trained at only 8 bits.


That doesn't necessarily mean the final weights are 8-bit, though. Tensor core ops are usually mixed precision: the matmul happens in low precision, but accumulation (i.e. the final result) is done in much higher precision to reduce error.

From the DeepSeek-V3 paper:

"For this reason, after careful investigations, we maintain the original precision (e.g., BF16 or FP32) for the following components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators...To further guarantee numerical stability, we store the master weights, weight gradients, and optimizer states in higher precision. "


That's 16x fewer possible values though (and also just 16 possible values full stop). It would be like giving every person on Earth the same shoe size.
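
For a concrete picture of what "16 possible values" means, here's a minimal sketch of group-wise 4-bit quantization (a simplification, not the exact scheme used for this model; real formats also pack the 4-bit values and tune group sizes):

    import numpy as np

    def quantize_4bit(w, group_size=32):
        """Map each group of weights onto 16 integer levels plus a per-group scale/offset."""
        w = w.reshape(-1, group_size)
        lo = w.min(axis=1, keepdims=True)
        hi = w.max(axis=1, keepdims=True)
        scale = np.maximum((hi - lo) / 15.0, 1e-12)      # 16 levels -> 15 steps
        q = np.round((w - lo) / scale).astype(np.uint8)  # each weight is now 0..15 (4 bits)
        return q, scale, lo

    def dequantize(q, scale, lo):
        return q * scale + lo    # every weight snaps back to one of 16 values per group

    w = np.random.randn(1024).astype(np.float32)
    q, scale, lo = quantize_4bit(w)
    w_hat = dequantize(q, scale, lo).reshape(-1)
    print("mean abs rounding error:", np.abs(w - w_hat).mean())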


The Henry Ford school of product


This is amazing!! What kind of applications are you considering for this? Apart from saving on variable costs, extensive fine-tuning, and security… I'm curious to evaluate this from a financial perspective, as variable costs can be daunting, though not too much "yet".

I’m hoping NVIDIA comes up with their new consumer computer soon!


Complete aside, but I think this is the first time I’ve seen Apple’s internal DNS outside of Apple.


scv = Santa Clara Valley


How is this split between two computers?


Heavily quantized…

Still interesting though.


Fascinating to read the thinking process of a flush vs a straight in poker. It's circular nonsense that is not at all grounded in reason — it's grounded in the factual memory of the rules of Poker, repeated over and over as it continues to doubt itself and double-check. What nonsense!

How many additional nuclear power plants will need to be built because even these incredibly technical achievements are, under the hood, morons? XD


[flagged]


That's a hell of a lot cheaper than running the equivalent H100 at home...

And cheaper than a lot of hobbyists' bicycles!


And only twice as expensive as the competing hardware you use to run R1 671B at 3-bit quantization!

Ordinarily Apple customers cough up 3 or 4 times list price to match the performance of an equivalent PC. This is record-setting generosity from Cupertino.


Serious question coming from ignorance — what is the most cost effective way to run this locally, Mac or PC? Please, no fanboyism from either side. My understanding is that Apple's unified memory architecture is a leg up for that platform given the memory needs of these models, versus stringing together lots of NVidia GPUs.

Maybe I'm mistaken! Grateful to be corrected.


I think for $6000 you can run an EPYC setup, but the tokens/sec will be objectively slower than the Macs; what you're paying for with the Macs is speed. I read this [0] on X earlier today, which seems like a good guide on how to get yourself up and running.

[0] https://x.com/i/bookmarks/1884342681590960270?post_id=188424...


The most cost-effective way is arguably to run it off of any 1TB SSD (~$55) attached to whatever computer you already have.

I was able to get 1 token every 6 or 7 seconds (approximately 10 words per minute) on a 400GB quant of the model, while using an SSD that benchmarks at a measly 3GB/s or so. The bottleneck is entirely the speed of the SSD at that level, so an SSD that is twice as fast should make the model run about twice as fast.

Of course, each message you send would have approximately a 1 business day turnaround time… so it might not be the most practical.

With a RAID0 array of two PCIe 5.0 SSDs (~14GB/s each, 28GB/s total), you could potentially get things up to an almost tolerable speed. Maybe 1 to 2 tokens per second.

It’s just such an enormous model that your next best option is like $6000 of hardware, as another comment mentioned, and that is probably going to be significantly slower than the two M2 Ultra Mac Studios featured in the current post. It’s a sliding scale of cost versus performance.

This model has about half as many active parameters as Llama3-70B, since it has 37B active parameters, so it’s actually pretty easy to run computationally… but the catch is that you have to be able to access any 37B of those 671B parameters at any time, so you have to find somewhere fast to store the entire model.
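
The rough arithmetic behind those numbers (assuming a ~400GB quant, ~37B of 671B parameters active per token, and that each token requires reading roughly the active experts' weights from storage):

    # Storage-bandwidth-bound tokens/sec for streaming the MoE model off SSDs.
    model_bytes = 400e9           # ~400GB quantized model
    total_params = 671e9
    active_params = 37e9          # active per token (MoE)

    bytes_per_param = model_bytes / total_params       # ~0.6 bytes (~4.8 bits/weight)
    bytes_per_token = active_params * bytes_per_param  # ~22 GB read per token

    for name, bw in [("single ~3GB/s SSD", 3e9), ("2x PCIe 5.0 in RAID0", 28e9)]:
        print(f"{name}: ~{bw / bytes_per_token:.2f} tok/s")
    # -> roughly 0.14 tok/s (one token every ~7s) and ~1.3 tok/s respectively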


It is a 671B params model, but I guess you already knew that from the title.

But you're right, let's just keep waiting for the town-sized data centers + power plants kindly served up by our big tech overlords.

PS: If you're referring to it being a Mac, obviously you can build a more cost-efficient but harder-to-cool rig.


$14k in 2025 is about $6,400 in 1992 dollars.


There are lots of people with ATVs and hobby motorbikes etc. that cost a good bit more.



