
And only twice as expensive as the competing hardware you use to run R1 671B at 3-bit quantization!

Ordinarily Apple customers cough up 3 or 4 times list price to match the performance of an equivalent PC. This is record-setting generosity from Cupertino.



Serious question coming from ignorance — what is the most cost effective way to run this locally, Mac or PC? Please, no fanboyism from either side. My understanding is that Apple's unified memory architecture is a leg up for that platform given the memory needs of these models, versus stringing together lots of NVidia GPUs.

Maybe I'm mistaken! Grateful to be corrected.


I think for $6000 you can run an EPYC setup, but the tokens/sec will be objectively slower than on the Macs; what the Macs buy you is speed. I read this [0] on X earlier today, which seems like a good guide to getting yourself up and running.

[0] https://x.com/i/bookmarks/1884342681590960270?post_id=188424...
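For a rough sense of why the EPYC route comes out slower, here is a back-of-envelope calculator (my own sketch, not from the linked guide). The bandwidth figures and the ~37B-active-parameter / ~4.5-bits-per-parameter quant assumptions are mine, and real-world throughput will sit well below these theoretical ceilings:

    # Back-of-envelope tokens/sec from memory bandwidth (assumptions, not benchmarks):
    # ~37B active params per token, ~4.5 bits/param quant, theoretical peak bandwidth.
    def est_tokens_per_sec(bandwidth_gb_s, active_params=37e9, bits_per_param=4.5):
        bytes_per_token = active_params * bits_per_param / 8   # ~21 GB read per token
        return bandwidth_gb_s * 1e9 / bytes_per_token

    print(est_tokens_per_sec(461))  # 12-channel DDR5-4800 EPYC board: ~22 tok/s ceiling
    print(est_tokens_per_sec(800))  # M2 Ultra unified memory: ~38 tok/s ceiling

Either way, the active weights have to stream past the compute once per token, so memory bandwidth, not FLOPS, is what you're really paying for.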


The most cost-effective way is arguably to run it off of any 1TB SSD (~$55) attached to whatever computer you already have.

I was able to get 1 token every 6 or 7 seconds (approximately 10 words per minute) on a 400GB quant of the model, while using an SSD that benchmarks at a measly 3GB/s or so. The bottleneck is entirely the speed of the SSD at that level, so an SSD that is twice as fast should make the model run about twice as fast.

Of course, each message you send would have approximately a 1 business day turnaround time… so it might not be the most practical.

With a RAID0 array of two PCIe 5.0 SSDs (~14GB/s each, 28GB/s total), you could potentially get things up to an almost tolerable speed. Maybe 1 to 2 tokens per second.

It’s just such an enormous model that your next best option is like $6000 of hardware, as another comment mentioned, and that is probably going to be significantly slower than the two M2 Ultra Mac Studios featured in the current post. It’s a sliding scale of cost versus performance.

This model has about half as many active parameters as Llama3-70B, since it has 37B active parameters, so it’s actually pretty easy to run computationally… but the catch is that you have to be able to access any 37B of those 671B parameters at any time, so you have to find somewhere fast to store the entire model.
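If it helps, here is that arithmetic written out as a quick sketch, under the same assumptions as above (a ~400GB quant, and roughly 37B of the 671B parameters touched per token; real MoE routing also re-reads attention and shared weights every token, so this is only approximate):

    # Rough check of the SSD-streaming numbers above.
    total_gb = 400                        # size of the quantized model on disk
    gb_per_token = total_gb * 37 / 671    # ~22 GB that must be read per token

    for name, gb_s in [("single SSD", 3), ("RAID0 of two PCIe 5.0 drives", 28)]:
        sec_per_token = gb_per_token / gb_s
        print(f"{name}: {sec_per_token:.1f} s/token, {1 / sec_per_token:.2f} tok/s")
    # single SSD: ~7.4 s/token (the "1 token every 6 or 7 seconds" above)
    # RAID0:      ~0.8 s/token, i.e. roughly 1 to 2 tokens per second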



