Out of sheer curiosity: What’s required for the average Joe to use this, even at...

danielhanchen · 2025-05-28T19:05:16 1748459116

We made DeepSeek R1 run on a local device via offloading and 1.58bit quantization :) https://unsloth.ai/blog/deepseekr1-dynamic

I'm working on the new one!

CamperBob2 · 2025-05-28T19:41:37 1748461297

Your 1.58-bit dynamic quant model is a religious experience, even at one or two tokens per second (which is what I get on my 128 MB Raptor Lake+4090). It's like owning your own genie... just ridiculously smart. Thanks for the work you've put into it!

nxobject · 2025-05-28T23:55:43 1748476543

Likewise - for me, it feels how I imagined getting a microcomputer in the 70s was like. (Including the hit to the wallet… an Apple II cost the 2024 equivalent of ~$5k, too.)

danielhanchen · 2025-05-29T01:14:39 1748481279

:) The good ol days!

danielhanchen · 2025-05-28T19:43:00 1748461380

Oh thank you! :) Glad they were useful!

behnamoh · 2025-05-29T00:08:43 1748477323

> 1.58bit quantization

of course we can run any model if quantize it enough. but I think the OP was talking about the unquantized version.

danielhanchen · 2025-05-29T01:14:08 1748481248

Oh you can still run them unquantized! See https://docs.unsloth.ai/basics/llama-4-how-to-run-and-fine-t... where we show you can offload all MoE layers to system RAM, and leave non MoE layers on the GPU - the speed is still pretty good!

You can do it via `-ot ".ffn_.*_exps.=CPU"`

behnamoh · 2025-05-29T03:09:22 1748488162

Thanks, I'll try it! I guess "mixing" GPU+CPU would hurt the perf tho.

screaminghawk · 2025-05-28T19:37:04 1748461024

I use this a lot! Thanks for your work and looking forward to the next one

danielhanchen · 2025-05-28T19:43:13 1748461393

Thank you!! New versions should be much better!

terhechte · 2025-05-28T18:31:16 1748457076

You can run the 4bit quantized version of it on a M3 Ultra 512GB. That's quite expensive though. Another alternative is a fast CPU with 500GB of DDR5 RAM. That of course, is also not cheap and slower than the M3 Ultra. Or, you buy multiple Nvidia cards to reach ~500GB of VRam. That is probably the most expensive option but also the fastest

lodovic · 2025-05-28T19:59:25 1748462365

If you use the excess memory for AI only it's cheaper to rent . A single H100 costs less than $2 per hour. (incl power)

diggan · 2025-05-28T20:25:34 1748463934

Vast.ai has a bunch of 1x H100 SXM available, right now the cheapest at $1.554/hr.

Not affiliated, just a (mostly) happy user, although don't trust the bandwidth numbers, lots of variance (not surprising though, it is a user-to-user marketplace).

qingcharles · 2025-05-29T17:26:43 1748539603

Every time someone asks me what hardware to buy to run these at home I show them how many thousands of hours at vast.ai you could get for the same cost.

I don't even know how these Vast servers make money because there is no way you can ever pay off your hardware from the pennies you're getting.

omneity · 2025-05-28T22:23:33 1748471013

Worth mentioning that a single H100 (80-96GB) is not enough to run R1. You're looking at 6-8 GPUs on the lower end, and factor in the setup and download time.

An alternative is to use serverless GPU or LLM providers which abstract some of this for you, albeit at a higher cost and slow starts when you first use your model for some time.

zackangelo · 2025-05-29T06:37:08 1748500628

Yeah, to run the full precision model you need either two 8xH100 nodes connected via Infiniband or one 8xH200 node or one 8xB200 node.

Not for the GPU poor, to be sure.

girvo · 2025-05-29T00:25:32 1748478332

It is enough to run the dynamically quantised 1.56 bit version I believe, which is fun to play around with.

behohippy · 2025-05-28T18:31:13 1748457073

About 768 gigs of ddr5 RAM in a dual socket server board with 12 channel memory and an extra 16 gig or better GPU for prompt processing. It's a few grand just to run this thing at 8-10 tokens/s

wongarsu · 2025-05-28T22:58:18 1748473098

About $8000 plus the GPU. Let's throw in a 4080 for about $1k, and you have the full setup for the price of 3 RTX5090. Or cheaper than a single A100. That's not a bad deal.

For the hobby version you would presumably buy a used server and a used GPU. DDR4 ECC Ram can be had for a little over $1/GB, so you could probably build the whole thing for around $2k

JKCalhoun · 2025-05-29T01:37:22 1748482642

Been putting together a "mining rig" [1] (or rather I was before the tariffs, ha ha.) Going to try to add a 2nd GPU soon. (And I should try these quantized versions.)

Mobo was some kind of mining rig from AliExpress for less than $100. GPU is an inexpensive NVIDIA TESLA card that I 3D printed a shroud for (added fans). Power supply a cheap 2000 Watt Dell server PS off eBay....

[1] https://bsky.app/profile/engineersneedart.com/post/3lmg4kiz4...

phonon · 2025-05-28T23:33:50 1748475230

This is the state of the art for such a setup. Really good performance!

https://github.com/kvcache-ai/ktransformers

mechagodzilla · 2025-05-28T19:53:32 1748462012

I have a $2k used dual-socket xeon with 768GB of DDR4 - It runs at about 1.5 tokens/sec for the 4-bit quantized version.

hu3 · 2025-05-28T18:33:57 1748457237

It's probably going to be free at OpenRouter.

There's already a 685B parameter DeepSeek V3 for free there.

https://openrouter.ai/deepseek/deepseek-chat-v3-0324:free

latchkey · 2025-05-28T18:52:53 1748458373

It is free to use, but you're feeding OR data and someone is profiting off that.

ankit219 · 2025-05-28T19:02:47 1748458967

Thats how a lot of application layer startups are going to make money. There is a bunch of high quality usage data. Either you monetize it yourself (cursor), get acquired (windsurf) or provide that data to others at a fee (lmsys, mercor). This is inevitable and a market for this is just going to increaase. If you want to prevent this as an org, there arent many ways out. Either use open source models you can deploy, or deal directly with model providers where you can sign specific contracts.

85392_school · 2025-05-28T19:08:01 1748459281

You're actually sending data to random GPUs connected to one of the Bittensor subnets that run LLMs.

latchkey · 2025-05-28T19:37:58 1748461078

That can, today, collect that data and sell it. There is work being done to add TEE, but it isn't live yet.

dist-epoch · 2025-05-28T19:24:52 1748460292

Not every prompt is privacy sensitive.

For example you could use it to summarize a public article.

latchkey · 2025-05-28T19:43:51 1748461431

Every prompt is valuable.

criddell · 2025-05-28T20:01:21 1748462481

And you are getting something valuable in return. It's probably a good trade for many, especially when they are doing something like summarizing a public article.

jacob019 · 2025-05-28T22:15:18 1748470518

I'm not so sure. I have agents that do categorization work. Take a title, drill through a browse tree to find the most applicable leaf category. Lots of other classification tasks that are not particularly sensitive and it's hard to imagine them being very good for training. Also transformations of anonymized numerical data, parsing, etc.

latchkey · 2025-05-29T17:04:56 1748538296

"one man's garbage is another man's treasure"

dist-epoch · 2025-05-28T20:01:54 1748462514

Using an AI for free is also valuable. Seems win/win.

latchkey · 2025-05-29T17:06:48 1748538408

This isn’t about reciprocal value. Even if something isn't privacy sensitive, it still holds value.

SkyPuncher · 2025-05-28T18:54:56 1748458496

Practically, smaller, quantized versions of R1 can be run on a pretty typically Macbook Pro setup. Quantized versions are definitely less performant, but they will absolutely run.

Truthfully, it's just not worth it. You either run these things so slowly that you're wasting your time or you have to buy 4- or 5-figures of hardware that's going to sit, mostly unused.

hadlock · 2025-05-28T18:37:30 1748457450

As mentioned you can run this on a server board with 768+ gb memory in cpu mode. Average joe is going to be running quantized 30b (not 600b+) models on an $300/$400/$900 8/12/16gb GPU

rahimnathwani · 2025-05-28T18:44:03 1748457843

I'm not sure that's enough RAM to run it at full precision (FP8).

This guy ran a 4-bit quantized version with 768GB RAM: https://news.ycombinator.com/item?id=42897205

jacob019 · 2025-05-28T18:24:51 1748456691

I'm sure it will be on OpenRouter within the next day or so. Not really practical to run a 685B param model at home.

jazzyjackson · 2025-05-28T23:28:08 1748474888

You can pay Amazon to do it for you at about a penny per 10 thousand tokens.

There's a couple of guides for setting it up "manually" on ec2 instances so you're not paying the Bedrock per-token-prices, here's [1] that states four g6e.48xlarge instances (192 vCPUs, 1536GB RAM, 8x L40S Tensor Core GPUs that come with 48 GB of memory per GPU)

Quick google tells me that g6e.48xlarge is something like 22k USD per month?

[0] https://aws.amazon.com/bedrock/deepseek/

[1] https://community.aws/content/2w2T9a1HOICvNCVKVRyVXUxuKff/de...

z2 · 2025-05-28T21:01:35 1748466095

Hardware: any computer from the last 20 or so years.

Software: client of choice to https://openrouter.ai/deepseek/deepseek-r1-0528

Sorry I'm being cheeky here, but realistically unless you want to shell out 10k for the equivalent of a Mac Studio with 512GB of RAM, you are best using other services or a small distilled model based on this one.

threeducks · 2025-05-28T20:43:48 1748465028

> even at a glacial pace

If speed is truly not an issue, you can run Deepseek on pretty much any PC with a large enough swap file, at a speed of about one token every 10 minutes assuming a plain old HDD.

Something more reasonable would be a used server CPU with as many memory channels as possible and DDR4 ram for less than $2000.

But before spending big, it might be a good idea to rent a server to get a feel for it.

whynotmaybe · 2025-05-29T21:54:45 1748555685

I'm using GPT4All with DeepSeek-R1-Distill-QWen-7B (which is not R1-0528) on a Ryzen 5 3600 with 32Gb ram.

With an average of 3.6 tokens/sec, answers usually take 150-200 seconds.