
Where do you run a trillion-param model?




If you want to do it at home, ik_llama.cpp has some performance optimizations that make it semi-practical to run a model of this size on a server with lots of memory bandwidth and a GPU or two for offload. You can get 6-10 tok/s with modest workstation hardware. Thinking chews up a lot of tokens though, so it will be a slog.
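For intuition on why memory bandwidth is the limiter: during decode, every generated token has to stream the active expert weights out of RAM, so tok/s is roughly effective bandwidth divided by bytes read per token. A back-of-envelope sketch in Python (the ~32B active params, ~4.5 bits/weight quant, and 150 GB/s effective bandwidth are my assumptions, not numbers from this comment):

    # Rough decode speed for a bandwidth-bound MoE model.
    # Assumptions (mine, not the commenter's): ~32B active params
    # per token, ~4.5 bits/weight quantized, ~150 GB/s realized.
    active_params = 32e9        # params actually read per token (MoE)
    bits_per_weight = 4.5       # a Q4-ish quant plus overhead
    eff_bandwidth = 150e9       # effective, not theoretical, bytes/s

    bytes_per_token = active_params * bits_per_weight / 8
    tok_per_s = eff_bandwidth / bytes_per_token
    print(f"{bytes_per_token / 1e9:.0f} GB/token -> ~{tok_per_s:.1f} tok/s")
    # ~18 GB/token -> ~8.3 tok/s, in line with the 6-10 tok/s figure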

What kind of server have you used to run a trillion parameter model? I'd love to dig more into this.

Hi Simon. I have a Xeon W5-3435X with a 768GB of DDR5 across 8 channels, iirc it's running at 5800MT/s. It also has 7x A4000s, water cooled to pack them into a desktop case. Very much a compromise build, and I wouldn't recommend Xeon sapphire rapids because the memory bandwidth you get in practice is less than half of what you'd calculate from the specs. If I did it again, I'd build an EPYC machine with 12 channels of DDR5 and put in a single rtx 6000 pro blackwell. That'd be a lot easier and probably a lot faster.
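To put numbers on that comparison: peak DDR5 bandwidth is channels x MT/s x 8 bytes per transfer. A quick sketch (the <50% derating is the commenter's Sapphire Rapids observation applied as a flat factor; the 12-channel DDR5-4800 EPYC config is my assumption for the hypothetical build):

    # Peak DDR5 bandwidth = channels * MT/s * 8 bytes per transfer.
    def peak_gbs(channels, mts):
        return channels * mts * 8 / 1000  # GB/s

    xeon = peak_gbs(8, 5800)    # the W5-3435X build described above
    epyc = peak_gbs(12, 4800)   # assumed 12-channel EPYC build

    print(f"Xeon theoretical:  {xeon:.0f} GB/s")        # ~371 GB/s
    print(f"Xeon in practice: <{xeon * 0.5:.0f} GB/s")  # per the comment
    print(f"EPYC theoretical:  {epyc:.0f} GB/s")        # ~461 GB/s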

There's a really good thread on level1techs about running DeepSeek at home, and everything there more-or-less applies to Kimi K2.

https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-hom...


If I had to guess, I'd say it's one with lots of memory bandwidth and a GPU or two for offload. (sorry, I had to, happy Friday Jr.)

You let the people at OpenRouter worry about that for you

Which in turn lets the people at Moonshot AI worry about that for them, the only provider for this model as of now.
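For anyone going that route, OpenRouter exposes an OpenAI-compatible endpoint, so querying the model is a few lines. A minimal sketch; the "moonshotai/kimi-k2" model slug is my guess at the ID, so verify it on openrouter.ai:

    # Minimal OpenRouter query via the OpenAI-compatible API.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # your OpenRouter key
    )
    resp = client.chat.completions.create(
        model="moonshotai/kimi-k2",  # assumed slug for Kimi K2
        messages=[{"role": "user", "content": "Hello?"}],
    )
    print(resp.choices[0].message.content)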

Good people over there



