
Where do you run a trillion-param model?




If you want to do it at home, ik_llama.cpp has some performance optimizations that make it semi-practical to run a model of this size on a server with lots of memory bandwidth and a GPU or two for offload. You can get 6-10 tok/s with modest workstation hardware. Thinking chews up a lot of tokens though, so it will be a slog.
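For intuition on why memory bandwidth is the limiter: during decode, every generated token has to stream the active expert weights out of RAM, so tok/s is roughly effective bandwidth divided by bytes read per token. A back-of-envelope sketch in Python (the ~32B active params, ~4.5 bits/weight quant, and 150 GB/s effective bandwidth are my assumptions, not numbers from this comment):

    # Rough decode speed for a bandwidth-bound MoE model.
    # Assumptions (mine, not the commenter's): ~32B active params
    # per token, ~4.5 bits/weight quantized, ~150 GB/s realized.
    active_params = 32e9        # params actually read per token (MoE)
    bits_per_weight = 4.5       # a Q4-ish quant plus overhead
    eff_bandwidth = 150e9       # effective, not theoretical, bytes/s

    bytes_per_token = active_params * bits_per_weight / 8
    tok_per_s = eff_bandwidth / bytes_per_token
    print(f"{bytes_per_token / 1e9:.0f} GB/token -> ~{tok_per_s:.1f} tok/s")
    # ~18 GB/token -> ~8.3 tok/s, in line with the 6-10 tok/s figure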

What kind of server have you used to run a trillion parameter model? I'd love to dig more into this.

Hi Simon. I have a Xeon W5-3435X with a 768GB of DDR5 across 8 channels, iirc it's running at 5800MT/s. It also has 7x A4000s, water cooled to pack them into a desktop case. Very much a compromise build, and I wouldn't recommend Xeon sapphire rapids because the memory bandwidth you get in practice is less than half of what you'd calculate from the specs. If I did it again, I'd build an EPYC machine with 12 channels of DDR5 and put in a single rtx 6000 pro blackwell. That'd be a lot easier and probably a lot faster.
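To put numbers on that comparison: peak DDR5 bandwidth is channels x MT/s x 8 bytes per transfer. A quick sketch (the <50% derating is the commenter's Sapphire Rapids observation applied as a flat factor; the 12-channel DDR5-4800 EPYC config is my assumption for the hypothetical build):

    # Peak DDR5 bandwidth = channels * MT/s * 8 bytes per transfer.
    def peak_gbs(channels, mts):
        return channels * mts * 8 / 1000  # GB/s

    xeon = peak_gbs(8, 5800)    # the W5-3435X build described above
    epyc = peak_gbs(12, 4800)   # assumed 12-channel EPYC build

    print(f"Xeon theoretical:  {xeon:.0f} GB/s")        # ~371 GB/s
    print(f"Xeon in practice: <{xeon * 0.5:.0f} GB/s")  # per the comment
    print(f"EPYC theoretical:  {epyc:.0f} GB/s")        # ~461 GB/s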

There's a really good thread on level1techs about running DeepSeek at home, and everything there more-or-less applies to Kimi K2.

https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-hom...


If I had to guess, I'd say it's one with lots of memory bandwidth and a GPU or two for offload. (sorry, I had to, happy Friday Jr.)

You let the people at OpenRouter worry about that for you

Which in turn lets the people at Moonshot AI worry about that for them, the only provider for this model as of now.
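For anyone going that route, OpenRouter exposes an OpenAI-compatible endpoint, so querying the model is a few lines. A minimal sketch; the "moonshotai/kimi-k2" model slug is my guess at the ID, so verify it on openrouter.ai:

    # Minimal OpenRouter query via the OpenAI-compatible API.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # your OpenRouter key
    )
    resp = client.chat.completions.create(
        model="moonshotai/kimi-k2",  # assumed slug for Kimi K2
        messages=[{"role": "user", "content": "Hello?"}],
    )
    print(resp.choices[0].message.content)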

Good people over there



