
KV cache for dense models is on the order of 50% of the parameter size. For sparse MoE models it can be significantly smaller relative to total parameters, I believe, but I don't think it is measured in KB.
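For a rough sense of scale, here is a back-of-the-envelope sketch using the standard per-token KV cache formula (keys plus values, per layer, per KV head). The config numbers are hypothetical, picked only for illustration, and the function name is mine. The expert count does not appear in the formula, which is one reason the cache can look small next to the total parameter count of an MoE model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage (an assumption, not a given).
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


# Hypothetical dense config: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(32, 32, 128, n_tokens=1)
full_context = kv_cache_bytes(32, 32, 128, n_tokens=4096)

print(f"per token:   {per_token / 1024:.0f} KiB")
print(f"4096 tokens: {full_context / 1024**3:.1f} GiB")
```

Under these assumed numbers a single token already costs hundreds of KiB, and a few thousand tokens of context lands in the GiB range. Fewer KV heads (grouped-query attention) or a lower-precision cache would shrink this proportionally.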

