Jamba looks fabulous. Good performance for its size and much more efficient than the available open alternatives.
The key idea: one out of every eight transformer blocks in Jamba applies dot-product attention with quadratic cost, while the other seven apply a Mamba layer with linear cost. The model is also a mixture of experts (MoE), so only ~12B of its parameters are active for any given inference.
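To make that layout concrete, here is a minimal PyTorch sketch of the pattern described above. It is not AI21's implementation: the Mamba layer is replaced by a simple linear stand-in, the block count, expert count, and dimensions are made up, and the toy MoE computes every expert and masks the results, whereas a real MoE would run only the selected experts.

```python
# Illustrative sketch only: 1-in-8 attention blocks, Mamba elsewhere,
# with a toy top-2 mixture-of-experts feed-forward in every block.
import torch
import torch.nn as nn


class MambaStandIn(nn.Module):
    """Placeholder for a real Mamba (selective state-space) layer,
    which mixes tokens in time linear in sequence length."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)  # stand-in, not a real SSM

    def forward(self, x):
        return self.proj(x)


class SelfAttention(nn.Module):
    """Standard dot-product attention: quadratic in sequence length."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out


class MoEFeedForward(nn.Module):
    """Toy top-2 MoE MLP: each token is routed to 2 of n_experts, so only
    a fraction of the feed-forward parameters matters per token. For
    simplicity this version evaluates all experts and masks the outputs."""
    def __init__(self, d_model, n_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model),
                           nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_experts)]
        )
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):
        scores = self.router(x)                         # (B, T, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)  # tokens routed to expert e
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out


class HybridBlock(nn.Module):
    """One layer: a sequence mixer (attention or Mamba stand-in)
    followed by an MoE feed-forward, each with a residual connection."""
    def __init__(self, d_model, use_attention):
        super().__init__()
        self.mixer = SelfAttention(d_model) if use_attention else MambaStandIn(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = MoEFeedForward(d_model)

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x


class HybridStack(nn.Module):
    """Every 8th block mixes tokens with attention; the other 7 use the
    Mamba stand-in, mirroring the 1-in-8 ratio described above."""
    def __init__(self, d_model=256, n_blocks=16, attn_every=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [HybridBlock(d_model, use_attention=(i % attn_every == attn_every - 1))
             for i in range(n_blocks)]
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


if __name__ == "__main__":
    x = torch.randn(2, 64, 256)       # (batch, seq_len, d_model)
    print(HybridStack()(x).shape)     # torch.Size([2, 64, 256])
```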
Thank you to the folks at AI21 for making Jamba available!
Mamba came out of the same research group, Hazy Research, led by Chris Ré. This new "Jamba" model, which combines Mamba and dot-product attention layers, has ~8x more parameters than the largest open Striped Hyena and appears to work much better.
https://www.ai21.com/blog/announcing-jamba