Jamba - Mamba & Transformers & MoE

Motivation§

The field of ML has made remarkable progress over the last 6 years, and a lot of the credit goes to one very capable architecture: the Transformer. Transformer-based models perform very well, but they are not very efficient in memory or compute. Recently another architecture, Mamba, has shown real promise on both of those fronts, but it lags behind the Transformer in performance.

One might ask: can the two be combined into a novel architecture in which each guards against the other's shortcomings, so that the combined model is as capable as a Transformer, or better? The Jamba paper explores exactly that.

Background§

Let's compare Mamba and the Transformer head to head.

| Category | Transformer | Mamba | Comment |
| --- | --- | --- | --- |
| Memory footprint | High | Low | Lower is better |
| Compute requirement | High | Low | Lower is better |
| Trained efficiently | Yes | Yes | - |
| Performance | Best | Lower than Transformer | - |
| Supports very long context | No | Yes | The longer the context, the better |
| In-context learning | Yes | No | - |
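
To make the memory and compute rows concrete, here is a rough back-of-the-envelope sketch (Python, with made-up dimensions that are not Jamba's) of how per-layer inference cost scales with context length: a full-attention layer pays roughly O(n²) compute and keeps a KV cache that grows linearly with n, while a Mamba (selective SSM) layer pays O(n) compute and carries a fixed-size recurrent state.

```python
# Rough scaling sketch (illustrative only; dimensions are made up, not Jamba's).

def transformer_layer_cost(n_ctx, d_model=4096, n_heads=32):
    """Approximate per-layer cost of full self-attention at context length n_ctx."""
    head_dim = d_model // n_heads
    attn_flops = 2 * n_ctx * n_ctx * d_model          # QK^T alone: O(n^2 * d); projections ignored
    kv_cache_floats = 2 * n_ctx * n_heads * head_dim  # K and V grow linearly with n_ctx
    return attn_flops, kv_cache_floats

def mamba_layer_cost(n_ctx, d_model=4096, d_state=16):
    """Approximate per-layer cost of a selective SSM scan at context length n_ctx."""
    scan_flops = 2 * n_ctx * d_model * d_state        # linear in n_ctx
    state_floats = d_model * d_state                  # fixed-size state, independent of n_ctx
    return scan_flops, state_floats

for n in (4_096, 65_536, 262_144):
    t_flops, t_mem = transformer_layer_cost(n)
    m_flops, m_mem = mamba_layer_cost(n)
    print(f"n={n:>7}: attn FLOPs {t_flops:.2e}, KV floats {t_mem:.2e} | "
          f"mamba FLOPs {m_flops:.2e}, state floats {m_mem:.2e}")
```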

Results§

Throughput§

Jamba shows impressive throughput at long context lengths; the large Llama model does not even reach these context lengths. It bears mentioning that the released Jamba model supports a context length of 256K tokens and has been tested on up to 1M tokens. (Figure: throughput vs. context length.)

Memory Footprint§

Jamba can work with roughly 1/8th the KV-cache size of a comparable pure Transformer-based model.
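
One intuition for the factor of ~8 (ignoring grouped-query attention and head-count differences, so treat this as a sanity check rather than the paper's exact accounting): only 1 in 8 layers of the released configuration is an attention layer, and only attention layers keep a KV cache.

```python
# Sanity check of the ~1/8 KV-cache claim (GQA and head-count details ignored).
total_layers = 4 * 8                    # 4 Jamba blocks x 8 layers each, per the paper
attn_layers_jamba = total_layers // 8   # 1:7 attention-to-Mamba ratio -> 4 attention layers
attn_layers_pure = total_layers         # a pure Transformer caches K/V in every layer

print(attn_layers_jamba / attn_layers_pure)  # 0.125, i.e. 1/8 of the cache
```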

Performance§

Jamba is decent at reasoning: it wins 3 out of the 5 categories, but where it loses, it loses to the winner by a wide margin, while its own wins are by a small margin. It does quite poorly on the comprehension and aggregate categories.

It does well on needle-in-a-haystack tests.

Architecture§

The model has 12B active parameters out of 52B total parameters. Training runs of up to 250B tokens are reported.

Each Jamba block interleaves attention and Mamba layers and replaces some of the MLPs with MoE layers; the full architecture stacks 4 such blocks.
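
To make the layer layout concrete, here is a minimal PyTorch-style sketch of one such block, assuming the 1:7 attention-to-Mamba ratio and an MoE MLP every other layer described in the paper. Everything else (the dimensions, the gated-MLP stand-in for the Mamba layer, the dense toy MoE router) is a placeholder of mine, not the reference implementation.

```python
import torch
import torch.nn as nn

class MambaLayer(nn.Module):
    """Stand-in for a Mamba (selective SSM) layer; here just a gated linear map."""
    def __init__(self, d_model):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
    def forward(self, x):
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        return self.out_proj(u * torch.sigmoid(gate))

class AttentionLayer(nn.Module):
    """Plain multi-head self-attention (no KV-cache or masking details)."""
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    def forward(self, x):
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

class MoEMLP(nn.Module):
    """Toy MoE MLP: mixes a few experts densely (the paper uses 16 experts, top-2 routing)."""
    def __init__(self, d_model, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
    def forward(self, x):
        weights = self.router(x).softmax(dim=-1)                         # (B, T, E)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, E)
        return (expert_outs * weights.unsqueeze(-2)).sum(-1)             # dense mix for simplicity

class JambaBlock(nn.Module):
    """8 layers: 7 Mamba + 1 attention, with an MoE MLP replacing the plain MLP every other layer."""
    def __init__(self, d_model, n_layers=8, attn_every=8, moe_every=2):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            mixer = AttentionLayer(d_model) if i % attn_every == attn_every - 1 else MambaLayer(d_model)
            mlp = MoEMLP(d_model) if i % moe_every == 1 else nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            self.layers.append(nn.ModuleList([nn.LayerNorm(d_model), mixer,
                                              nn.LayerNorm(d_model), mlp]))
    def forward(self, x):
        for norm1, mixer, norm2, mlp in self.layers:
            x = x + mixer(norm1(x))   # token mixing (attention or Mamba)
            x = x + mlp(norm2(x))     # channel mixing (MLP or MoE)
        return x

# The released model stacks 4 such blocks.
model = nn.Sequential(*[JambaBlock(d_model=256) for _ in range(4)])
x = torch.randn(2, 16, 256)
print(model(x).shape)  # torch.Size([2, 16, 256])
```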

Other findings§

  • Pure Mamba struggles with in-context learning.
  • Jamba does not need positional embeddings.
  • The ratio of attention to Mamba layers can be 1:3 or 1:7 with no noticeable performance difference. This ablation was done on 1.3B-parameter models trained on 250B tokens.
  • The paper claims that the Jamba

Thoughts§

The research shows that the Transformer and Mamba can be combined in meaningful ways to create models that are efficient in training compute and memory, but the Transformer is still the king when it comes to state-of-the-art performance. Whether it will remain the crownless king remains to be seen. Mamba is getting some love from the community, and time will tell whether a hybrid architecture can end the Transformer's reign.

Reference§

  1. Jamba: A Hybrid Transformer-Mamba Language Model. https://arxiv.org/pdf/2403.19887.pdf