Jamba - Mamba & Transformers & MoE
Motivation§
The field of ML has made remarkable progress in the last six years, and a lot of the credit goes to one architecture: the Transformer. Transformer-based models are very capable but not very efficient in memory or compute. Recently another architecture, Mamba, has shown real promise on the efficiency and memory front, but it still lags behind in performance.
One might ask: can the two be combined into a novel architecture, where each guards against the other's shortcomings, so that the combined model is as capable as a Transformer or better? The Jamba paper explores just that.
Background§
Let's compare Mamba and Transformer head to head.
| Category | Transformer | Mamba | Comment |
|---|---|---|---|
| Memory footprint | High | Low | Lower is better |
| Compute requirement | High | Low | Lower is better |
| Trains efficiently | Yes | Yes | - |
| Performance | Best | Lower than Transformer | - |
| Supports very long context | No | Yes | The longer the context, the better |
| In-context learning | Yes | No | - |
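
To make the memory and long-context rows concrete, here is a rough back-of-the-envelope sketch. The dimensions are made up for illustration (they are not Jamba's or any particular model's): a Transformer's KV cache grows linearly with context length, while a Mamba layer keeps a fixed-size recurrent state no matter how long the context gets.

```python
# Rough inference-memory sketch (hypothetical dimensions, not from the paper).
# A Transformer layer caches a key and a value vector per token,
# while a Mamba layer keeps a fixed-size SSM state per layer.

def transformer_kv_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # 2x for keys and values; grows linearly with context_len.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

def mamba_state_bytes(n_layers, d_model, d_state, expand=2, bytes_per_elem=2):
    # Recurrent state is roughly (expand * d_model) x d_state per layer;
    # independent of context length.
    return n_layers * expand * d_model * d_state * bytes_per_elem

for ctx in (4_096, 64_000, 256_000):
    kv = transformer_kv_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=ctx)
    ssm = mamba_state_bytes(n_layers=32, d_model=4096, d_state=16)
    print(f"context={ctx:>7,}  KV cache ~ {kv / 2**30:6.2f} GiB   Mamba state ~ {ssm / 2**20:5.2f} MiB")
```

With these (invented) sizes the KV cache goes from about 0.5 GiB at 4K context to tens of GiB at 256K, while the Mamba state stays at a few MiB throughout, which is the whole argument behind the first and fifth rows of the table.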
Results§
Throughput§
Jamba sustains impressive throughput at high context lengths; the large Llama model in the comparison does not even reach those context lengths. It bears mentioning that the released Jamba model supports a context length of 256K tokens and has been tested on up to 1M tokens.
Memory Footprint§

Jamba can work with 1/8th the KV cache size compared to a pure Transformer-based model.
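
The 1/8 figure falls out of the layer mix rather than anything exotic: only attention layers need a KV cache, and Jamba uses one attention layer for every seven Mamba layers. A quick sanity check, assuming an equal-depth pure-Transformer baseline (the depth below is illustrative, not the paper's):

```python
# Why ~1/8 KV cache: only attention layers carry a KV cache.
# Assumes an equal-depth pure-Transformer baseline; the depth is illustrative.

total_layers = 32               # hypothetical depth, same for both models
attn_ratio = 1 / 8              # Jamba: 1 attention layer per 8 layers (1:7 ratio)

pure_transformer_kv_layers = total_layers          # every layer caches K/V
jamba_kv_layers = int(total_layers * attn_ratio)   # only the attention layers do

print(jamba_kv_layers / pure_transformer_kv_layers)  # 0.125 -> 1/8 of the KV cache
```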

Performance§
Jamba is decent at reasoning: it wins on 3 out of the 5 categories, but only by small margins, while on the two categories it loses, it loses by a wide margin to the winner. It does quite poorly on the comprehension and aggregate categories.
It does well on needle-in-a-haystack problems.
Architecture§
Jamba has 12B active parameters out of 52B total parameters. The ablation training runs in the paper go up to 250B tokens.

The architecture stacks 4 such Jamba blocks; each block is 8 layers, interleaving attention and Mamba layers at a 1:7 ratio, with MoE replacing the MLP in every other layer.
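
Reading the configuration reported in the paper (4 blocks, 8 layers per block, a 1:7 attention-to-Mamba ratio, MoE every other layer with 16 experts and top-2 routing), the stacking order can be sketched roughly as below. This is only a schematic of the layer schedule, not a model implementation, and the exact positions of the attention and MoE layers within a block are my assumption.

```python
# Schematic of the Jamba stacking order, based on the configuration reported
# in the paper: 4 Jamba blocks, 8 layers per block, 1 attention layer per
# 7 Mamba layers, MoE replacing the MLP in every other layer
# (16 experts, top-2 routing). Prints the schedule only.

N_BLOCKS = 4          # Jamba blocks stacked to form the model
LAYERS_PER_BLOCK = 8  # layers per Jamba block
ATTN_EVERY = 8        # 1:7 attention-to-Mamba ratio -> 1 attention layer per 8
MOE_EVERY = 2         # MoE replaces the MLP in every other layer

def layer_schedule():
    schedule = []
    for block in range(N_BLOCKS):
        for i in range(LAYERS_PER_BLOCK):
            # Assumption: the single attention layer sits mid-block; the paper
            # fixes the ratio, the exact position here is a guess.
            mixer = "attention" if i == ATTN_EVERY // 2 else "mamba"
            # Assumption: which parity of layers gets MoE is also a guess.
            mlp = "moe(16 experts, top-2)" if i % MOE_EVERY == 1 else "mlp"
            schedule.append((block, i, mixer, mlp))
    return schedule

for block, i, mixer, mlp in layer_schedule():
    print(f"block {block} | layer {i} | {mixer:9s} + {mlp}")
```

The top-2-of-16 routing is also where the 12B active vs 52B total parameter split comes from: every token only passes through a small subset of the expert weights.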
Other findings§
- Pure Mamba struggles with in-context learning.
- Jamba does not need positional embeddings.
- The ratio of attention to Mamba layers can be 1:3 or 1:7 with no noticeable performance difference. This ablation was done on a 1.3B-parameter model trained on 250B tokens.
- The paper claims that the Jamba
Thoughts§
The research shows that the Transformer and Mamba can be combined in meaningful ways to create models that are efficient in training compute and memory, but Transformers are still the king when it comes to state-of-the-art performance. Whether they will hold on to that crown remains to be seen. Mamba is getting some love from the community, and time will tell whether a hybrid architecture can end the reign of the Transformer.
Reference§
- Jamba: A Hybrid Transformer-Mamba Language Model, https://arxiv.org/pdf/2403.19887.pdf