Why KV Cache Is the Bottleneck in Long-Context Inference
Every token you generate reads from the KV cache: the stored keys and values for every past token at every layer. In standard multi-head attention (MHA) with 128 heads of dimension 128, that's 2 × 128 × 128 = 32,768 floats per token per layer (keys plus values). Run a 100-layer model on a 128K-token context (taking 128K as 128,000) and you need 100 × 128,000 × 32,768 × 2 bytes ≈ 838 GB of KV cache alone.
This is why long-context inference is so expensive: the model parameters are a fixed cost that fits in GPU memory, but the KV cache grows linearly with sequence length, all the way up to the context limit. Batching multiple users together makes it even worse, because every concurrent request needs its own KV cache.
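The arithmetic above can be wrapped in a small helper to see how batch size multiplies the cost. This is a minimal sketch: the function name `kv_cache_bytes` and its parameters are illustrative, and it assumes that 128K means 128,000 tokens and that each cached value takes 2 bytes, which is what reproduces the ~838 GB figure.

```python
def kv_cache_bytes(n_layers, n_tokens, values_per_token_per_layer,
                   bytes_per_value=2, batch_size=1):
    """KV cache size in bytes: every layer stores a fixed number of values
    for every token of every sequence in the batch."""
    return (batch_size * n_layers * n_tokens
            * values_per_token_per_layer * bytes_per_value)

# MHA with 128 heads of dimension 128: keys plus values per token per layer.
mha_values = 2 * 128 * 128  # 32,768

single = kv_cache_bytes(100, 128_000, mha_values)
batch8 = kv_cache_bytes(100, 128_000, mha_values, batch_size=8)

print(f"1 sequence : {single / 1e9:.1f} GB")
print(f"8 sequences: {batch8 / 1e9:.1f} GB")
```

Eight concurrent long-context requests need eight times the cache, while the parameter memory stays the same.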
Sequence length × cache size grows fast
[Chart: KV cache memory vs. sequence length for each attention variant in a 100-layer model.] Every variant's cache grows linearly with sequence length; what differs is the slope. MHA's cache explodes while MLA's stays nearly flat by comparison.
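At 128K tokens (DeepSeek-V2's context length), MHA in this 100-layer example needs ~838 GB of KV cache per batch entry at 2 bytes per value. MLA, with the 576 values per token per layer derived later in this article, needs only ~14.7 GB. That difference is more than ten 80 GB A100s you'd need to dedicate just to KV storage.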
Compress Down, Expand Up — And Cache Only the Small Vector
The key insight of MLA is embarrassingly simple: keys and values live in a high-dimensional space, but a lower-dimensional latent vector is enough to reconstruct them. Instead of caching the full keys and values, cache the compressed latent.
MHA: k_t = W^K · h_t, v_t = W^V · h_t ← cached: 2 × n_h × d_h floats = 32,768
MLA: c_t^KV = W^DKV · h_t ← cached: d_c floats = 512
k_t = W^UK · c_t^KV ← recomputed at decode time
v_t = W^UV · c_t^KV ← recomputed at decode time
The down-projection W^DKV maps from d = 5120 to d_c = 512, squeezing the token representation into a latent that's 10× smaller than the hidden state and only as wide as four attention heads (d_c = 4·d_h). The up-projections W^UK and W^UV expand back to the full 128 heads during the decode step.
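The compress-then-expand flow can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not DeepSeek's implementation: the dimensions are scaled-down stand-ins for the real ones, the weights are random, and the helper names `append_token` and `expand_cache` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down stand-ins for d = 5120, d_c = 512, n_h = 128, d_h = 128.
d, d_c, n_h, d_h = 64, 8, 4, 16

W_DKV = rng.standard_normal((d_c, d))          # down-projection
W_UK = rng.standard_normal((n_h * d_h, d_c))   # up-projection for keys
W_UV = rng.standard_normal((n_h * d_h, d_c))   # up-projection for values

latent_cache = []  # the only per-token state kept between decode steps

def append_token(h_t):
    """Compress the hidden state and cache only the latent."""
    latent_cache.append(W_DKV @ h_t)

def expand_cache():
    """Rebuild per-head keys and values from the cached latents."""
    C = np.stack(latent_cache)                   # (T, d_c)
    K = (C @ W_UK.T).reshape(len(C), n_h, d_h)   # (T, n_h, d_h)
    V = (C @ W_UV.T).reshape(len(C), n_h, d_h)   # (T, n_h, d_h)
    return K, V

for _ in range(5):
    append_token(rng.standard_normal(d))

K, V = expand_cache()
cached = sum(c.size for c in latent_cache)
print("cached values:", cached, "vs full K and V:", K.size + V.size)
```

Only `latent_cache` persists across steps; the full keys and values are temporary and can be rebuilt from it at any time.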
The same principle applies to queries: a compressed query latent c^Q (1536-dim) is computed first, then up-projected to the 128-head query space. This saves activation memory during training but is less critical for inference speed (queries aren't cached).
Merge W^UK Into W^Q: Eliminate Key Materialization Entirely
Here's where MLA gets clever. During inference, the attention score between the current token t and a past token j is:
score = q_t^T · k_j
The query q_t itself comes from multiplying W^UQ by the compressed query latent, and the key k_j from multiplying W^UK by the cached KV latent. So the full chain is:
score = (W^UQ · c_t^Q)^T · (W^UK · c_j^KV) = (c_t^Q)^T · (W^UQ)^T · W^UK · c_j^KV
The product (W^UQ)^T · W^UK doesn't depend on the current position, so it can be fused into a single weight matrix W^Q' = (W^UQ)^T · W^UK (one per head) offline, at model-load time. At decode time you never materialize the full keys at all:
score = (c_t^Q)^T · W^Q' · c_j^KV
The same absorption applies to values: W^UV can be merged into the output projection W^O. At inference time, the model never materializes full keys or values. It works directly with the compact 512-dim latent vectors, expanding only once when writing the output.
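A quick numerical check makes the key-side absorption concrete. The sketch below assumes small illustrative dimensions and random weights, and it stores the up-projections per head (an assumption about layout, not something the text specifies). It compares scores computed from materialized queries and keys against scores computed from the latents through one fused matrix per head.

```python
import numpy as np

rng = np.random.default_rng(0)
d_cq, d_c, n_h, d_h, T = 12, 8, 4, 16, 6   # small illustrative sizes

W_UQ = rng.standard_normal((n_h, d_h, d_cq))  # per-head query up-projection
W_UK = rng.standard_normal((n_h, d_h, d_c))   # per-head key up-projection

c_q = rng.standard_normal(d_cq)        # query latent for the current token
C_kv = rng.standard_normal((T, d_c))   # cached KV latents for T past tokens

# Naive path: materialize full queries and keys, then take dot products.
q = np.einsum("hdq,q->hd", W_UQ, c_q)       # (n_h, d_h)
k = np.einsum("hdc,tc->thd", W_UK, C_kv)    # (T, n_h, d_h)
scores_naive = np.einsum("hd,thd->ht", q, k)

# Absorbed path: fuse (W_UQ)^T W_UK once per head, e.g. at load time.
W_Q_abs = np.einsum("hdq,hdc->hqc", W_UQ, W_UK)   # (n_h, d_cq, d_c)
scores_absorbed = np.einsum("q,hqc,tc->ht", c_q, W_Q_abs, C_kv)

print(np.allclose(scores_naive, scores_absorbed))
```

The absorbed path touches only `c_q`, the fused weights, and the cached latents, so the `(T, n_h, d_h)` key tensor is never built.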
Why Positional Encoding Breaks the Absorption — and the Fix
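There's one critical obstacle to the absorption trick: Rotary Position Embeddings (RoPE). RoPE works by rotating the key and query vectors by an angle that depends on position. For standard attention, with R(p) the rotation for position p:
score = (R(t) · q_t)^T · (R(j) · k_j) = q_t^T · R(j − t) · k_j
If you try to apply RoPE to the compressed key k_j^C = W^UK · c_j^KV, the rotation sits between (W^UQ)^T and W^UK in the score computation:
score = (c_t^Q)^T · (W^UQ)^T · R(j − t) · W^UK · c_j^KV
Because R(j − t) changes with every query-key pair, (W^UQ)^T · R(j − t) · W^UK cannot be precomputed as one matrix, and the keys would have to be re-expanded at every decode step.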
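The obstacle can be checked numerically. The sketch below builds RoPE as an explicit block-diagonal rotation matrix (the standard pairwise form, assumed here with base 10000) and uses small random single-head weights. The helper names `rope_matrix` and `fused` are illustrative. It shows that the matrix sandwiched between the two latents depends on the position offset, so no single matrix can be fused ahead of time.

```python
import numpy as np

def rope_matrix(pos, dim, base=10000.0):
    """Block-diagonal RoPE rotation for one position (dim must be even)."""
    R = np.zeros((dim, dim))
    for i in range(dim // 2):
        theta = pos * base ** (-2 * i / dim)
        c, s = np.cos(theta), np.sin(theta)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

rng = np.random.default_rng(0)
d_h, d_cq, d_c = 8, 6, 4
W_UQ = rng.standard_normal((d_h, d_cq))
W_UK = rng.standard_normal((d_h, d_c))

def fused(t, j):
    """Matrix between the query latent at t and the KV latent at j with RoPE applied."""
    return W_UQ.T @ rope_matrix(t, d_h).T @ rope_matrix(j, d_h) @ W_UK

# The same relative offset gives the same matrix...
same_offset = np.allclose(fused(3, 5), fused(10, 12))
# ...but a different offset gives a different one, so there is no single
# position-independent matrix to precompute.
diff_offset = np.allclose(fused(3, 5), fused(3, 9))
print(same_offset, diff_offset)
```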
The fix: separate position from content
MLA decouples RoPE from the compressed keys entirely. Each attention head uses a two-part query and key:
- Content part: derived from c^KV via W^UK (no RoPE, so it can be absorbed)
- Position part: a separate small key k^R computed directly from h_t via W^KR and rotated with RoPE
q_{t,i} = [q_{t,i}^C ; q_{t,i}^R] ← content query ++ position query (per head)
k_{t,i} = [k_{t,i}^C ; k_t^R] ← content key ++ position key (shared across heads)
q_{t,i}^R = RoPE(W^QR · c_t^Q) ← 64-dim per head, RoPE-encoded
k_t^R = RoPE(W^KR · h_t) ← 64-dim, shared across all heads, CACHED
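The content up-projection W^UK is still absorbed into the query side (no RoPE, so no position coupling). The position key k^R is cached separately, and it's only 64 floats. The final dot product spans both parts:
q_{t,i}^T · k_{j,i} = (q_{t,i}^C)^T · k_{j,i}^C + (q_{t,i}^R)^T · k_j^R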
The position key k_t^R is shared across all 128 attention heads — it doesn't need a per-head copy. This is what keeps its cache cost low. Total cache per token: 512 (content latent) + 64 (position key) = 576 floats.
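Putting the two parts together, a decode step can score against a cache that holds only the content latent and the shared RoPE key. The sketch below is a toy under stated assumptions: dimensions are scaled down, weights are random, the `rope` helper uses the standard pairwise rotation with base 10000, and softmax and score scaling are omitted. It checks the split score against an explicit concatenated dot product.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs along the last axis by position-dependent angles."""
    half = x.shape[-1] // 2
    theta = pos * base ** (-np.arange(half) / half)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[..., 1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

rng = np.random.default_rng(0)
d, d_cq, d_c, n_h, d_h, d_r, T = 32, 12, 8, 4, 16, 6, 5

W_DKV = rng.standard_normal((d_c, d))
W_KR = rng.standard_normal((d_r, d))
W_UQ = rng.standard_normal((n_h, d_h, d_cq))
W_UK = rng.standard_normal((n_h, d_h, d_c))
W_QR = rng.standard_normal((n_h, d_r, d_cq))
W_abs = np.einsum("hdq,hdc->hqc", W_UQ, W_UK)  # absorbed content weights

# Per-token cache: one content latent plus one RoPE key shared by all heads.
H = rng.standard_normal((T, d))
cache_c = H @ W_DKV.T                                               # (T, d_c)
cache_kr = np.stack([rope(W_KR @ h, j) for j, h in enumerate(H)])   # (T, d_r)

# Decode step at position t = T.
c_q = rng.standard_normal(d_cq)
q_r = rope(np.einsum("hrq,q->hr", W_QR, c_q), T)                    # (n_h, d_r)
scores = np.einsum("q,hqc,tc->ht", c_q, W_abs, cache_c) + q_r @ cache_kr.T

# Reference: concatenate content and position parts explicitly.
q_full = np.concatenate([np.einsum("hdq,q->hd", W_UQ, c_q), q_r], axis=-1)
k_c = np.einsum("hdc,tc->thd", W_UK, cache_c)
k_r = np.broadcast_to(cache_kr[:, None, :], (T, n_h, d_r))
k_full = np.concatenate([k_c, k_r], axis=-1)
reference = np.einsum("hd,thd->ht", q_full, k_full)

print(np.allclose(scores, reference))
print("cached values per token:", cache_c.shape[1] + cache_kr.shape[1])
```

With the real sizes (d_c = 512 and a 64-dim RoPE key), the same layout gives the 576 cached values per token.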
MHA vs GQA vs MQA vs MLA: The Numbers
Let's make the savings concrete using DeepSeek-V2's actual hyperparameters: n_h = 128 heads, d_h = 128 dims/head, 60 layers. All values in floats per token; multiply by 2 for bfloat16 bytes.
The key surprise: MLA's 576-float cache is only 2.25× the size of MQA's (equivalent to GQA with 2.25 groups), yet DeepSeek-V2 reports better quality than MHA. Why? Because the latent vector is a richer representation than a single shared key: it is a learned compression of h_t from which any number of heads can reconstruct their own distinct keys and values.
At full 128K context
For a single batch entry across 60 layers at 128K tokens (bfloat16, 2 bytes/float):
| Method | Cache per layer per token | Total at 128K context | Reduction vs MHA |
|---|---|---|---|
| MHA | 32,768 floats | ~503 GB | 1× |
| GQA (g=8) | 2,048 floats | ~31 GB | 16× |
| MQA | 256 floats | ~3.9 GB | 128× |
| MLA | 576 floats | ~8.8 GB | 57× |
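The table can be reproduced with a few lines of arithmetic. This sketch assumes 128K means 128,000 tokens and 2 bytes per value (bfloat16). The constant and function names are illustrative.

```python
N_HEADS, HEAD_DIM, N_LAYERS = 128, 128, 60
CONTEXT_TOKENS = 128_000   # assumes "128K" means 128,000
BYTES_PER_VALUE = 2        # bfloat16

values_per_token_per_layer = {
    "MHA": 2 * N_HEADS * HEAD_DIM,   # a key and a value for every head
    "GQA (g=8)": 2 * 8 * HEAD_DIM,   # a key and a value for each of 8 groups
    "MQA": 2 * HEAD_DIM,             # one shared key and one shared value
    "MLA": 512 + 64,                 # content latent + shared RoPE key
}

def total_gb(values):
    """Cache size in GB for one sequence across all layers."""
    return values * N_LAYERS * CONTEXT_TOKENS * BYTES_PER_VALUE / 1e9

mha = values_per_token_per_layer["MHA"]
for name, values in values_per_token_per_layer.items():
    print(f"{name:10s} {values:6d} values  {total_gb(values):7.1f} GB  "
          f"{mha / values:5.1f}x smaller")
```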
DeepSeek-V2 reports a 93.3% KV cache reduction vs. DeepSeek 67B. That figure is not the same comparison as the 57× per-layer factor above. The baseline is a 95-layer model that uses GQA rather than full MHA, DeepSeek-V2 has only 60 layers with d_c = 4·d_h, and the reported deployment also quantizes the KV cache, so the two numbers measure different things.
Throughput impact
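DeepSeek-V2 reports up to 5.76× the generation throughput of DeepSeek 67B despite having 3.5× more total parameters. The smaller KV cache leaves memory for larger batches and cuts the data that must be read for every decoded token.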
Better Than MHA at a Fraction of the Cost
The payoff: DeepSeek-V2 (236B params, 21B activated) beats DeepSeek 67B on every benchmark below and is competitive with LLaMA-3 70B, Qwen1.5 72B, and Mixtral 8×22B, while costing 42.5% less to train per trillion tokens than DeepSeek 67B. The efficiency gains compound at serving time: a smaller KV cache means larger effective batch sizes, which means better GPU utilization during generation.
Full benchmark table
| Benchmark | DeepSeek 67B | LLaMA-3 70B | Mixtral 8×22B | Qwen1.5 72B | DeepSeek-V2 |
|---|---|---|---|---|---|
| MMLU 5-shot | 71.3 | 78.9 | 77.6 | 77.2 | 78.5 |
| BBH 3-shot | 68.7 | 81.0 | 78.9 | 59.9 | 78.9 |
| DROP F1 3-shot | 69.7 | 82.5 | 80.4 | 71.5 | 80.1 |
| GSM8K 8-shot | 63.4 | 83.0 | 80.3 | 77.9 | 79.2 |
| MATH 4-shot | 18.7 | 42.2 | 42.5 | 41.4 | 43.6 |
| HumanEval 0-shot | 45.1 | 48.2 | 53.1 | 43.9 | 48.8 |
| CMMLU 5-shot | 70.8 | 69.3 | 60.0 | 84.3 | 84.0 |
DeepSeek-V2 uses only 21B activated parameters per token (MoE) — its full 236B parameter count is not activated simultaneously. Direct parameter comparison with dense models understates the efficiency advantage.