Sparse Attention Long Context KV Cache Interactive

Full Attention Strikes Back:
9.36× Faster in 600 Steps

Full-attention LLMs are already intrinsically sparse — they just don't know it yet. RTPurbo exploits this by identifying the 15% of attention heads that do real long-range retrieval, routing them through a 16-dimensional indexer, and using dynamic top-p to adapt token budgets per query. The result: near-lossless accuracy with a 9.36× prefill speedup at 1M context, needing only ~600 training steps to get there.

June 1, 2026 ~14 min read Paper: arXiv:2605.16928
01 — The Problem

Long Context Is Expensive — and the Alternatives Are Messy

As context windows grow to 128K, 512K, even 1M tokens, the quadratic cost of full attention becomes the primary bottleneck for LLM inference. The natural response is sparse attention: instead of attending to all tokens, pick a small subset and attend only to those.

Two broad families of approaches exist. Native sparse models (e.g. Kimi Delta Attention, DeepSeek Sparse Attention) retrain from scratch with sparsity built into the architecture — accurate but expensive. Post-hoc heuristics (SnapKV, MInference, FlexPrefill) identify important tokens at inference time with no training — cheap but lossy, especially on multi-hop reasoning and dispersed-evidence tasks.

RTPurbo challenges the premise that you need to choose. Its key claim: full-attention models already encode a rich sparsity structure internally. You don't need to train sparsity in — you need to find it and expose it.

sparse-attention-tradeoffs.svg the design space before RTPurbo
Training Cost → Accuracy → low cost, low accuracy high cost, high accuracy Post-hoc heuristics SnapKV · MInference FlexPrefill · Quest Native sparse Kimi Delta · DeepSeek SA (retrain from scratch) Full Attn RTPurbo ~600 steps near-lossless accuracy

Three specific challenges block any post-hoc approach from reaching native-sparse quality: identifying which heads actually need long context, efficiently selecting which tokens those heads should attend to, and deciding how many tokens — a number that changes with every query.

02 — Head Specialization

85% of Heads Only Care About Local Context

The first insight is structural. Attention heads in pretrained LLMs are not interchangeable — they specialize. Most heads process local information: the last few hundred tokens, syntax patterns, short-range dependencies. A small subset — roughly 15% — are retrieval heads that actively seek out semantically related tokens no matter where they appeared in the document.

This head specialization is stable and input-agnostic. It can be measured offline with a single calibration sequence: insert an identical "needle" span at both the start and end of a long document, then measure how much attention mass each head directs from the later needle back to the earlier one.

head-types-canvas click a head to see its attention pattern
Retrieval head (15%) — attends far Local head (85%) — attends nearby

For retrieval heads, RTPurbo keeps the full KV cache — every past token is available for sparse selection. For local heads, it applies a simple sliding window (8192 tokens) plus attention sinks (4 tokens), which is all these heads need and costs a fraction of full attention. This head-wise design means the system's sparsity budget is spent exactly where it matters.

Retrieval score per head (offline calibration)
R_h = (1/|Npost|) · Σt∈Npost Σj∈Npre A_h(t,j) ← attention mass from later needle → earlier needle
Heads with high R_h → retrieval set Hret (top 15%)  |  rest → local set Hloc (sliding window)
03 — Low-Dimensional Indexer

Long-Range Retrieval Lives in 16 Dimensions

Even knowing which heads do retrieval, computing full-dimensional attention scores to decide which tokens to attend to is still expensive — you'd need to score all N tokens just to pick the top subset. RTPurbo's second insight cuts this cost dramatically.

The key is the mathematics of RoPE (Rotary Position Embedding). When a query at position m scores a key at position n, the score decomposes into a sum over rotary frequency bands:

RoPE attention score decomposition
s(m,n) = Σi=1..D [a_i cos(θ_i·Δ) + b_i sin(θ_i·Δ)] where Δ = m−n (relative offset)
High-freq bands (large θ_i): vary rapidly with Δ → noisy at long range
Low-freq bands (small θ_i): vary smoothly → preserve retrieval signal

High-frequency components oscillate rapidly with distance and become unreliable at long range — they actively hurt retrieval. Low-frequency components vary smoothly and carry the semantic signal. This means you can estimate long-range token relevance using only the low-frequency dimensions of pre-RoPE representations — a much smaller space.

indexer-dim-canvas recall vs indexer dimensionality — drag or click

In practice, a learned linear projection WQ, WK ∈ ℝr×d with r = 16 dimensions achieves over 90% recall of the tokens that matter — using 8× fewer dimensions than the full 128-dimensional key. Token scoring via this projector runs in 1/8th the FLOPs of full-dimensional scoring, making it a practical routing mechanism.

Low-dim pre-RoPE token scoring
ŝ_h(m,n) = (WhQ · qprem,h) (WhK · kpren,h) r=16 ≪ d=128 — 8× cheaper scoring
This selects which tokens to attend to. Actual attention still uses full d=128 dimensions with RoPE.
04 — Dynamic Top-p

Top-k Is Broken: Every Query Needs a Different Budget

Knowing which tokens are most relevant still leaves open the question: how many tokens should we keep? The natural answer — pick a fixed top-k — turns out to be fundamentally wrong.

Different queries induce radically different attention distributions on retrieval heads. Consider two extremes: a "Galápagos" query in a 35K-token passage about Pacific islands requires broad diffuse retrieval — you need ~8,504 tokens to capture 90% of the attention mass, and top-4096 only gets you 77.6%. A needle-in-a-haystack query for a hidden password needs just 2 tokens for 96.6% recall — top-4096 wastes nearly 4094 unnecessary reads.

attention-distribution-canvas click to switch query type — see how top-k fares

Top-p solves this cleanly: select the minimal set of tokens whose cumulative attention mass reaches threshold p = 0.9. For diffuse queries this gives you ~8,504 tokens. For concentrated queries it gives you 2. The budget adapts automatically — no tuning required.

Dynamic top-p selection rule
S_h(m) = Top-P(ŝ_h(m, ·), p = 0.9) ← smallest set s.t. Σ softmax(ŝ) ≥ 0.9
Actual attention: O_h(m) = Σ_{n∈S_h} softmax(q·k/√d) · v_n  using full-dimensional q,k,v with RoPE — exact for selected tokens

The Numbers Behind the Claim

Method Tokens used Recall (attn mass) Waste
top-2k2,04864.2%under-retrieves
top-4k4,09677.6%under-retrieves
top-16k16,38493.8%~8k wasted tokens
top-p (0.9)8,50490.0%query-adaptive ✓

Numbers for the "Galápagos" query (35K context). Top-16k computes ~8k extra tokens vs top-p but gains only 3.8% more recall.

05 — Training Pipeline

Two Stages, 600 Steps, Minimal Surgery

Once the head partition is done offline (one calibration pass), RTPurbo needs two lightweight training stages to restore full accuracy under sparsity. The total cost: about 600 gradient steps on text with average length 48K tokens — roughly 1M label tokens. This is orders of magnitude cheaper than native sparse pretraining.

1
Offline Head Calibration
Run one long document with a "needle" at start and end. Measure per-head attention mass from the later needle to the earlier one. Partition heads into Hret (top 15%) and Hloc. Done once, never repeated.
R_h = mean(A_h[t∈Npost, j∈Npre]) # single calibration pass
2
Stage 1: Projection Training (backbone frozen)
Keep the LLM frozen. For each retrieval head, train a 16-dim projection WQ, WK ∈ ℝ16×128 to approximate the full attention distribution. Minimize KL divergence between projected-score attention and exact attention.
L_proj = Σ_h KL(a_full_h ∥ a_proj_h(W_h^Q, W_h^K))
3
Stage 2: Self-Distillation (~600 steps)
Switch to sparse attention mode. Run the sparse student alongside the dense teacher. Align top-10 logits via KL — no dataset-specific labels, bypassing data-mixture tuning. The sparse model learns to match the dense model's predictions exactly.
L_distill = KL(softmax(z_sparse_top10) ∥ softmax(z_dense_top10))
4
Inference: Sort-Free Top-p Kernel
A custom GPU kernel uses a 256-bin histogram (1 KB per head) to threshold scores without sorting (O(N) vs O(N log N)). Block-sparse D128 attention runs on selected tokens. Two kernel launches replace one FlashAttention-2 pass.
histogram[bin] += ℓ_b # atomic, 1KB/head, no sort needed
Step 1 of 4

The self-distillation approach is particularly elegant: by aligning to the dense teacher's top-10 logits rather than a labeled dataset, the training avoids the fragile data-mixture ablations that typically consume weeks of experimentation. And because the backbone stays frozen in Stage 1, the projection weights are the only parameters being learned — tiny 16×128 matrices per retrieval head.

06 — Results

Near-Lossless at 9.36× Prefill Speedup

RTPurbo is evaluated on two benchmark categories: long-context (LongBench, RULER) using Qwen3-Coder-30B-A3B, and reasoning (AIME24, AIME25, MMLU-PRO) using Qwen3-30B-A3B-Think. The central claim — near-lossless with dynamic top-p — holds across both.

RULER Benchmark (64K context)

Full Attention
86.23
RazorAttn
85.11
FlexPrefill
77.77
MInference
65.61
Quest
70.60
RTPurbo (top-k)
70.53
RTPurbo (top-p)
85.49

RULER 64K. RTPurbo top-p matches RazorAttn (+0.38) while delivering 9.36× prefill speedup. The top-k ablation drops 15 points, confirming the necessity of dynamic thresholding.

Reasoning: AIME Benchmarks

Reasoning tasks stress the decode phase: prompts are short (<300 tokens) but reasoning traces can exceed 32K tokens. RTPurbo with dynamic top-p perfectly matches full attention at 86.67% on both AIME24 and AIME25, while Quest drops to 46.67% and SnapKV to 43.33–46.67%.

Prefill Speedup vs Context Length

speedup-canvas prefill speedup over FlashAttention-2 at each context length

Ultra-Long: Staying Accurate Past 512K

At extreme context lengths (128K–512K), MInference and FlexPrefill collapse: accuracy on multi-hop tasks drops from ~90% at 128K to near zero at 512K. RTPurbo sustains accuracy above 80% at 512K while pushing compute sparsity above 97% — dynamic top-p selects fewer and fewer tokens as context grows, yet the selected tokens carry nearly all the attention mass.

ContextTaskCompute SparsityActive TokensAttn Mass
32Kniah-S78.7%468.8>0.95
32Kmulti-K77.8%2,462.1>0.96
64Kniah-S89.2%1,126.8>0.93
64Kmulti-K88.7%3,316.1>0.94

The 5× variance in active tokens between niah-S (469) and multi-K (2462) at 32K context demonstrates why static top-k is fundamentally broken for heterogeneous workloads.