Long Context Is Expensive — and the Alternatives Are Messy
As context windows grow to 128K, 512K, even 1M tokens, the quadratic cost of full attention becomes the primary bottleneck for LLM inference. The natural response is sparse attention: instead of attending to all tokens, pick a small subset and attend only to those.
Two broad families of approaches exist. Native sparse models (e.g. Kimi Delta Attention, DeepSeek Sparse Attention) retrain from scratch with sparsity built into the architecture — accurate but expensive. Post-hoc heuristics (SnapKV, MInference, FlexPrefill) identify important tokens at inference time with no training — cheap but lossy, especially on multi-hop reasoning and dispersed-evidence tasks.
RTPurbo challenges the premise that you need to choose. Its key claim: full-attention models already encode a rich sparsity structure internally. You don't need to train sparsity in — you need to find it and expose it.
Three specific challenges block any post-hoc approach from reaching native-sparse quality: identifying which heads actually need long context, efficiently selecting which tokens those heads should attend to, and deciding how many tokens to keep, a number that changes with every query.
85% of Heads Only Care About Local Context
The first insight is structural. Attention heads in pretrained LLMs are not interchangeable — they specialize. Most heads process local information: the last few hundred tokens, syntax patterns, short-range dependencies. A small subset — roughly 15% — are retrieval heads that actively seek out semantically related tokens no matter where they appeared in the document.
This head specialization is stable and input-agnostic. It can be measured offline with a single calibration sequence: insert an identical "needle" span at both the start and end of a long document, then measure how much attention mass each head directs from the later needle back to the earlier one.
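The calibration measurement described above can be sketched in a few lines. This is a minimal illustration, not RTPurbo's actual procedure: the function names, the averaging over needle queries, and the "keep the top 15% of heads" rule are assumptions made for the example.

```python
def retrieval_scores(attn, early, late):
    """Score each head by how much the late needle attends to the early one.

    attn[h][q][k]: softmax attention weights for head h (each row sums to 1).
    early, late:   (start, end) token ranges of the two identical needle spans.
    Returns one score per head: the attention mass that queries inside the
    late needle place on keys inside the early needle, averaged over queries
    (the averaging is an assumption of this sketch).
    """
    scores = []
    for head in attn:
        mass = [sum(head[q][early[0]:early[1]]) for q in range(late[0], late[1])]
        scores.append(sum(mass) / len(mass))
    return scores


def pick_retrieval_heads(scores, fraction=0.15):
    """Keep the top `fraction` of heads by score (hypothetical selection rule)."""
    n_keep = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    return sorted(ranked[:n_keep])


def _row(n, mass):
    """Hypothetical helper: expand {key_index: weight} into a dense row."""
    return [mass.get(k, 0.0) for k in range(n)]


# Toy check: 4 heads, 6 tokens, needle copies at tokens [0, 2) and [4, 6).
local = [_row(6, {q: 1.0}) for q in range(6)]  # attends only to itself
retrieval = [
    _row(6, {q: 1.0}) if q < 2 else _row(6, {0: 0.4, 1: 0.4, q: 0.2})
    for q in range(6)
]  # later tokens look back at the early needle
attn = [local, local, retrieval, local]

scores = retrieval_scores(attn, early=(0, 2), late=(4, 6))
heads = pick_retrieval_heads(scores)
```

In the toy example only the third head sends mass from the late needle back to the early one, so it is the single head selected.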
For retrieval heads, RTPurbo keeps the full KV cache — every past token is available for sparse selection. For local heads, it applies a simple sliding window (8192 tokens) plus attention sinks (4 tokens), which is all these heads need and costs a fraction of full attention. This head-wise design means the system's sparsity budget is spent exactly where it matters.
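To make the local-head pattern concrete, the sketch below lists which key positions a local head would see for a given query position. The window and sink sizes come from the text; the function name and the convention that the window includes the current token are assumptions.

```python
def local_head_keys(q_pos, window=8192, sinks=4):
    """Key positions visible to a local head at query position q_pos.

    Causal sliding window of `window` tokens ending at q_pos (inclusive, an
    assumption of this sketch), plus the first `sinks` tokens of the sequence
    whenever they have already slid out of the window.
    """
    start = max(0, q_pos - window + 1)
    sink_keys = range(0, min(sinks, start))  # skip sinks still inside the window
    return list(sink_keys) + list(range(start, q_pos + 1))


# Small example: window of 4 and 2 sinks, query at position 10.
small = local_head_keys(10, window=4, sinks=2)

# With the sizes from the text, a query deep in the context reads a fixed
# number of keys regardless of how long the context is.
deep = local_head_keys(100_000)
```

Once the query is past the first window, the per-query cost for a local head is constant at window + sinks keys, which is why these heads stop contributing to the quadratic term.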
Long-Range Retrieval Lives in 16 Dimensions
Even knowing which heads do retrieval, computing full-dimensional attention scores to decide which tokens to attend to is still expensive — you'd need to score all N tokens just to pick the top subset. RTPurbo's second insight cuts this cost dramatically.
The key is the mathematics of RoPE (Rotary Position Embedding). When a query at position m scores a key at position n, the score decomposes into a sum over rotary frequency bands, one term per rotation frequency θ_i, and each term depends on position only through the phase (m − n)θ_i.
High-frequency bands (large θ_i) oscillate rapidly with distance and become unreliable at long range, where they actively hurt retrieval. Low-frequency bands (small θ_i) vary smoothly and carry the semantic signal. This means you can estimate long-range token relevance using only the low-frequency dimensions of pre-RoPE representations, a much smaller space.
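For reference, the standard RoPE identity behind this argument can be written out as follows. The notation is an assumption of this sketch rather than the paper's: each adjacent pair of pre-RoPE coordinates is treated as one complex number, \(\tilde q_i = q_{2i} + \mathrm{j}\,q_{2i+1}\) (and likewise for \(\tilde k_i\)), \(R_m\) denotes the rotation applied at position \(m\), and \(b\) is the RoPE base (commonly 10000). Implementations that pair coordinates differently give the same result up to a permutation of dimensions.

```latex
\[
\langle R_m q,\; R_n k \rangle
  \;=\; \sum_{i=0}^{d/2-1}
  \operatorname{Re}\!\left[\,\tilde q_i\,\overline{\tilde k_i}\;
  e^{\,\mathrm{j}\,(m-n)\,\theta_i}\right],
\qquad \theta_i = b^{-2i/d}.
\]
```

Each term is a fixed, position-independent quantity \(\tilde q_i\,\overline{\tilde k_i}\) rotated by the phase \((m-n)\theta_i\). For large \(\theta_i\) the phase wraps around many times as \(m-n\) grows, so the term's sign is effectively arbitrary at long range. For small \(\theta_i\) the phase changes slowly, and while \((m-n)\theta_i\) stays small the term is close to \(\operatorname{Re}[\tilde q_i\,\overline{\tilde k_i}]\), which depends only on content.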
In practice, learned linear projections W_Q, W_K ∈ ℝ^{r×d} with r = 16 achieve over 90% recall of the tokens that matter, using 8× fewer dimensions than the full 128-dimensional key. Token scoring via this projector takes 1/8 of the FLOPs of full-dimensional scoring, making it a practical routing mechanism.
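The routing step can be illustrated with a toy version. In the sketch below the projection matrices are hand-written stand-ins for the learned W_Q and W_K, the function names are hypothetical, and projecting keys once so they can be reused across queries is an assumption of the example, not a detail confirmed by the source.

```python
def project(W, x):
    """Apply an r x d projection W (list of r rows) to a length-d vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def route_scores(q_low, keys_low):
    """Dot-product relevance of every low-dimensional key to one query."""
    return [sum(a * b for a, b in zip(q_low, k)) for k in keys_low]


def scoring_macs(n_tokens, dim):
    """Multiply-accumulates needed to score n_tokens keys of width dim."""
    return n_tokens * dim


# Toy setup: d = 4, r = 2. The stand-in projector keeps the first two
# coordinates; a learned projector would be a dense r x d matrix.
W_Q = W_K = [[1, 0, 0, 0], [0, 1, 0, 0]]
query = [1, 0, 5, 5]
keys = [[1, 0, -9, 9], [0, 1, 0, 0], [2, 0, 0, 0]]

keys_low = [project(W_K, k) for k in keys]  # projected once per key
scores = route_scores(project(W_Q, query), keys_low)
best = max(range(len(scores)), key=lambda i: scores[i])

# Cost of scoring a 64K-token cache at d = 128 versus r = 16.
ratio = scoring_macs(65_536, 128) // scoring_macs(65_536, 16)
```

The ratio of scoring costs is simply d / r, which is where the 8× figure in the text comes from; the projection itself adds a one-time cost per token that this sketch does not count.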
Top-k Is Broken: Every Query Needs a Different Budget
Knowing which tokens are most relevant still leaves open the question: how many tokens should we keep? The natural answer — pick a fixed top-k — turns out to be fundamentally wrong.
Different queries induce radically different attention distributions on retrieval heads. Consider two extremes. A "Galápagos" query in a 35K-token passage about Pacific islands requires broad, diffuse retrieval: you need ~8,504 tokens to capture 90% of the attention mass, and top-4096 only gets you 77.6%. A needle-in-a-haystack query for a hidden password needs just 2 tokens for 96.6% recall, so top-4096 spends 4,094 of its reads on tokens that contribute almost nothing.
Top-p solves this cleanly: select the minimal set of tokens whose cumulative attention mass reaches a threshold p (here p = 0.9). For diffuse queries this gives you ~8,504 tokens. For concentrated queries it gives you 2. The budget adapts automatically to each query, with a single global threshold in place of a hand-picked k.
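The selection rule itself is short. The sketch below is a plain reference version that sorts the weights for one query and accumulates them until the threshold is met; the function name is hypothetical, and a production kernel would work on estimated scores in batched form rather than Python lists.

```python
def top_p_select(weights, p=0.9):
    """Smallest set of token indices whose cumulative attention mass reaches p.

    weights: non-negative attention weights for one query, summing to 1.
    Returns (indices in descending weight order, mass captured).
    """
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += weights[i]
        if mass >= p:
            break
    return chosen, mass


# Concentrated query: two tokens hold almost all of the mass.
peaked = [0.01, 0.01, 0.01, 0.5, 0.47]
peaked_idx, peaked_mass = top_p_select(peaked)

# Diffuse query: mass is spread evenly, so every token is needed.
diffuse = [0.125] * 8
diffuse_idx, diffuse_mass = top_p_select(diffuse)
```

With the same threshold, the concentrated example keeps 2 of 5 tokens and the diffuse one keeps all 8, which is the adaptive behaviour a fixed k cannot reproduce.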
The Numbers Behind the Claim
| Method | Tokens used | Recall (attention mass) | Outcome |
|---|---|---|---|
| top-2k | 2,048 | 64.2% | under-retrieves |
| top-4k | 4,096 | 77.6% | under-retrieves |
| top-16k | 16,384 | 93.8% | ~8k wasted tokens |
| top-p (0.9) | 8,504 | 90.0% | query-adaptive ✓ |
Numbers for the "Galápagos" query (35K context). Top-16k reads ~8k more tokens than top-p but gains only 3.8 percentage points of recall.
Two Stages, 600 Steps, Minimal Surgery
Once the head partition is done offline (one calibration pass), RTPurbo needs two lightweight training stages to restore full accuracy under sparsity. The total cost: about 600 gradient steps on text with average length 48K tokens — roughly 1M label tokens. This is orders of magnitude cheaper than native sparse pretraining.
The self-distillation approach is particularly elegant: by aligning to the dense teacher's top-10 logits rather than a labeled dataset, the training avoids the fragile data-mixture ablations that typically consume weeks of experimentation. And because the backbone stays frozen in Stage 1, the projection weights are the only parameters being learned — tiny 16×128 matrices per retrieval head.
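One plausible reading of "aligning to the dense teacher's top-10 logits" is a KL divergence restricted to the teacher's ten highest-scoring vocabulary entries. The sketch below implements that reading for a single token position; the exact loss used by RTPurbo is not given in the text, so the renormalised softmax, the KL direction, and the function names are all assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def topk_distill_loss(teacher_logits, student_logits, k=10):
    """KL(teacher || student) over the teacher's top-k vocabulary entries.

    Both distributions are renormalised over those k entries only
    (an assumption of this sketch).
    """
    top = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]
    p = softmax([teacher_logits[i] for i in top])
    q = softmax([student_logits[i] for i in top])
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy vocabulary of 12 entries.
teacher = [float(i) for i in range(12)]
same = topk_distill_loss(teacher, teacher)
shifted = topk_distill_loss(teacher, [x + 3.0 for x in teacher])
reversed_loss = topk_distill_loss(teacher, teacher[::-1])
```

The loss is zero when the sparse student reproduces the dense teacher on the top-k entries, is unaffected by a constant shift of the student logits, and grows as the student's ranking diverges. Because the target comes from the dense model itself, no labeled dataset is involved.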
Near-Lossless at 9.36× Prefill Speedup
RTPurbo is evaluated on two benchmark categories: long-context (LongBench, RULER) using Qwen3-Coder-30B-A3B, and reasoning (AIME24, AIME25, MMLU-PRO) using Qwen3-30B-A3B-Think. The central claim — near-lossless with dynamic top-p — holds across both.
RULER Benchmark (64K context)
RULER 64K. RTPurbo top-p matches RazorAttn (+0.38) while delivering 9.36× prefill speedup. The top-k ablation drops 15 points, confirming the necessity of dynamic thresholding.
Reasoning: AIME Benchmarks
Reasoning tasks stress the decode phase: prompts are short (<300 tokens) but reasoning traces can exceed 32K tokens. RTPurbo with dynamic top-p perfectly matches full attention at 86.67% on both AIME24 and AIME25, while Quest drops to 46.67% and SnapKV to 43.33–46.67%.
[Figure: prefill speedup vs. context length]
Ultra-Long: Staying Accurate at 512K
At extreme context lengths (128K–512K), MInference and FlexPrefill collapse: accuracy on multi-hop tasks drops from ~90% at 128K to near zero at 512K. RTPurbo sustains accuracy above 80% at 512K while pushing compute sparsity above 97%: dynamic top-p selects a shrinking fraction of the context as it grows, yet the selected tokens carry nearly all the attention mass. The measurements below, taken at 32K and 64K, show the trend.
| Context | Task | Compute Sparsity | Active Tokens | Attn Mass |
|---|---|---|---|---|
| 32K | niah-S | 78.7% | 468.8 | >0.95 |
| 32K | multi-K | 77.8% | 2,462.1 | >0.96 |
| 64K | niah-S | 89.2% | 1,126.8 | >0.93 |
| 64K | multi-K | 88.7% | 3,316.1 | >0.94 |
The roughly 5× gap in active tokens between niah-S (469) and multi-K (2,462) at 32K context demonstrates why static top-k is fundamentally broken for heterogeneous workloads.