RTPurbo: Full Attention Strikes Back

01 — The Problem

Long Context Is Expensive — and the Alternatives Are Messy

As context windows grow to 128K, 512K, even 1M tokens, the quadratic cost of full attention becomes the primary bottleneck for LLM inference. The natural response is sparse attention: instead of attending to all tokens, pick a small subset and attend only to those.

Two broad families of approaches exist. Native sparse models (e.g. Kimi Delta Attention, DeepSeek Sparse Attention) retrain from scratch with sparsity built into the architecture — accurate but expensive. Post-hoc heuristics (SnapKV, MInference, FlexPrefill) identify important tokens at inference time with no training — cheap but lossy, especially on multi-hop reasoning and dispersed-evidence tasks.

RTPurbo challenges the premise that you need to choose. Its key claim: full-attention models already encode a rich sparsity structure internally. You don't need to train sparsity in — you need to find it and expose it.

sparse-attention-tradeoffs.svg the design space before RTPurbo

Three specific challenges block any post-hoc approach from reaching native-sparse quality: identifying which heads actually need long context, efficiently selecting which tokens those heads should attend to, and deciding how many tokens — a number that changes with every query.

02 — Head Specialization

85% of Heads Only Care About Local Context

The first insight is structural. Attention heads in pretrained LLMs are not interchangeable — they specialize. Most heads process local information: the last few hundred tokens, syntax patterns, short-range dependencies. A small subset — roughly 15% — are retrieval heads that actively seek out semantically related tokens no matter where they appeared in the document.

This head specialization is stable and input-agnostic. It can be measured offline with a single calibration sequence: insert an identical "needle" span at both the start and end of a long document, then measure how much attention mass each head directs from the later needle back to the earlier one.

head-types-canvas click a head to see its attention pattern

Retrieval head (15%) — attends far Local head (85%) — attends nearby

For retrieval heads, RTPurbo keeps the full KV cache — every past token is available for sparse selection. For local heads, it applies a simple sliding window (8192 tokens) plus attention sinks (4 tokens), which is all these heads need and costs a fraction of full attention. This head-wise design means the system's sparsity budget is spent exactly where it matters.

Retrieval score per head (offline calibration) R_h = (1/|N post |) \cdot Σ t\inN post Σ j\inN pre A_h(t,j) \leftarrow attention mass from later needle \to earlier needle Heads with high R_h \to retrieval set H ret (top 15%) | rest \to local set H loc (sliding window)

03 — Low-Dimensional Indexer

Long-Range Retrieval Lives in 16 Dimensions

Even knowing which heads do retrieval, computing full-dimensional attention scores to decide which tokens to attend to is still expensive — you'd need to score all N tokens just to pick the top subset. RTPurbo's second insight cuts this cost dramatically.

The key is the mathematics of RoPE (Rotary Position Embedding). When a query at position m scores a key at position n, the score decomposes into a sum over rotary frequency bands:

RoPE attention score decomposition s(m,n) = Σ i=1..D [a_i cos(θ_i \cdotΔ) + b_i sin(θ_i \cdotΔ)] where Δ = m-n (relative offset) High-freq bands (large θ_i): vary rapidly with Δ \to noisy at long range Low-freq bands (small θ_i): vary smoothly \to preserve retrieval signal

High-frequency components oscillate rapidly with distance and become unreliable at long range — they actively hurt retrieval. Low-frequency components vary smoothly and carry the semantic signal. This means you can estimate long-range token relevance using only the low-frequency dimensions of pre-RoPE representations — a much smaller space.

indexer-dim-canvas recall vs indexer dimensionality — drag or click

In practice, a learned linear projection W^Q, W^K ∈ ℝ^r×d with r = 16 dimensions achieves over 90% recall of the tokens that matter — using 8× fewer dimensions than the full 128-dimensional key. Token scoring via this projector runs in 1/8th the FLOPs of full-dimensional scoring, making it a practical routing mechanism.

Low-dim pre-RoPE token scoring ŝ_h(m,n) = (W h Q \cdot q pre m,h) ⊤ (W h K \cdot k pre n,h) r=16 ≪ d=128 — 8\times cheaper scoring This selects which tokens to attend to. Actual attention still uses full d=128 dimensions with RoPE.

04 — Dynamic Top-p

Top-k Is Broken: Every Query Needs a Different Budget

Knowing which tokens are most relevant still leaves open the question: how many tokens should we keep? The natural answer — pick a fixed top-k — turns out to be fundamentally wrong.

Different queries induce radically different attention distributions on retrieval heads. Consider two extremes: a "Galápagos" query in a 35K-token passage about Pacific islands requires broad diffuse retrieval — you need ~8,504 tokens to capture 90% of the attention mass, and top-4096 only gets you 77.6%. A needle-in-a-haystack query for a hidden password needs just 2 tokens for 96.6% recall — top-4096 wastes nearly 4094 unnecessary reads.

attention-distribution-canvas click to switch query type — see how top-k fares

Top-p solves this cleanly: select the minimal set of tokens whose cumulative attention mass reaches threshold p = 0.9. For diffuse queries this gives you ~8,504 tokens. For concentrated queries it gives you 2. The budget adapts automatically — no tuning required.

Dynamic top-p selection rule S_h(m) = Top-P(ŝ_h(m, ·), p = 0.9) ← smallest set s.t. Σ softmax(ŝ) ≥ 0.9 Actual attention: O_h(m) = Σ_{n∈S_h} softmax(q·k/√d) · v_n  using full-dimensional q,k,v with RoPE — exact for selected tokens

The Numbers Behind the Claim

Method	Tokens used	Recall (attn mass)	Waste
top-2k	2,048	64.2%	under-retrieves
top-4k	4,096	77.6%	under-retrieves
top-16k	16,384	93.8%	~8k wasted tokens
top-p (0.9)	8,504	90.0%	query-adaptive ✓

Numbers for the "Galápagos" query (35K context). Top-16k computes ~8k extra tokens vs top-p but gains only 3.8% more recall.

05 — Training Pipeline

Two Stages, 600 Steps, Minimal Surgery

Once the head partition is done offline (one calibration pass), RTPurbo needs two lightweight training stages to restore full accuracy under sparsity. The total cost: about 600 gradient steps on text with average length 48K tokens — roughly 1M label tokens. This is orders of magnitude cheaper than native sparse pretraining.

Offline Head Calibration

Run one long document with a "needle" at start and end. Measure per-head attention mass from the later needle to the earlier one. Partition heads into H_ret (top 15%) and H_loc. Done once, never repeated.

R_h = mean(A_h[t∈Npost, j∈Npre]) # single calibration pass

Stage 1: Projection Training (backbone frozen)

Keep the LLM frozen. For each retrieval head, train a 16-dim projection W^Q, W^K ∈ ℝ^16×128 to approximate the full attention distribution. Minimize KL divergence between projected-score attention and exact attention.

L_proj = Σ_h KL(a_full_h ∥ a_proj_h(W_h^Q, W_h^K))

Stage 2: Self-Distillation (~600 steps)

Switch to sparse attention mode. Run the sparse student alongside the dense teacher. Align top-10 logits via KL — no dataset-specific labels, bypassing data-mixture tuning. The sparse model learns to match the dense model's predictions exactly.

L_distill = KL(softmax(z_sparse_top10) ∥ softmax(z_dense_top10))

Inference: Sort-Free Top-p Kernel

A custom GPU kernel uses a 256-bin histogram (1 KB per head) to threshold scores without sorting (O(N) vs O(N log N)). Block-sparse D128 attention runs on selected tokens. Two kernel launches replace one FlashAttention-2 pass.

histogram[bin] += ℓ_b # atomic, 1KB/head, no sort needed

Step 1 of 4

The self-distillation approach is particularly elegant: by aligning to the dense teacher's top-10 logits rather than a labeled dataset, the training avoids the fragile data-mixture ablations that typically consume weeks of experimentation. And because the backbone stays frozen in Stage 1, the projection weights are the only parameters being learned — tiny 16×128 matrices per retrieval head.

06 — Results

Near-Lossless at 9.36× Prefill Speedup

RTPurbo is evaluated on two benchmark categories: long-context (LongBench, RULER) using Qwen3-Coder-30B-A3B, and reasoning (AIME24, AIME25, MMLU-PRO) using Qwen3-30B-A3B-Think. The central claim — near-lossless with dynamic top-p — holds across both.

RULER Benchmark (64K context)

Full Attention

86.23

RazorAttn

85.11

FlexPrefill

77.77

MInference

65.61

Quest

70.60

RTPurbo (top-k)

70.53

RTPurbo (top-p)

85.49

RULER 64K. RTPurbo top-p matches RazorAttn (+0.38) while delivering 9.36× prefill speedup. The top-k ablation drops 15 points, confirming the necessity of dynamic thresholding.

Reasoning: AIME Benchmarks

Reasoning tasks stress the decode phase: prompts are short (<300 tokens) but reasoning traces can exceed 32K tokens. RTPurbo with dynamic top-p perfectly matches full attention at 86.67% on both AIME24 and AIME25, while Quest drops to 46.67% and SnapKV to 43.33–46.67%.

Prefill Speedup vs Context Length

speedup-canvas prefill speedup over FlashAttention-2 at each context length

Ultra-Long: Staying Accurate Past 512K

At extreme context lengths (128K–512K), MInference and FlexPrefill collapse: accuracy on multi-hop tasks drops from ~90% at 128K to near zero at 512K. RTPurbo sustains accuracy above 80% at 512K while pushing compute sparsity above 97% — dynamic top-p selects fewer and fewer tokens as context grows, yet the selected tokens carry nearly all the attention mass.

Context	Task	Compute Sparsity	Active Tokens	Attn Mass
32K	niah-S	78.7%	468.8	>0.95
32K	multi-K	77.8%	2,462.1	>0.96
64K	niah-S	89.2%	1,126.8	>0.93
64K	multi-K	88.7%	3,316.1	>0.94

The 5× variance in active tokens between niah-S (469) and multi-K (2462) at 32K context demonstrates why static top-k is fundamentally broken for heterogeneous workloads.