
Speculative Decoding Scaling Laws:
The 200× Rule

Before you train a single draft model, you can predict the optimal size: it should be roughly 200× smaller than your target. SDSL derives this analytically by connecting acceptance rate α to draft/target perplexity, then threading pre-training scaling laws through the throughput formula.

June 1, 2026 ~12 min read Paper: arXiv:2603.11053
01 — The Problem

Expensive Blind Search

Speculative decoding gives you "free" throughput: a tiny draft model proposes tokens, the big target model verifies them all in parallel, and you pay roughly the same FLOPs as one target pass but get 3–7 tokens out. The catch is that the speedup collapses if you pick the wrong draft size. Too small → low acceptance rate → target runs constantly anyway. Too large → draft itself becomes the bottleneck.

Current practice is expensive: pick a set of candidate draft sizes, train each one, run benchmarks, and repeat. For a 70B target this means training and evaluating multiple billion-parameter models — before deploying anything.

Status quo: empirical draft search

1. Guess a draft size
Pick N = 1B and train it on the same data as the 70B target. Costs ~0.5% of the target's budget.
train_draft(N=1B, data=D, target=70B)

2. Benchmark acceptance rate α
Run speculative decoding on an eval set. Measure how often the draft's tokens are accepted. Find α ≈ 0.70.
α = measure_acceptance(draft_1B, target_70B)

3. Try smaller: maybe the draft bottlenecks
250M would be 4× faster to run. Train it, benchmark it. Find α = 0.65: worse acceptance, similar throughput.
train_draft(N=250M, data=D, target=70B)

4. Try larger: maybe you need more quality
3B might push α above 0.75. Train, benchmark. But now the draft overhead eats the throughput gain.
train_draft(N=3B, data=D, target=70B) # 😰

5. Pick the winner… and hope
After 3–5 training runs you pick the best. But every time the target model changes, you repeat from step 1.
deploy(best_draft) # until target is updated

SDSL short-circuits this loop. Given only the target model size M and the pre-training dataset size D, it predicts the throughput-optimal draft size N* before any draft training begins.

02 — Speculative Decoding 101

The α Parameter

Speculative decoding uses two models: a small draft model M_q and the large target model M_p. Each iteration, M_q proposes γ tokens autoregressively. M_p then scores all γ positions in a single parallel forward pass and either accepts or rejects each token. If token i is rejected, all subsequent tokens are discarded and a fresh token is sampled from an adjusted distribution.

Acceptance rule: draw $r_i \sim \mathrm{Uniform}(0, 1)$ and accept token $x_i$ iff
$$r_i < \frac{p(x_i \mid x_{<i})}{q(x_i \mid x_{<i})}$$
→ always accepted when target probability ≥ draft probability, otherwise accepted with probability p/q

The key scalar summary of a (draft, target) pair is α — the expected per-token acceptance rate averaged across all prefixes. High α means the draft is well-aligned with the target; low α means frequent rejections and target re-runs.
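The accept/reject rule can be sketched in a few lines of Python. This is a toy illustration, not the paper's code: the function names, the dense probability lists, and the choice of adjusted distribution (the renormalised positive part of p − q, as in standard speculative sampling) are assumptions made here, and the bonus token normally sampled when every draft token is accepted is left out for brevity.

```python
import random

def speculative_step(p_rows, q_rows, draft_tokens, rng):
    """Verify draft tokens left to right; return the tokens emitted this iteration.

    p_rows[i][x] / q_rows[i][x]: target / draft probability of token x at
    position i (toy dense lists; a real system would read them from logits).
    """
    out = []
    for i, x in enumerate(draft_tokens):
        p, q = p_rows[i], q_rows[i]
        # Accept with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
            continue
        # Rejected: resample from max(0, p - q) (assumed adjusted distribution;
        # rng.choices normalises the weights) and drop all later draft tokens.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        break
    return out

def acceptance_rate(p, q):
    """Expected acceptance probability for one prefix: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))
```

Averaging `acceptance_rate` over many prefixes gives an estimate of α for a (draft, target) pair. With this rule the emitted token is distributed according to the target p, which is why speculative decoding does not change output quality.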

Throughput vs α
Throughput at optimal lookahead γ (Equation 4 in SDSL)
$$T = \frac{-\log \alpha}{2N\,(\alpha - 1)\,W_{-1}\!\left(-\alpha^{M/N - 1} / e\right)}$$
$W_{-1}$ = Lambert W function, lower branch. M = target size, N = draft size.

The throughput formula is non-trivial — it involves a Lambert W function and depends on the ratio M/N. But the crucial point is: everything is determined by α and the size ratio. If you can predict α before training, you can predict throughput, and therefore find the optimal N.
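To make the formula concrete, here is a small pure-Python sketch that evaluates it and checks it against a direct search over the lookahead γ. Several things are assumptions made here rather than details from the paper: the per-iteration cost model in `throughput` (γ draft passes at 2N FLOPs each plus one target pass at 2M FLOPs, with the usual expected-token count), the reading of the Lambert W argument as −α^(M/N−1)/e, the expression for the optimal γ, and all function names. Under that cost model, maximising over γ reproduces the closed form.

```python
import math

def lambert_w_lower_log(t):
    """Solve w + ln(-w) = t for w <= -1 (requires t <= -1).

    With t = ln(-z) this is W_{-1}(z); working with ln(-z) avoids underflow
    when z is extremely close to 0.
    """
    lo, hi = -2.0, -1.0
    while lo + math.log(-lo) > t:      # walk left until the root is bracketed
        lo *= 2.0
    for _ in range(200):               # bisection: w + ln(-w) increases on w < -1
        mid = 0.5 * (lo + hi)
        if mid + math.log(-mid) > t:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def optimal_throughput(alpha, M, N):
    """Closed-form throughput (tokens per FLOP) and the lookahead that attains it.

    Assumes 0 < alpha < 1 and M > N.
    """
    t = (M / N - 1.0) * math.log(alpha) - 1.0   # ln of alpha^(M/N - 1) / e
    w = lambert_w_lower_log(t)
    T = -math.log(alpha) / (2.0 * N * (alpha - 1.0) * w)
    gamma_star = math.log(-1.0 / w) / math.log(alpha) - 1.0
    return T, gamma_star

def throughput(alpha, M, N, gamma):
    """Assumed cost model: expected tokens / FLOPs for one draft-and-verify round."""
    expected_tokens = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    flops = 2.0 * N * gamma + 2.0 * M
    return expected_tokens / flops
```

For example, `optimal_throughput(0.7, 70e9, 280e6)` returns the peak throughput together with a continuous optimal lookahead; in practice γ would be rounded to an integer. If SciPy is available, `scipy.special.lambertw(z, k=-1)` computes the same branch directly.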

03 — The α Affine Law

α = Ax + By + C

The core empirical finding of SDSL is that α is almost entirely determined by the draft model's perplexity, and that a simple affine function fits it remarkably well. The paper measures α for 13 target models (OPT, Qwen 1.5/2.5, LLaMA 3/3.1, Seed-OSS), each paired with 9 draft models ranging from 125M to 3B parameters, evaluated on HellaSwag.

Affine α scaling law (Eq. 5 in SDSL)
α = A·x + B·y + C
x = draft perplexity,   y = target perplexity
Fitted: A = −0.0067,  B = +0.013,  C = +0.642   (R² = 0.60 overall)

The headline result: draft perplexity drives α; target perplexity barely matters. Per-target fits (draft PPL only) reach R² = 0.97–0.98, against R² = 0.60 for the pooled fit. The B coefficient (+0.013) is numerically larger than |A| (0.0067), but its effect is inconsistent across target families.

α vs draft perplexity (real data)

The scatter plots reveal the asymmetry clearly: as draft perplexity decreases from 30 → 12, α climbs monotonically from ~0.60 to ~0.72 for LLaMA3.1-70B. The same draft model gives nearly identical α regardless of whether the target is a 14B or 110B model — only the target's absolute perplexity shifts the curve slightly.
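As a quick consistency check (arithmetic on the quoted coefficient, not a figure from the paper), the fitted slope A alone accounts for the rise described above. Holding target perplexity fixed and moving draft perplexity from 30 to 12:

```latex
\Delta\alpha = A \,\Delta x = (-0.0067)\,(12 - 30) = +0.1206
```

That matches the climb of about 0.12, from ~0.60 to ~0.72, read off the LLaMA3.1-70B curve.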

This discovery is what makes SDSL tractable. You don't need to train the draft to know α — you just need to know its perplexity, which you can predict from Chinchilla-style pre-training scaling laws.

04 — Throughput Sweet Spot

The Peak at N*

Plugging the α affine law and the pre-training scaling law (perplexity as a function of N, D) into the throughput formula gives throughput purely as a function of the training hyperparameters M, N, D. For any fixed target model M and training dataset D, throughput forms a unimodal curve over draft size N — peaking at an optimal N*.

The intuition: too-small drafts have high perplexity → low α → target runs constantly. Too-large drafts have low perplexity → high α, but the draft itself is expensive and dominates the per-step cost. The sweet spot balances both.
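The trade-off can be reproduced numerically with a short sweep. The sketch below is only an illustration under stated assumptions: it plugs the perplexity fit from the recipe at the end of this article and the affine α law into a simple FLOP cost model (γ draft passes at 2N FLOPs plus one target pass at 2M FLOPs), applies the same perplexity fit to both draft and target, and searches integer γ by brute force instead of using the closed form. Function names, the 1T-token dataset in the usage note, and the search grid are choices made here, not the paper's.

```python
import math

A, B, C = -0.0067, 0.013, 0.642   # affine alpha law coefficients quoted above

def perplexity(n_params, n_tokens):
    # Pre-training scaling-law fit quoted in the recipe section of this article.
    return math.exp(1.817 + 482.0 / n_params**0.348 + 2085.0 / n_tokens**0.366)

def alpha(draft_n, target_m, tokens):
    return A * perplexity(draft_n, tokens) + B * perplexity(target_m, tokens) + C

def best_throughput(draft_n, target_m, tokens, max_gamma=128):
    """Tokens per FLOP, maximised over integer lookahead (assumed cost model)."""
    a = alpha(draft_n, target_m, tokens)
    return max(
        (1.0 - a ** (g + 1)) / ((1.0 - a) * (2.0 * draft_n * g + 2.0 * target_m))
        for g in range(1, max_gamma + 1)
    )

def sweep(target_m, tokens, lo=20e6, hi=10e9, points=200):
    """Evaluate log-spaced draft sizes and return (sizes, scores, best size)."""
    sizes = [lo * (hi / lo) ** (i / (points - 1)) for i in range(points)]
    scores = [best_throughput(n, target_m, tokens) for n in sizes]
    best = max(range(points), key=scores.__getitem__)
    return sizes, scores, sizes[best]
```

Calling `sweep(70e9, 1e12)` for a 70B target yields a curve that rises, peaks at a draft size of a few hundred million parameters, and then falls as the draft cost takes over. The exact peak depends on the assumed cost model, so treat it as qualitative.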

Throughput vs draft size N (★ marks predicted optimal N*)

Empirically, the predicted N* aligns very well with the measured throughput-optimal draft sizes across OPT, Qwen 1.5, Qwen 2.5, LLaMA 3/3.1, and Seed-OSS model families (see Table 5 in the paper). The optimal draft is consistently in the 100–400M range for 13B–70B target models.

05 — The 200× Rule

N* Scales Linearly with M

Across all model families and dataset sizes, the throughput-optimal draft size follows a clean linear law:

SDSL optimal draft size (Eq. 11 in SDSL)
N*(M) = µ · M + M₀
µ = 2.71 × 10⁻³,   M₀ = 87.1 × 10⁶ params
→ as M → ∞, N*/M → µ ≈ 1/369 ≈ 0.27%, so the draft is ~370× smaller

For realistic models, the M₀ offset term adds 87M parameters of "fixed overhead" — so for a 13B target the draft is about 107× smaller, and for a 70B target about 253× smaller. The paper describes this as approximately "two orders of magnitude" (200×).

N* vs target size M (the 200× line)

Target model      M (params)   N* predicted   Ratio M/N*   N* empirical (OPT family)
OPT-13B           13B          122M           107×         117M
OPT-30B           30B          168M           179×         298M
LLaMA 3/3.1-70B   70B          277M           253×         410M
Qwen1.5-110B      110B         385M           286×         378M
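The predicted column follows directly from the linear law. A minimal sketch using the quoted µ and M₀ (the function name and the nominal parameter counts are choices made here):

```python
MU = 2.71e-3    # slope of the linear law quoted above
M0 = 87.1e6     # offset, in parameters

def optimal_draft_size(target_params):
    """Predicted throughput-optimal draft size N* for a target of size M."""
    return MU * target_params + M0

targets = [("OPT-13B", 13e9), ("OPT-30B", 30e9),
           ("LLaMA-70B", 70e9), ("Qwen1.5-110B", 110e9)]
for name, m in targets:
    n_star = optimal_draft_size(m)
    print(f"{name:>13}: N* = {n_star / 1e6:6.1f}M, M/N* = {m / n_star:5.1f}x")
```

For the 70B target this gives N* of about 276.8M and a ratio of about 253×, in line with the table. The unrounded ratios for the smaller targets come out marginally lower than the table's (roughly 106× for 13B), which suggests the table divides by the rounded N*.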

Dataset size has only a minor effect: training the draft on 100T tokens instead of 1T shifts N* by roughly 15–20%. This is captured by a small log(D) correction term (coefficient γ ≈ −0.0015 per log-token, unrelated to the lookahead γ of Section 02). For practical purposes the simple linear rule is sufficient.

06 — Empirical Validation

Wall-Clock Latency Confirms N*

To validate the analytical predictions, the paper measures end-to-end inference latency for an OPT-13B target model paired with all available draft models from the OPT, Qwen 1.5, and Qwen 2.5 families on a single A100 GPU. The optimal draft size for OPT-13B within the OPT family is N* ≈ 117M; the linear rule above predicts 122M.

Key result: across all three draft families, latency is minimized near N* and increases monotonically as the draft size deviates from the predicted optimum. The metric |N − N*| / M (normalized distance from optimum) predicts where the latency minimum is — even across heterogeneous draft families with different architectures and tokenizers.

TPOT (time per output token), OPT-13B target, A100 (lower is better)

OPT draft family (same architecture as target)
★ OPT-125M (0.95% of M, ≈ N* = 117M)   0.0101s
OPT-350M (2.69% of M)                   0.0140s
OPT-1.3B (10% of M)                     0.0147s
OPT-2.7B                                0.0175s

Qwen2.5 draft family (cross-family)
★ Qwen2.5-0.5B (closest to N*)          0.1349s
Qwen2.5-1.5B                            0.1675s
Qwen2.5-3B (far from N*)                0.2113s

Note: OPT and Qwen draft TPOT values differ because they use different tokenizers; the key comparison is within each family.

The OPT results follow the prediction: OPT-125M, the draft closest to N* ≈ 117M, has the lowest TPOT (0.0101s), and TPOT rises with every step up in draft size, reaching 0.0175s at OPT-2.7B. A larger draft does raise α and so saves some target reruns, but on these numbers the extra draft cost outweighs that gain. The Qwen2.5 family shows the same ordering: Qwen2.5-0.5B is fastest at 0.1349s, Qwen2.5-1.5B reaches α ≈ 0.63 yet is slower at 0.1675s, and TPOT climbs to 0.2113s at Qwen2.5-3B.
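The normalized-distance metric is easy to check against the OPT-family numbers listed above. In this sketch the draft sizes are the nominal parameter counts implied by the model names, which is an assumption; the TPOT values are copied from the listing, and the variable names are illustrative.

```python
M = 13e9          # OPT-13B target
N_STAR = 117e6    # optimum quoted above for the OPT family

# (nominal draft size in parameters, measured TPOT in seconds)
opt_drafts = {
    "OPT-125M": (125e6, 0.0101),
    "OPT-350M": (350e6, 0.0140),
    "OPT-1.3B": (1.3e9, 0.0147),
    "OPT-2.7B": (2.7e9, 0.0175),
}

def normalized_distance(n, n_star=N_STAR, m=M):
    """|N - N*| / M: how far a draft is from the predicted optimum."""
    return abs(n - n_star) / m

# Rank the drafts by distance from N* and, separately, by measured latency.
by_distance = sorted(opt_drafts, key=lambda k: normalized_distance(opt_drafts[k][0]))
by_tpot = sorted(opt_drafts, key=lambda k: opt_drafts[k][1])
```

Both rankings put OPT-125M first and OPT-2.7B last, so for this family the draft closest to N* is also the fastest one measured.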

Practical Recipe

  • Estimate your target model's perplexity from pre-training scaling laws: $\mathrm{ppl} = \exp(1.817 + 482/N^{0.348} + 2085/D^{0.366})$
  • Predict draft perplexity at N* = 2.71×10⁻³ · M + 87M
  • Verify α ≥ 0.65 using the affine law: α = −0.0067·x + 0.013·y + 0.642
  • If α is too low, train the draft on more data (not larger model — extra data is cheaper)
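The recipe above can be strung together in a few lines. This is a sketch rather than the paper's tooling: it assumes the same perplexity fit applies to both draft and target, uses M₀ = 87.1M from the linear law, and the function names and the returned dictionary are choices made here.

```python
import math

def perplexity(n_params, n_tokens):
    # Scaling-law fit from the first bullet (N = parameters, D = training tokens).
    return math.exp(1.817 + 482.0 / n_params**0.348 + 2085.0 / n_tokens**0.366)

def recipe(target_params, train_tokens, alpha_floor=0.65):
    """Predict N*, both perplexities, and alpha for a given target and dataset."""
    n_star = 2.71e-3 * target_params + 87.1e6          # linear law for N*
    x = perplexity(n_star, train_tokens)                # predicted draft perplexity
    y = perplexity(target_params, train_tokens)         # predicted target perplexity
    alpha = -0.0067 * x + 0.013 * y + 0.642             # affine alpha law
    return {"n_star": n_star, "draft_ppl": x, "target_ppl": y,
            "alpha": alpha, "ok": alpha >= alpha_floor}
```

For a 70B target trained on 1T tokens, `recipe(70e9, 1e12)` predicts a draft of about 277M parameters with α just under 0.66, which clears the 0.65 floor. If `ok` comes back false, the last bullet applies: raise `train_tokens` before considering a larger draft.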