Speculative Decoding Scaling Laws: The 200× Rule

01 — The Problem

Expensive Blind Search

Speculative decoding gives you "free" throughput: a tiny draft model proposes tokens, the big target model verifies them all in parallel, and you pay roughly the wall-clock cost of one target pass but get 3–7 tokens out (the parallel pass does more FLOPs, but small-batch decoding is typically memory-bandwidth bound, so it takes about as long). The catch is that the speedup collapses if you pick the wrong draft size. Too small → low acceptance rate → target runs constantly anyway. Too large → draft itself becomes the bottleneck.

Current practice is expensive: pick a set of candidate draft sizes, train each one, run benchmarks, and repeat. For a 70B target this means training and evaluating multiple billion-parameter models — before deploying anything.

Status quo: empirical draft search

Guess a draft size

Pick N = 1B and train it on the same data as the 70B target. Costs ~0.5% of the target's budget.

train_draft(N=1B, data=D, target=70B)

Benchmark acceptance rate α

Run speculative decoding on an eval set. Measure how often the draft's tokens are accepted. Find α ≈ 0.70.

α = measure_acceptance(draft_1B, target_70B)

Try smaller — maybe the draft bottlenecks

250M would be 4× faster to run. Train it, benchmark it. Find α = 0.65 — worse acceptance, similar throughput.

train_draft(N=250M, data=D, target=70B)

Try larger — maybe you need more quality

3B might push α above 0.75. Train, benchmark. But now the draft overhead eats the throughput gain.

train_draft(N=3B, data=D, target=70B) # 😰

Pick the winner… and hope

After 3–5 training runs you pick the best. But every time the target model changes, you repeat from step 1.

deploy(best_draft) # until target is updated


SDSL short-circuits this loop. Given only the target model size M and the pre-training dataset size D, it predicts the throughput-optimal draft size N* before any draft training begins.

02 — Speculative Decoding 101

The α Parameter

Speculative decoding uses two models: a small draft model M_q and the large target model M_p. Each iteration, M_q proposes γ tokens autoregressively. M_p then scores all γ positions in a single parallel forward pass and either accepts or rejects each token. If token i is rejected, all subsequent tokens are discarded and a fresh token is sampled from an adjusted distribution.

Acceptance rule: token $x_i$ is accepted iff $r_i < \dfrac{p(x_i \mid x_{<i})}{q(x_i \mid x_{<i})}$, where $r_i \sim \mathrm{Uniform}(0, 1)$. A token is therefore always accepted when its target probability is at least its draft probability, and otherwise accepted with probability $p/q$.
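The accept/reject step can be sketched in a few lines. This is a minimal illustration of the standard speculative sampling procedure, not code from the paper: the function name `verify_draft`, the list-of-lists representation of the distributions, and the optional extra target row for a bonus token are all assumptions made for the example.

```python
import random

def verify_draft(p_rows, q_rows, draft_tokens, rng):
    """One verification step of speculative sampling (illustrative sketch).

    p_rows[i] / q_rows[i] are the target / draft distributions (lists of
    probabilities over the vocabulary) at drafted position i. If p_rows has
    one extra row, it is used to sample a bonus token when every draft
    token is accepted. Returns the tokens emitted this iteration.
    """
    out = []
    for i, x in enumerate(draft_tokens):
        p, q = p_rows[i], q_rows[i]
        # Accept x with probability min(1, p(x)/q(x)); q[x] > 0 because x was drawn from q.
        if rng.random() < p[x] / q[x]:
            out.append(x)
            continue
        # Rejected: drop the remaining draft tokens and resample this position
        # from the adjusted distribution, proportional to max(0, p - q).
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    if len(p_rows) > len(draft_tokens):
        # All draft tokens accepted: take one more token from the target.
        bonus = p_rows[len(draft_tokens)]
        out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out

rng = random.Random(0)
same = [[0.5, 0.5], [0.9, 0.1]]
print(verify_draft(same, same, [1, 0], rng))                # identical models: both drafts kept
print(verify_draft([[0.0, 1.0]], [[1.0, 0.0]], [0], rng))   # target forbids token 0: replaced by 1
```

When draft and target agree exactly, the ratio is 1 and nothing is ever rejected, which is the α = 1 limit; the second call shows the opposite extreme, where the first drafted token is rejected and replaced.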

The key scalar summary of a (draft, target) pair is α — the expected per-token acceptance rate averaged across all prefixes. High α means the draft is well-aligned with the target; low α means frequent rejections and target re-runs.

Figure: Throughput vs α

Throughput at optimal lookahead γ (Equation 4 in SDSL):

$$T = \frac{-\log \alpha}{2N \,(\alpha - 1)\, W_{-1}\!\left(-\alpha^{M/N - 1} / e\right)}$$

$W_{-1}$ is the lower branch of the Lambert W function; M = target size, N = draft size.

The throughput formula is non-trivial — it involves a Lambert W function and depends on the ratio M/N. But the crucial point is: everything is determined by α and the size ratio. If you can predict α before training, you can predict throughput, and therefore find the optimal N.
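To make the closed form concrete, the sketch below evaluates it and compares it with a brute-force search over integer lookahead γ. The per-iteration model used for the comparison is an assumption that happens to reproduce the quoted formula: expected tokens per iteration (1 − α^(γ+1))/(1 − α), and cost 2Nγ + 2M per iteration. The Lambert W solver works on log(−z) rather than z, a choice made here to avoid underflow for large M/N; all function names are hypothetical.

```python
import math

def lambert_w_minus1_from_log(log_neg_z):
    """Solve w*exp(w) = z on the lower branch (w < -1), given L = log(-z) < -1.

    Taking logs turns the equation into w + log(-w) = L, which avoids
    underflow when z is astronomically small (large M/N).
    """
    L = log_neg_z
    w = L - math.log(-L)          # asymptotic starting guess
    for _ in range(100):          # Newton iterations on f(w) = w + log(-w) - L
        step = (w + math.log(-w) - L) / (1.0 + 1.0 / w)
        w -= step
        if abs(step) < 1e-13 * abs(w):
            break
    return w

def optimal_throughput(alpha, M, N):
    """Closed-form throughput at the best real-valued lookahead (requires M > N)."""
    L = (M / N - 1.0) * math.log(alpha) - 1.0   # log of alpha^(M/N - 1) / e
    w = lambert_w_minus1_from_log(L)
    return -math.log(alpha) / (2.0 * N * (alpha - 1.0) * w)

def throughput(alpha, M, N, gamma):
    """Assumed cost model: expected tokens per iteration / cost per iteration."""
    expected_tokens = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    return expected_tokens / (2.0 * N * gamma + 2.0 * M)

alpha, M, N = 0.7, 70e9, 1e9
closed = optimal_throughput(alpha, M, N)
best_gamma = max(range(1, 200), key=lambda g: throughput(alpha, M, N, g))
print("best integer lookahead:", best_gamma)
print("brute force / closed form:", throughput(alpha, M, N, best_gamma) / closed)
print("speedup over target-only decoding:", closed * 2.0 * M)
```

The closed form is the maximum over real-valued γ, so the best integer lookahead can only match it from below; under the same cost model, target-only decoding has throughput 1/(2M), which is what the last line divides by.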

03 — The α Affine Law

α = Ax + By + C

The core empirical finding of SDSL is that α is almost entirely determined by the draft model's perplexity, and a simple affine function fits it remarkably well. The paper measures α for 13 target models (OPT, Qwen 1.5/2.5, LLaMA 3/3.1, Seed-OSS) paired with 9 draft models ranging from 125M to 3B parameters, evaluated on HellaSwag.

Affine α scaling law (Eq. 5 in SDSL):

$$\alpha = A \cdot x + B \cdot y + C$$

x = draft perplexity, y = target perplexity. Fitted: A = −0.0067, B = +0.013, C = +0.642 (R² = 0.60 overall).

The headline result: draft perplexity drives α; target perplexity barely matters. Per-target fits (draft PPL only) reach R² = 0.97–0.98. The B coefficient (+0.013) is actually larger in magnitude than |A| (0.0067), but its effect is inconsistent across target families, and the pooled fit (R² = 0.60) is far weaker than the per-target ones.

Figure: α vs draft perplexity (real data), one curve per target model

The scatter plots reveal the asymmetry clearly: as draft perplexity decreases from 30 → 12, α climbs monotonically from ~0.60 to ~0.72 for LLaMA3.1-70B. The same draft model gives nearly identical α regardless of whether the target is a 14B or 110B model — only the target's absolute perplexity shifts the curve slightly.
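As a quick sanity check on those numbers, the affine law can be evaluated directly. The sketch below uses the fitted coefficients quoted earlier; the function name and the target perplexity value are hypothetical, and the target value drops out of the difference anyway.

```python
A, B, C = -0.0067, 0.013, 0.642  # fitted coefficients of the affine law

def predicted_alpha(draft_ppl, target_ppl):
    """Affine acceptance-rate law: alpha = A*x + B*y + C."""
    return A * draft_ppl + B * target_ppl + C

# Improving the draft from perplexity 30 to 12 against a fixed target:
# the target term cancels, so the gain is A * (12 - 30) whatever y is.
target_ppl = 10.0  # hypothetical value; it does not affect the difference
gain = predicted_alpha(12.0, target_ppl) - predicted_alpha(30.0, target_ppl)
print(f"predicted gain in alpha: {gain:.4f}")
```

The predicted gain of about 0.12 is consistent with the rise from roughly 0.60 to roughly 0.72 read off the LLaMA3.1-70B curve.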

This discovery is what makes SDSL tractable. You don't need to train the draft to know α — you just need to know its perplexity, which you can predict from Chinchilla-style pre-training scaling laws.

04 — Throughput Sweet Spot

The Peak at N*

Plugging the α affine law and the pre-training scaling law (perplexity as a function of N, D) into the throughput formula gives throughput purely as a function of the training hyperparameters M, N, D. For any fixed target model M and training dataset D, throughput forms a unimodal curve over draft size N — peaking at an optimal N*.

The intuition: too-small drafts have high perplexity → low α → target runs constantly. Too-large drafts have low perplexity → high α, but the draft itself is expensive and dominates the per-step cost. The sweet spot balances both.
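The trade-off can be traced numerically by chaining the three ingredients: the perplexity fit quoted in the Practical Recipe at the end of this page, the affine α law, and the throughput formula. This is an illustrative sketch under stated assumptions (D = 1T tokens for both models, a 70B target, and a log-space Newton solver for the Lambert W branch); it is not guaranteed to reproduce the paper's exact N*.

```python
import math

A, B, C = -0.0067, 0.013, 0.642   # affine alpha law coefficients

def perplexity(n_params, n_tokens):
    """Pre-training perplexity fit quoted in the Practical Recipe."""
    return math.exp(1.817 + 482.0 / n_params**0.348 + 2085.0 / n_tokens**0.366)

def w_minus1(L):
    """Lower-branch Lambert W at z = -exp(L), L < -1, via Newton in log space."""
    w = L - math.log(-L)
    for _ in range(100):
        step = (w + math.log(-w) - L) / (1.0 + 1.0 / w)
        w -= step
        if abs(step) < 1e-13 * abs(w):
            break
    return w

def throughput(M, N, D):
    """Throughput at optimal lookahead as a function of (M, N, D) only."""
    alpha = A * perplexity(N, D) + B * perplexity(M, D) + C
    L = (M / N - 1.0) * math.log(alpha) - 1.0
    return -math.log(alpha) / (2.0 * N * (alpha - 1.0) * w_minus1(L))

M, D = 70e9, 1e12   # D = 1T tokens is an assumed dataset size
grid = [3e7 * (100.0 ** (i / 200)) for i in range(201)]   # 30M .. 3B, log-spaced
best_n = max(grid, key=lambda n: throughput(M, n, D))
for n in (3e7, 1e8, best_n, 1e9, 3e9):
    # Speedup is relative to target-only decoding, taken as 1 / (2M) here.
    print(f"N = {n / 1e6:7.0f}M  speedup = {throughput(M, n, D) * 2 * M:.2f}x")
```

Under these assumptions the curve rises, peaks at a draft size of a few hundred million parameters, and falls again, which is the unimodal shape described above.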

Figure: Throughput vs draft size N, one curve per target model; a star (★) marks the predicted optimal N*

Empirically, the predicted N* aligns very well with the measured throughput-optimal draft sizes across OPT, Qwen 1.5, Qwen 2.5, LLaMA 3/3.1, and Seed-OSS model families (see Table 5 in the paper). The optimal draft falls roughly in the 100–400M range for 13B–70B target models.

05 — The 200× Rule

N* Scales Linearly with M

Across all model families and dataset sizes, the throughput-optimal draft size follows a clean linear law:

SDSL optimal draft size (Eq. 11 in SDSL):

$$N^*(M) = \mu \cdot M + M_0, \qquad \mu = 2.71 \times 10^{-3}, \quad M_0 = 87.1 \times 10^{6}\ \text{params}$$

As $M \to \infty$, $N^*/M \to \mu = 1/369 \approx 0.27\%$, so the draft is ~370× smaller.

For realistic models, the M₀ offset term adds 87M parameters of "fixed overhead" — so for a 13B target the draft is about 107× smaller, and for a 70B target about 253× smaller. The paper describes this as approximately "two orders of magnitude" (200×).

Figure: N* vs target size M (the 200× line). Example: target M = 70B → N* ≈ 277M

Target model	M (params)	N* predicted	Ratio M/N*	N* empirical (OPT family)
OPT-13B	13B	122M	107×	117M
OPT-30B	30B	168M	179×	298M
LLaMA 3/3.1-70B	70B	277M	253×	410M
Qwen1.5-110B	110B	385M	286×	378M
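The predicted column of the table follows directly from the linear rule. A minimal sketch, using the µ and M₀ values quoted for Eq. 11 (the function and constant names are made up for the example):

```python
MU = 2.71e-3      # slope of the linear rule
M0 = 87.1e6       # offset, in parameters

def optimal_draft_size(target_params):
    """Linear rule N*(M) = mu * M + M0."""
    return MU * target_params + M0

for name, m in [("OPT-13B", 13e9), ("OPT-30B", 30e9),
                ("LLaMA 3/3.1-70B", 70e9), ("Qwen1.5-110B", 110e9)]:
    n_star = optimal_draft_size(m)
    print(f"{name:16s} N* = {n_star / 1e6:5.0f}M   M/N* = {m / n_star:4.0f}x")
```

Ratios computed from the unrounded N* can differ from the table by one in the last digit (about 106× rather than 107× for the 13B target), because the table's ratios correspond to dividing by the rounded N*.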

Dataset size has only a minor effect: training the draft on 100T tokens instead of 1T shifts N* by roughly 15–20%. This is captured by a small log(D) correction term with coefficient γ ≈ −0.0015 per log-token (a different quantity from the lookahead γ in Section 02). For practical purposes the simple linear rule is sufficient.

06 — Empirical Validation

Wall-Clock Latency Confirms N*

To validate the analytical predictions, the paper measures end-to-end inference latency for an OPT-13B target model paired with all available draft models from OPT, Qwen 1.5, and Qwen 2.5 families on a single A100 GPU. The predicted optimal for OPT-13B is N* ≈ 117M (OPT family).

Key result: across all three draft families, latency is minimized near N* and increases monotonically as the draft size deviates from the predicted optimum. The metric |N − N*| / M (normalized distance from optimum) predicts where the latency minimum is — even across heterogeneous draft families with different architectures and tokenizers.

TPOT (time per output token), OPT-13B target, A100; lower is better

OPT draft family (same architecture as target)

Draft model	Size relative to M	TPOT (s)
★ OPT-125M (closest to N* = 117M)	0.95%	0.0101
OPT-350M	2.69%	0.0140
OPT-1.3B	10%	0.0147
OPT-2.7B	20.8%	0.0175

Qwen2.5 draft family (cross-family)

Draft model	TPOT (s)
★ Qwen2.5-0.5B (closest listed draft to N*)	0.1349
Qwen2.5-1.5B	0.1675
Qwen2.5-3B (far from N*)	0.2113

Note: OPT and Qwen draft TPOT values differ because they use different tokenizers; the key comparison is within each family.
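The normalized-distance metric can be checked against the TPOT values listed above. In this sketch the draft sizes are the nominal sizes from the model names, and applying the same N* ≈ 117M to the Qwen2.5 drafts is an assumption made for illustration.

```python
N_STAR, M = 117e6, 13e9   # predicted optimum and target size from the text

# (nominal draft size in parameters, measured TPOT in seconds), as listed above.
tpot = {
    "OPT": {"OPT-125M": (125e6, 0.0101), "OPT-350M": (350e6, 0.0140),
            "OPT-1.3B": (1.3e9, 0.0147), "OPT-2.7B": (2.7e9, 0.0175)},
    "Qwen2.5": {"Qwen2.5-0.5B": (0.5e9, 0.1349), "Qwen2.5-1.5B": (1.5e9, 0.1675),
                "Qwen2.5-3B": (3e9, 0.2113)},
}

def normalized_distance(n):
    """|N - N*| / M, the distance-from-optimum metric."""
    return abs(n - N_STAR) / M

fastest = {}
for family, drafts in tpot.items():
    # Draft with the lowest measured TPOT in this family.
    fastest[family] = min(drafts, key=lambda name: drafts[name][1])
    for name, (n, seconds) in drafts.items():
        print(f"{name:13s} distance = {normalized_distance(n):.4f}  TPOT = {seconds:.4f}s")
print(fastest)
```

Within each family the drafts sorted by distance from N* come out in the same order as when sorted by TPOT, which is the monotonic relationship the validation describes.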

The measurements follow the prediction. Within the OPT family, TPOT is lowest for OPT-125M (0.0101 s), the draft closest to N* ≈ 117M, and rises monotonically with draft size up to OPT-2.7B (0.0175 s). The Qwen2.5 family shows the same ordering: Qwen2.5-0.5B (0.1349 s) beats Qwen2.5-1.5B (0.1675 s) and Qwen2.5-3B (0.2113 s). Larger drafts have lower perplexity and therefore higher α, but for a 13B target the extra draft cost per step outweighs the target reruns they save.

Practical Recipe

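Estimate your target model's perplexity from pre-training scaling laws: ppl = exp(1.817 + 482/N^0.348 + 2085/D^0.366), where N is the model's parameter count (M for the target) and D is the number of training tokens
Compute N* = 2.71×10⁻³ · M + 87.1×10⁶ and predict the draft's perplexity at that size with the same formula
Verify α ≥ 0.65 using the affine law: α = −0.0067·x + 0.013·y + 0.642 (x = draft perplexity, y = target perplexity)
If α is too low, train the draft on more data rather than growing the model (extra data is cheaper)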
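The recipe above can be sketched end to end as follows. The token counts (1T for the target, 1T and 4T for the draft) are assumed values chosen for illustration, and the function names are hypothetical; only the constants come from the formulas in the recipe.

```python
import math

def perplexity(n_params, n_tokens):
    """Pre-training perplexity fit from step 1."""
    return math.exp(1.817 + 482.0 / n_params**0.348 + 2085.0 / n_tokens**0.366)

def recipe(target_params, target_tokens, draft_tokens, alpha_min=0.65):
    y = perplexity(target_params, target_tokens)     # step 1: target perplexity
    n_star = 2.71e-3 * target_params + 87.1e6        # step 2: draft size N*
    x = perplexity(n_star, draft_tokens)             #         and its perplexity
    alpha = -0.0067 * x + 0.013 * y + 0.642          # step 3: affine law
    return {"n_star": n_star, "draft_ppl": x, "target_ppl": y,
            "alpha": alpha, "ok": alpha >= alpha_min}

base = recipe(70e9, 1e12, 1e12)
more_data = recipe(70e9, 1e12, 4e12)   # step 4: same draft size, more draft tokens
for label, r in (("1T draft tokens", base), ("4T draft tokens", more_data)):
    print(f"{label}: N* = {r['n_star'] / 1e6:.0f}M, draft ppl = {r['draft_ppl']:.2f}, "
          f"alpha = {r['alpha']:.3f}, ok = {r['ok']}")
```

Because the draft's perplexity falls as its token count grows while N* stays fixed, the predicted α rises in the second call, which is the lever step 4 recommends pulling first.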