Expensive Blind Search
Speculative decoding gives you "free" throughput: a tiny draft model proposes tokens, the big target model verifies them all in parallel, and you pay roughly the latency of one target pass (not the same FLOPs, since every proposed position is still scored) but get 3–7 tokens out. The catch is that the speedup collapses if you pick the wrong draft size. Too small → low acceptance rate → target runs constantly anyway. Too large → draft itself becomes the bottleneck.
Current practice is expensive: pick a set of candidate draft sizes, train each one, run benchmarks, and repeat. For a 70B target this means training and evaluating multiple billion-parameter models — before deploying anything.
SDSL short-circuits this loop. Given only the target model size M and the pre-training dataset size D, it predicts the throughput-optimal draft size N* before any draft training begins.
The α Parameter
Speculative decoding uses two models: a small draft model Mq and the large target model Mp. Each iteration, Mq proposes γ tokens autoregressively. Mp then scores all γ positions in a single parallel forward pass and either accepts or rejects each token. If token i is rejected, all subsequent tokens are discarded and a fresh token is sampled from an adjusted distribution.
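The accept/reject step described above can be sketched as follows. This is a minimal illustration of the standard speculative sampling rule (accept with probability min(1, p/q), resample from the normalized residual max(0, p − q) on rejection), not code from the SDSL paper. The function name `verify_draft` and the array layout are assumptions made for the example, and the bonus token normally sampled when every draft token is accepted is left out for brevity.

```python
import numpy as np

def verify_draft(draft_tokens, q_probs, p_probs, rng):
    """Accept or reject drafted tokens against the target distribution.

    draft_tokens: list of gamma token ids proposed by the draft model Mq.
    q_probs: array (gamma, vocab), draft distribution at each position.
    p_probs: array (gamma, vocab), target distribution at each position.
    Returns the accepted prefix plus, on a rejection, one resampled token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        # Accept token i with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p_probs[i, tok] / q_probs[i, tok]):
            out.append(tok)
            continue
        # Rejected: drop all later draft tokens and resample this position
        # from the adjusted distribution proportional to max(0, p - q).
        residual = np.maximum(p_probs[i] - q_probs[i], 0.0)
        residual /= residual.sum()
        out.append(int(rng.choice(len(residual), p=residual)))
        break
    return out
```

When the draft and target distributions coincide, every token is accepted; the further apart they are, the earlier the loop tends to break, which is what a low α summarizes.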
The key scalar summary of a (draft, target) pair is α — the expected per-token acceptance rate averaged across all prefixes. High α means the draft is well-aligned with the target; low α means frequent rejections and target re-runs.
The throughput formula is non-trivial — it involves a Lambert W function and depends on the ratio M/N. But the crucial point is: everything is determined by α and the size ratio. If you can predict α before training, you can predict throughput, and therefore find the optimal N.
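The SDSL closed form is not reproduced here. As a rough stand-in, the sketch below uses the standard expression from the original speculative decoding analysis, which assumes each token is accepted independently with probability α. The cost ratio `c` (draft step time divided by target step time) and the search cap `max_gamma` are assumptions introduced for illustration.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens produced per iteration with i.i.d. acceptance rate alpha."""
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def speedup(alpha: float, gamma: int, c: float) -> float:
    """Wall-clock speedup over plain decoding.

    One iteration costs gamma draft steps (each c target-steps long)
    plus one target verification pass.
    """
    return expected_tokens(alpha, gamma) / (gamma * c + 1.0)

def best_gamma(alpha: float, c: float, max_gamma: int = 32) -> int:
    """Draft length that maximizes the speedup, found by brute force."""
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, c))

# A cheap draft (small c) supports longer speculation than an expensive one.
print(best_gamma(0.7, 0.01), best_gamma(0.7, 0.2))
```

Both levers from the text show up here: α sets how many drafted tokens survive, and the size ratio enters through `c`.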
α = Ax + By + C
The core empirical finding of SDSL is that α is almost entirely determined by the draft model's perplexity, and a simple affine function fits it remarkably well. The paper measures α for 13 target models (OPT, Qwen 1.5/2.5, LLaMA 3/3.1, Seed-OSS) paired with 9 draft models ranging from 125M to 3B parameters, evaluated on HellaSwag.
The headline result: draft perplexity drives α; target perplexity barely matters. Per-target fits (draft PPL only) reach R² = 0.97–0.98. In the joint fit, with x the draft perplexity and y the target perplexity, A = −0.0067 and B = +0.013. B is not smaller than |A| in magnitude, but the effect of the target term is inconsistent across target families.
The scatter plots reveal the asymmetry clearly: as draft perplexity decreases from 30 → 12, α climbs monotonically from ~0.60 to ~0.72 for LLaMA3.1-70B. The same draft model gives nearly identical α regardless of whether the target is a 14B or 110B model — only the target's absolute perplexity shifts the curve slightly.
This discovery is what makes SDSL tractable. You don't need to train the draft to know α — you just need to know its perplexity, which you can predict from Chinchilla-style pre-training scaling laws.
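As a quick sanity check of the affine law, the sketch below plugs in the coefficients quoted in the recipe at the end of this post (A = −0.0067, B = 0.013, C = 0.642), taking x as draft perplexity and y as target perplexity. The target perplexity value is a hypothetical placeholder; it cancels out when comparing two drafts against the same target.

```python
A, B, C = -0.0067, 0.013, 0.642  # coefficients quoted in the recipe

def acceptance_rate(draft_ppl: float, target_ppl: float) -> float:
    """Affine law: alpha = A * draft_ppl + B * target_ppl + C."""
    return A * draft_ppl + B * target_ppl + C

target_ppl = 10.0  # hypothetical value, cancels in the difference below
gain = acceptance_rate(12.0, target_ppl) - acceptance_rate(30.0, target_ppl)
print(f"alpha gain from draft PPL 30 -> 12: {gain:.3f}")  # 0.121
```

The gain of about 0.12 is in line with the climb from ~0.60 to ~0.72 reported for LLaMA3.1-70B.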
The Peak at N*
Plugging the α affine law and the pre-training scaling law (perplexity as a function of N, D) into the throughput formula gives throughput purely as a function of the training hyperparameters M, N, D. For any fixed target model M and training dataset D, throughput forms a unimodal curve over draft size N — peaking at an optimal N*.
The intuition: too-small drafts have high perplexity → low α → target runs constantly. Too-large drafts have low perplexity → high α, but the draft itself is expensive and dominates the per-step cost. The sweet spot balances both.
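The trade-off can be explored numerically with a rough sketch. It combines the scaling-law fit and affine law from the recipe at the end of this post (exponents read as powers) with the standard i.i.d.-acceptance speedup expression, not the paper's closed form. Two loud assumptions: the draft-to-target cost ratio is taken to be N/M, and draft and target are assumed to be trained on the same D tokens. The values of `M`, `D`, and the size grid are illustrative.

```python
import math

def pretrain_ppl(n_params, n_tokens):
    """Scaling-law perplexity fit from the recipe (exponents read as powers)."""
    return math.exp(1.817 + 482 / n_params**0.348 + 2085 / n_tokens**0.366)

def acceptance_rate(draft_ppl, target_ppl):
    """Affine law for alpha with the coefficients quoted in the recipe."""
    return -0.0067 * draft_ppl + 0.013 * target_ppl + 0.642

def speedup(alpha, gamma, c):
    """Standard i.i.d.-acceptance speedup (not SDSL's Lambert W form)."""
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

def best_speedup(n_draft, m_target, n_tokens, max_gamma=32):
    alpha = acceptance_rate(pretrain_ppl(n_draft, n_tokens),
                            pretrain_ppl(m_target, n_tokens))
    alpha = min(max(alpha, 0.0), 0.999)  # keep alpha in a valid range
    c = n_draft / m_target               # ASSUMPTION: cost scales with size
    return max(speedup(alpha, g, c) for g in range(1, max_gamma + 1))

M, D = 13e9, 1e12                                    # illustrative target and data size
sizes = [10e6 * 10 ** (k / 10) for k in range(31)]   # 10M .. 10B, log-spaced
curve = [best_speedup(n, M, D) for n in sizes]
peak = max(range(len(sizes)), key=curve.__getitem__)
print(f"peak near {sizes[peak] / 1e6:.0f}M params, speedup {curve[peak]:.2f}x")
```

Under these assumptions the curve rises, peaks at an interior draft size, and then falls as the draft's own cost takes over. The exact peak location depends on the assumed cost model.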
Empirically, the predicted N* aligns very well with the measured throughput-optimal draft sizes across OPT, Qwen 1.5, Qwen 2.5, LLaMA 3/3.1, and Seed-OSS model families (see Table 5 in the paper). The optimal draft is consistently in the 100–400M range for 13B–70B target models.
N* Scales Linearly with M
Across all model families and dataset sizes, the throughput-optimal draft size follows a clean linear law: N* ≈ 2.71×10⁻³ · M + M₀, with M₀ ≈ 87M parameters.
For realistic models, the M₀ offset term adds 87M parameters of "fixed overhead" — so for a 13B target the draft is about 107× smaller, and for a 70B target about 253× smaller. The paper describes this as approximately "two orders of magnitude" (200×).
| Target model | M (params) | N* predicted | Ratio M/N* | N* empirical (OPT family) |
|---|---|---|---|---|
| OPT-13B | 13B | 122M | 107× | 117M |
| OPT-30B | 30B | 168M | 179× | 298M |
| LLaMA 3/3.1-70B | 70B | 277M | 253× | 410M |
| Qwen1.5-110B | 110B | 385M | 286× | 378M |
Dataset size has only a minor effect: training the draft on 100T tokens instead of 1T shifts N* by roughly 15–20%. This is captured by a small log(D) correction term (coefficient γ ≈ −0.0015 per log-token; this γ is unrelated to the number of drafted tokens γ used earlier). For practical purposes the simple linear rule is sufficient.
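The "N* predicted" column in the table above follows directly from the simple linear rule, as this short sketch shows. It uses the slope 2.71×10⁻³ and the 87M offset quoted in this post and leaves out the log(D) correction, whose exact form is not given here. The function name is an illustrative choice.

```python
SLOPE = 2.71e-3   # draft parameters per target parameter
OFFSET = 87e6     # M0, the fixed offset in parameters

def optimal_draft_size(m_target: float) -> float:
    """Simple linear rule for the throughput-optimal draft size N*."""
    return SLOPE * m_target + OFFSET

targets = [("OPT-13B", 13e9), ("OPT-30B", 30e9),
           ("LLaMA 3/3.1-70B", 70e9), ("Qwen1.5-110B", 110e9)]
for name, m in targets:
    print(f"{name}: N* = {optimal_draft_size(m) / 1e6:.0f}M")
```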
Wall-Clock Latency Confirms N*
To validate the analytical predictions, the paper measures end-to-end inference latency for an OPT-13B target model paired with all available draft models from the OPT, Qwen 1.5, and Qwen 2.5 families on a single A100 GPU. The reference optimum for OPT-13B is N* ≈ 117M (OPT family), the empirical value in the table above; the simple linear rule gives 122M.
Key result: across all three draft families, latency is minimized near N* and increases monotonically as the draft size deviates from the predicted optimum. The metric |N − N*| / M (normalized distance from optimum) predicts where the latency minimum is — even across heterogeneous draft families with different architectures and tokenizers.
Note: OPT and Qwen draft TPOT values differ because they use different tokenizers; the key comparison is within each family.
The OPT-2.7B result looks counterintuitive — why is a larger draft model showing lower TPOT than OPT-125M? The answer is that on a modern A100 with batch=1, 125M and 350M models are memory-bandwidth bound and both run nearly as fast as 2.7B, but 2.7B's higher α means fewer target reruns, winning on total latency. For the Qwen family, Qwen2.5-1.5B has the best balance: small enough to be fast, large enough to achieve α ≈ 0.63 — exactly what the N* formula predicts.
Practical Recipe
- Estimate your target model's perplexity from pre-training scaling laws:
  ppl = exp(1.817 + 482/N^0.348 + 2085/D^0.366)
- Predict draft perplexity at N* = 2.71×10⁻³ · M + 87M
- Verify α ≥ 0.65 using the affine law: α = −0.0067·x + 0.013·y + 0.642, where x is the draft perplexity and y is the target perplexity
- If α is too low, train the draft on more data (not larger model — extra data is cheaper)
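The recipe can be strung together in a few lines. This is a sketch under stated assumptions, not the paper's tooling: the exponents in step 1 are read as powers (N^0.348, D^0.366), N and D are raw parameter and token counts, and the same scaling-law fit is applied to both draft and target. The function names and the example inputs (a 13B target trained on 1T tokens) are illustrative.

```python
import math

def pretrain_ppl(n_params, n_tokens):
    """Step 1: perplexity from the pre-training scaling-law fit."""
    return math.exp(1.817 + 482 / n_params**0.348 + 2085 / n_tokens**0.366)

def check_draft(m_target, target_tokens, draft_tokens, alpha_min=0.65):
    """Steps 2-3: size the draft, predict its alpha, compare to the threshold."""
    n_star = 2.71e-3 * m_target + 87e6           # linear rule for N*
    x = pretrain_ppl(n_star, draft_tokens)       # predicted draft perplexity
    y = pretrain_ppl(m_target, target_tokens)    # predicted target perplexity
    alpha = -0.0067 * x + 0.013 * y + 0.642      # affine law
    return n_star, alpha, alpha >= alpha_min

# Step 4: if alpha is too low, give the draft more data rather than more parameters.
for draft_tokens in (1e12, 1e13):
    n_star, alpha, ok = check_draft(13e9, 1e12, draft_tokens)
    print(f"draft tokens={draft_tokens:.0e}  N*={n_star / 1e6:.0f}M  "
          f"alpha={alpha:.3f}  meets threshold={ok}")
```

With these assumed inputs, α sits close to the 0.65 threshold, and giving the draft ten times more tokens nudges it upward without changing N*.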