Two Methods, One Impossible Choice
Speculative decoding works by having a cheap draft model propose multiple tokens, then letting the expensive target model verify them all in one parallel pass. The key insight: if the draft model proposes a run of tokens that the target would have chosen anyway, the target gets to advance by multiple positions per verification call — effectively multiplying its throughput.
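The accept-or-stop rule behind that verification pass can be sketched in a few lines. This is a minimal illustration of the greedy case only, assuming the target's choices at every drafted position are already available from the single parallel pass. The function name `accept_draft` is hypothetical, and sampling-based verification uses a probabilistic acceptance test that is not shown here.

```python
def accept_draft(draft_tokens, target_tokens):
    """Return the tokens committed in one draft-then-verify cycle (greedy case).

    draft_tokens:  the k tokens proposed by the draft model.
    target_tokens: the k + 1 tokens the target picks greedily at the same
                   positions, plus one for the position after the draft.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score one more position than the draft")

    accepted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            break  # first disagreement: everything after it is discarded
        accepted.append(drafted)

    # The target's own token at the stopping position is always kept
    # (a correction after a mismatch, or a bonus token after a full accept).
    accepted.append(target_tokens[len(accepted)])
    return accepted


# Two draft tokens match, the third does not: three tokens are committed.
print(accept_draft([5, 7, 9, 2], [5, 7, 1, 4, 8]))  # [5, 7, 1]
```

The length of the returned list is the per-cycle acceptance length, so every cycle advances by at least one token even when the whole draft is rejected.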
But there's a fundamental tension in how you build the draft model. The two main camps are:
- Autoregressive drafters (e.g. EAGLE-3): Generate tokens one at a time, each conditioned on all previous draft tokens. High quality — the draft mirrors the target's causal factorization — but sequential: drafting γ tokens costs γ forward passes.
- Parallel drafters (e.g. DFlash, DART): Generate the entire draft block in one non-autoregressive forward pass. Fast, but blind to intra-block causal dependencies — each position is conditioned only on the prefix, not on earlier draft tokens.
The numbers from Figure 1 of the paper tell the story clearly: on Qwen3-8B with a 16-token budget, EAGLE-3 achieves an acceptance length of τ = 4.86 but only a 3.28× speedup, because its sequential draft execution eats into the gains. DFlash reaches a 3.42× speedup because drafting is cheap, but τ drops to 4.03. Domino hits τ = 4.70 and a 3.84× speedup in that micro-comparison: it spends far less time drafting than EAGLE-3 while recovering most of EAGLE-3's causal quality advantage.
What Actually Determines Speedup
The speedup of speculative decoding over standard autoregressive decoding has a clean formula. Let τ be the expected number of tokens accepted per cycle (including the target model's bonus token), T_draft the drafting time per cycle, T_verify the verification time per cycle, and L_target the per-token latency without speculative decoding:
$$\text{Speedup} = \frac{\tau \cdot L_{\text{target}}}{T_{\text{draft}} + T_{\text{verify}}}$$
// two independent levers: increase τ (draft quality) OR decrease T_draft (draft efficiency)
This formula clarifies the trade-off precisely. T_verify is roughly fixed: it's one full target-model forward pass. So speedup is determined by two factors:
- Higher τ (acceptance length) means each expensive target call advances more tokens. EAGLE-3 excels here: τ ≈ 4.86 on GSM8K (16-token budget, Qwen3-8B).
- Lower T_draft means draft generation wastes less time. DFlash excels here: it costs roughly one draft-model forward + one LM-head call, regardless of block size γ.
For autoregressive drafters, T_draft ≈ γ · (t_net + t_head), which scales linearly with draft length. For parallel drafters, T_draft ≈ t_net^block + t_head^block, paid once for the whole block. Domino adds a tiny overhead: the Domino head costs 1.2ms on top of DFlash's 5.9ms draft time, a 2.8% increase in total draft-then-verify latency.
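To see how the two levers interact, the cost models above can be plugged into the speedup ratio τ · L_target / (T_draft + T_verify). The sketch below uses the latencies and acceptance lengths quoted in this article (5.9 ms DFlash draft, 1.2 ms Domino head, and the 35.5 ms verification pass mentioned later). Setting L_target equal to T_verify is an assumption made here for illustration, and the quoted figures may not all come from the same measurement setup.

```python
def draft_time_autoregressive(gamma, t_net, t_head):
    """Sequential drafter: one network pass plus one LM-head call per token."""
    return gamma * (t_net + t_head)


def draft_time_parallel(t_net_block, t_head_block):
    """Parallel drafter: one block-wide pass, paid once per cycle."""
    return t_net_block + t_head_block


def speedup(tau, t_draft, t_verify, l_target):
    """Tokens per cycle times baseline per-token latency, over cycle time."""
    return tau * l_target / (t_draft + t_verify)


T_VERIFY_MS = 35.5          # quoted target verification pass
L_TARGET_MS = T_VERIFY_MS   # ASSUMPTION: one plain decode step costs about one verify pass
DFLASH_DRAFT_MS = 5.9       # quoted DFlash draft time
DOMINO_HEAD_MS = 1.2        # quoted Domino head overhead

dflash = speedup(4.03, DFLASH_DRAFT_MS, T_VERIFY_MS, L_TARGET_MS)
domino = speedup(4.70, DFLASH_DRAFT_MS + DOMINO_HEAD_MS, T_VERIFY_MS, L_TARGET_MS)
print(f"DFlash ~{dflash:.2f}x, Domino ~{domino:.2f}x")  # DFlash ~3.46x, Domino ~3.92x
```

Under that assumption the estimates land close to the reported 3.42× and 3.84×, which suggests the small extra draft cost is more than repaid by the higher acceptance length.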
Parallel Backbone + Lightweight Causal Head
Domino has two components that compose cleanly:
Parallel Draft Backbone
Domino instantiates the backbone as DFlash, a block-diffusion style drafter that generates representations for all γ positions in one non-autoregressive forward pass. Given the last verified token x_t as an anchor, it constructs a masked input block [x_t, MASK, MASK, …] and runs it through a 5-layer transformer alongside target-model context features C_t. This produces hidden states H_t, …, H_{t+B-1} for the whole block at once. The frozen target LM head then converts each hidden state H_i to base logits L_i^base.
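The input layout is simple enough to write down directly. The helper below is a hypothetical sketch: `MASK_ID` is a placeholder value, not a real tokenizer ID, and the block is assumed to contain the anchor plus B - 1 masked slots, matching the B hidden states described above.

```python
MASK_ID = -1  # hypothetical placeholder for the drafter's mask token


def build_draft_block(anchor_token, block_size, mask_id=MASK_ID):
    """Return [x_t, MASK, MASK, ...] with block_size positions in total."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    return [anchor_token] + [mask_id] * (block_size - 1)


print(build_draft_block(anchor_token=1042, block_size=4))  # [1042, -1, -1, -1]
```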
Domino Head
The Domino head sits on top of the backbone and has two parts: a causal encoder (a GRU with hidden dimension 1024) that summarizes all previously sampled draft token embeddings, and a low-rank correction head (bottleneck dimension r=256) that maps [H_i; S_{i-1}] to a logit-space residual ΔL_i. The final logits are simply L_i^base + ΔL_i. The head adds only 56M parameters and needs no repeated full LM-head calls.
How the GRU Injects Causal Information
The central mechanism is surprisingly simple: a GRU reads the embeddings of sampled draft tokens one by one, maintaining a rolling causal state S_i. When predicting the draft token at position i, the causal state S_{i-1} captures everything about the preceding draft tokens, without any additional draft-model forward passes.
The key insight is that the GRU state S_i is dirt cheap to compute (it is just an RNN step), while the backbone's hidden state H_i and base logits are already available from the parallel forward pass. The entire correction loop runs in 1.2ms on an A100, compared to 35.5ms for a full target-model verification pass.
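The correction loop can be sketched end to end with toy dimensions. Everything below is an illustrative assumption rather than the paper's implementation: the weights are random, biases are omitted, the GRU update follows the common textbook convention, tokens are picked by argmax instead of sampling, and names such as `TinyGRUCell` and `domino_decode` are hypothetical.

```python
import math
import random


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TinyGRUCell:
    """Bias-free GRU cell with update (z), reset (r), and candidate (n) gates."""

    def __init__(self, rng, d_in, d_state):
        self.w = {g: rand_matrix(rng, d_state, d_in) for g in "zrn"}
        self.u = {g: rand_matrix(rng, d_state, d_state) for g in "zrn"}

    def step(self, x, h):
        wx = {g: matvec(self.w[g], x) for g in "zrn"}
        uh = {g: matvec(self.u[g], h) for g in "zrn"}
        z = [sigmoid(a + b) for a, b in zip(wx["z"], uh["z"])]
        r = [sigmoid(a + b) for a, b in zip(wx["r"], uh["r"])]
        n = [math.tanh(a + ri * b) for a, ri, b in zip(wx["n"], r, uh["n"])]
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def domino_decode(hidden, base_logits, embed, gru, down, up, d_state):
    """Correct each position's base logits using the state of earlier draft tokens."""
    state = [0.0] * d_state  # S before any draft token has been read
    tokens = []
    for h_i, base_i in zip(hidden, base_logits):
        bottleneck = matvec(down, h_i + state)  # list concat forms [H_i; S_{i-1}]
        delta = matvec(up, bottleneck)          # low-rank residual in logit space
        logits = [b + d for b, d in zip(base_i, delta)]
        token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(token)
        state = gru.step(embed[token], state)   # one cheap RNN step per token
    return tokens


rng = random.Random(0)
VOCAB, D_HID, D_EMB, D_STATE, RANK, BLOCK = 6, 4, 3, 5, 2, 4
embed = rand_matrix(rng, VOCAB, D_EMB, scale=1.0)
gru = TinyGRUCell(rng, D_EMB, D_STATE)
down = rand_matrix(rng, RANK, D_HID + D_STATE)
up = rand_matrix(rng, VOCAB, RANK)
hidden = rand_matrix(rng, BLOCK, D_HID, scale=1.0)       # stand-in for backbone H_i
base_logits = rand_matrix(rng, BLOCK, VOCAB, scale=1.0)  # stand-in for L_i^base
draft = domino_decode(hidden, base_logits, embed, gru, down, up, D_STATE)
print(draft)
```

Note that the loop never calls the backbone or the full LM head again: the only per-token work is the GRU step and two small matrix products through the rank-r bottleneck.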
Why Training Is Tricky (And How Domino Fixes It)
Training the causal correction branch introduces two failure modes that don't exist in standard parallel drafting. Domino addresses both with a two-part training strategy.
Problem 1: What prefix should the GRU see?
The GRU needs to read prefix draft tokens during training. The natural choice is self-generated prefixes (training-time testing, TTT) — sample from the model itself and train on those. EAGLE-3 uses this approach. But Domino instead uses teacher forcing: feed the GRU ground-truth token embeddings during training.
Teacher forcing works better for a subtle reason: the causal encoder's correction at position i only matters when all preceding draft tokens have been accepted by the target model. If position 1 is rejected, positions 2–4 are never reached. So training the GRU on clean ground-truth prefixes focuses learning exactly on the regime that matters — the accepted-prefix regime. Noisy self-generated prefixes create an input–output mapping that doesn't exist in the data distribution.
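Two small helpers make this argument concrete. The first builds teacher-forced training pairs, where each position's target is paired with the ground-truth tokens before it. The second estimates how often each position is reached at all, under the simplifying assumption (made here, not in the source) that acceptance is independent per position. The 0.8 acceptance rate is purely illustrative.

```python
def teacher_forced_examples(ground_truth_block):
    """Pair each position's target token with the ground-truth prefix before it."""
    return [
        (ground_truth_block[:i], ground_truth_block[i])
        for i in range(len(ground_truth_block))
    ]


def reach_probabilities(accept_probs):
    """P(position i is reached) = product of acceptance probabilities before i."""
    reach, prob = [], 1.0
    for accept in accept_probs:
        reach.append(prob)
        prob *= accept
    return reach


examples = teacher_forced_examples(["def", "add", "(", "a"])
print(examples[2])  # (['def', 'add'], '(')

reach = reach_probabilities([0.8, 0.8, 0.8, 0.8])
print([round(p, 3) for p in reach])  # [1.0, 0.8, 0.64, 0.512]
```

Every time a later position is reached, its prefix was accepted by the target, which is the case that clean ground-truth prefixes approximate during training.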
Problem 2: Backbone collapse
Teacher forcing introduces a new failure mode: since the correction branch receives clean, informative prefixes during training, it can shortcut the parallel backbone. The backbone's base logits become weak — all the predictive signal flows through the correction branch, which has no fallback if backbone representations degrade.
The fix is a base-anchored curriculum: jointly train on both base logits and final logits, with a loss weight λ_t that starts at 1.0 (pure backbone loss) and linearly anneals to 0.0 (pure final-logit loss):
$$\mathcal{L}_t = \lambda_t \, \mathcal{L}_{\text{base}} + (1 - \lambda_t) \, \mathcal{L}_{\text{final}}$$
// λ_t: 1.0 → 0.0 linearly over training; both terms are cross-entropy with exponential position decay
This forces the backbone to develop a strong base distribution early in training, before the Domino head takes over residual correction. Without it, backbone loss plateaus high; with it, both components learn complementary representations and backbone loss decreases steadily.
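The curriculum can be sketched as a weighted sum of two block losses. The exact form of the position decay is not given in this article, so the `exp(-i / decay)` weighting, the `decay=4.0` default, and the normalization by total weight are all assumptions made for illustration. The function names are hypothetical.

```python
import math


def lambda_schedule(step, total_steps):
    """Linear anneal from 1.0 (pure base loss) to 0.0 (pure final-logit loss)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - frac


def position_weights(block_size, decay):
    """ASSUMED decay form: later block positions contribute exponentially less."""
    return [math.exp(-i / decay) for i in range(block_size)]


def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def block_loss(block_logits, targets, weights):
    total = sum(w * cross_entropy(l, t) for l, t, w in zip(block_logits, targets, weights))
    return total / sum(weights)


def domino_loss(base_logits, final_logits, targets, step, total_steps, decay=4.0):
    """lambda_t * L_base + (1 - lambda_t) * L_final for one draft block."""
    lam = lambda_schedule(step, total_steps)
    weights = position_weights(len(targets), decay)
    base = block_loss(base_logits, targets, weights)
    final = block_loss(final_logits, targets, weights)
    return lam * base + (1.0 - lam) * final
```

At step 0 the Domino head receives no gradient from this objective, so the backbone must fit the targets on its own before the residual correction is allowed to carry any of the loss.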
Up to 7.92× Speedup on Qwen3
Domino is evaluated on Qwen3-4B and Qwen3-8B across 8 benchmarks spanning math, code, and dialogue. All models use a 16-token draft block. The comparison includes EAGLE-3 (autoregressive), DART and DFlash (parallel), and FR-Spec (vocabulary-efficient).
Figure: Average speedup per method over GSM8K, MATH-500, AIME25, HumanEval, MBPP, LiveCodeBench, MT-Bench, and Alpaca.
The headline numbers: on Qwen3-8B with greedy decoding, Domino improves over DFlash from 4.66× to 5.49× average speedup (17.8% relative gain). The peak is 7.92× on GSM8K, where acceptance length reaches τ = 10.03. With sampling at temperature T=1, average speedup goes from 3.96× (DFlash) to 4.46× (Domino).
Per-benchmark breakdown (Qwen3-8B, T=0)
| Method | GSM8K | MATH-500 | HumanEval | MBPP | MT-Bench | Avg (all 8 benchmarks) |
|---|---|---|---|---|---|---|
| EAGLE-3 (16) | 2.21× | 2.09× | 2.17× | 1.93× | 1.82× | 1.97× |
| DART (60) | 2.28× | 2.29× | 2.52× | 2.39× | 2.27× | 2.29× |
| DFlash (16) | 5.21× | 6.18× | 5.21× | 4.71× | 2.73× | 4.66× |
| Domino (16) | 7.92× | 7.38× | 5.89× | 5.53× | 3.29× | 5.49× |
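Relative to DFlash, the largest gain by far is on GSM8K (+52%). The other benchmarks shown gain between 13% and about 20%: MATH-500 +19%, HumanEval +13%, MBPP +17%, and MT-Bench about +20%. The GSM8K result fits the hypothesis that structured reasoning produces more predictable token sequences, so causal correction, which learns to exploit those patterns, has more room to improve acceptance length. In this table, however, the gain on open-ended dialogue (MT-Bench) is comparable to the gains on code, so the pattern is not uniform across task types.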
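The relative gains of Domino over DFlash can be recomputed directly from the table rows. The snippet below only restates the table's numbers. The helper name `relative_gain_percent` is hypothetical.

```python
# (DFlash speedup, Domino speedup) per column of the Qwen3-8B, T=0 table.
SPEEDUPS = {
    "GSM8K": (5.21, 7.92),
    "MATH-500": (6.18, 7.38),
    "HumanEval": (5.21, 5.89),
    "MBPP": (4.71, 5.53),
    "MT-Bench": (2.73, 3.29),
    "Avg": (4.66, 5.49),
}


def relative_gain_percent(baseline, improved):
    """Percentage improvement of `improved` over `baseline`."""
    return (improved / baseline - 1.0) * 100.0


gains = {name: relative_gain_percent(*pair) for name, pair in SPEEDUPS.items()}
for name, gain in gains.items():
    print(f"{name:10s} +{gain:.1f}%")
```

This gives roughly +52.0% for GSM8K, +19.4% for MATH-500, +13.1% for HumanEval, +17.4% for MBPP, +20.5% for MT-Bench, and +17.8% for the average.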
High-concurrency serving (SGLang)
At serving scale (Table 2), Domino achieves up to 5.8× throughput on Qwen3-8B at concurrency=2 on GSM8K. The gains persist across all concurrency levels tested (2–32), confirming that Domino's improved draft quality translates to higher tokens-per-second in production serving environments, not just low-concurrency benchmarks.