The friction between data scale and architectural agility
Speculative decoding speeds up inference by having a lightweight draft model propose future tokens, which the larger target model verifies in parallel. However, training high-performance draft models has a well-known bottleneck: the data gap. Draft models trained after pre-training typically see only small, secondary datasets for a limited number of steps, which leads to low draft accuracy (acceptance rates often below 50%) and modest real-world speedups.
Multi-Token Prediction (MTP) addresses this by training auxiliary heads directly during pre-training, giving them the benefit of trillions of tokens of diverse world knowledge. But MTP creates a new problem: architectural lock-in. Because a pre-training run can cost millions of dollars, the MTP architecture must be fixed before training starts. This pushes designers toward simple, safe architectures (like linear heads or shallow MLPs) and blocks experimentation with more advanced drafting techniques such as feature-level regression (EAGLE) or tree speculation (Medusa).
By pre-training with standard MTP heads and then converting those heads into specialized speculative architectures, we can merge the dense prior representation learned during pre-training with the architectural benefits of advanced speculative drafters. This allows us to escape the pre-training architectural lock-in.
How co-training shapes the hidden representations
During MTP co-training, the model is optimized for predicting multiple future tokens simultaneously. For instance, in DeepSeek-V3, at each token position, the model uses sequential auxiliary modules to predict the next D tokens. This changes the model's training dynamics in two crucial ways:
- Denser Gradient Signals: Instead of predicting only the single next token, the backbone receives gradients from D prediction tasks at every step, speeding up representation learning.
- Feature-Space Lookahead: The MTP representation extractors learn to map the base model's hidden state $h_t$ to subsequent states $h_{t+d}$. The layers do not just model token probabilities; they also learn the temporal trajectory of hidden representations.
In DeepSeek-V3's sequential MTP architecture, the hidden state $h_t$ is passed through $D$ extractor layers in sequence. At depth $d$, the extractor combines the previous depth's state $h_t^{d-1}$ with the embedding of the token at position $t+d$ (the ground-truth token during training, the drafted token at inference) to predict the token at position $t+d+1$. Because each depth conditions on the one before it, the causal chain is preserved, and it is this structured pathway that can be recycled.
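For reference, the sequential module can be written out as below. This restates the formulation in the DeepSeek-V3 technical report as we understand it, with notation adapted to this post. The symbols are our own labels: $x_{t+d}$ is the token at position $t+d$, $M_d$ is the depth-$d$ projection matrix, $\mathrm{TRM}_d$ is the depth-$d$ Transformer block, $T$ is the sequence length, $\lambda$ is the MTP loss weight, and $h_t^{0}$ is the main model's final hidden state. The embedding layer and output head are shared with the main model. Note the indexing: depth $d$ consumes the token at $t+d$ and predicts the token at $t+d+1$.

```latex
\begin{aligned}
\tilde{h}_t^{d} &= M_d\,\big[\,\mathrm{RMSNorm}(h_t^{d-1})\,;\,\mathrm{RMSNorm}(\mathrm{Emb}(x_{t+d}))\,\big] \\
h_{1:T-d}^{d} &= \mathrm{TRM}_d\big(\tilde{h}_{1:T-d}^{d}\big) \\
P\big(x_{t+d+1} \mid x_{\le t+d}\big) &= \mathrm{OutHead}\big(h_t^{d}\big) \\
\mathcal{L} &= \mathcal{L}_{\text{main}} + \frac{\lambda}{D}\sum_{d=1}^{D}\mathcal{L}_{\text{MTP}}^{d}
\end{aligned}
```

Each $\mathcal{L}_{\text{MTP}}^{d}$ is a cross-entropy loss over the depth-$d$ predictions, so the backbone receives one extra gradient signal per depth on top of the usual next-token loss.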
Why pre-trained MTP layers are the perfect seed
The core hypothesis is that **pre-trained MTP heads contain a rich representational prior**. Since they have been co-trained with the base model on trillions of tokens, they already encode syntax, vocabulary projections, and lookahead semantics. Instead of discarding these heads or serving them with their original locked architectures, we can map their weights to form the starting state of more advanced speculators.
We map: $W^{\text{mtp}}_k \rightarrow W^{\text{speculator}}_k$
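This initializes the speculator with pre-trained lookahead knowledge, which we expect to cut downstream fine-tuning by roughly 95% (e.g. from about 50B tokens to about 2B tokens). This figure is a working estimate that the roadmap at the end of this post is meant to test.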
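As a minimal sketch of what this mapping could look like in practice, the function below renames per-head MTP parameters to speculator parameters and optionally checks shapes. The key prefixes `mtp.<k>.` and `speculator.<k>.` are hypothetical naming schemes chosen for illustration; a real checkpoint will use its own names, and a real conversion may also need to transpose, slice, or pad weights.

```python
import re

# Hypothetical naming scheme: "mtp.<k>.<suffix>" for the k-th MTP head.
MTP_KEY = re.compile(r"^mtp\.(\d+)\.(.+)$")


def remap_mtp_to_speculator(mtp_state, expected_shapes=None):
    """Rename MTP parameters to speculator parameters, head by head.

    mtp_state: dict mapping parameter name -> tensor-like object with .shape
    expected_shapes: optional dict mapping speculator name -> shape tuple
    """
    spec_state = {}
    for name, weight in mtp_state.items():
        match = MTP_KEY.match(name)
        if match is None:
            continue  # not an MTP parameter (e.g. backbone weights)
        k, suffix = match.groups()
        new_name = f"speculator.{k}.{suffix}"  # hypothetical target name
        if expected_shapes is not None and new_name in expected_shapes:
            got, want = tuple(weight.shape), tuple(expected_shapes[new_name])
            if got != want:
                raise ValueError(f"{new_name}: shape {got} != expected {want}")
        spec_state[new_name] = weight  # reuse the tensor, no copy
    return spec_state
```

Failing loudly on a shape mismatch is deliberate: a silently mis-mapped projection would still load, but it would throw away the prior this approach depends on.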
This post-hoc weight conversion splits the work in two: pre-training builds the core representations and lookahead priors, while post-training builds the final, optimized speculator architecture. We explore two concrete pathways for this conversion: Medusa and EAGLE.
Transforming parallel heads into a multi-branch speculative tree
Medusa speeds up generation by attaching multiple decoding heads to the final hidden state of a base LLM, creating a tree of candidate sequences. Since parallel MTP heads are structurally similar (each takes a hidden state and projects it to token probabilities), we can directly map MTP weights to initialize Medusa heads.
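To make the "tree of candidate sequences" concrete, here is a simplified sketch that takes the top-k tokens from each head and enumerates every combination. The function names are our own, and the full Cartesian product is a simplification: as we understand it, Medusa uses a sparser, tuned tree and verifies all branches in one pass with tree attention.

```python
from itertools import product


def top_k(logits, k):
    """Return the indices of the k largest logits, best first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def build_candidates(head_logits, k_per_head):
    """Enumerate candidate continuations from per-head top-k tokens.

    head_logits[i] holds head i's logits over the vocabulary; head i
    proposes the token i + 1 positions past the base model's next token.
    """
    choices = [top_k(logits, k) for logits, k in zip(head_logits, k_per_head)]
    return [list(path) for path in product(*choices)]
```

For example, `build_candidates([[0.1, 2.0, 0.5], [1.0, 0.2, 3.0]], [2, 1])` returns `[[1, 2], [2, 2]]`: two options from the first head, each extended by the single best token from the second.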
During the conversion, the linear projection layers of the $k$-th MTP head are copied into the $k$-th Medusa head. Because the base model is modified during post-training (e.g., SFT and RL), the hidden states it now produces differ slightly from those the heads saw during pre-training. We correct for this distribution shift with a brief self-distillation phase in which we freeze the base LLM and fine-tune only the converted Medusa heads on a small corpus of 1–2B tokens.
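One way to picture the self-distillation objective is as a soft-label cross-entropy: the frozen base model's own distribution for a future position is the teacher, and the converted head's distribution for that same position is the student. The sketch below is an assumption about how such a loss could be written, not a confirmed training recipe; it works on plain Python lists for a single position.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def distill_loss(teacher_logits, head_logits):
    """Cross-entropy of the head's distribution against teacher soft labels.

    teacher_logits: frozen base model's logits for the target position
    head_logits: converted Medusa head's logits for that same position
    """
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits)]
    head_log_probs = log_softmax(head_logits)
    return -sum(p * lq for p, lq in zip(teacher_probs, head_log_probs))
```

The loss is smallest when the head reproduces the teacher's distribution, so gradients flow only into the heads while the base model stays fixed.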
Recycling sequential modules into autoregressive feature extrapolators
EAGLE is among the strongest published speculative decoding methods; it predicts hidden features autoregressively rather than tokens in parallel. DeepSeek-V3's MTP module is well suited for EAGLE conversion because it is built from sequential extractor blocks that model the transition from $h_t^{d-1}$ to $h_t^{d}$.
To convert MTP into an EAGLE speculator, we map the pre-trained sequential MTP extractor layers (which contain attention blocks and projection layers) directly into the decoder layer that the EAGLE draft model applies autoregressively. Since the MTP extractor was co-trained to model hidden feature transformations, it should provide an accurate starting state for the feature regression task and avoid much of the costly from-scratch training that EAGLE typically requires.
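The feature regression task can be sketched as a two-part loss: a Smooth L1 distance between the drafted and true hidden features, plus a down-weighted cross-entropy on the token that the drafted feature decodes to. This mirrors the original EAGLE training objective as we understand it; the default `cls_weight=0.1` and all function names here are assumptions for illustration.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 distance between two feature vectors."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta  # quadratic near zero
        else:
            total += diff - 0.5 * beta  # linear for large errors
    return total / len(pred)


def token_cross_entropy(logits, target_id):
    """Cross-entropy of one logit vector against a single target token id."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def eagle_style_loss(pred_feature, target_feature, draft_logits, target_id,
                     cls_weight=0.1):
    """Feature regression loss plus a weighted token classification loss."""
    reg = smooth_l1(pred_feature, target_feature)
    cls = token_cross_entropy(draft_logits, target_id)
    return reg + cls_weight * cls
```

With MTP-initialized weights, the hope is that the regression term starts small, since the extractor was already trained to produce features the shared output head can decode.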
In this simulated feature trajectory space, the purple line represents the ground truth path of hidden states computed by the target model. The cyan line is the trajectory proposed by the draft model. When initialized with the pre-trained MTP prior, the draft trajectory closely matches the target's path, leading to higher token acceptance.
Throughput and Speedup Simulator
This interactive simulator estimates the real-world inference speedup of recycled speculative models compared to traditional autoregressive decoding under various hardware parameters, batch sizes, and tuning levels.
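For a quick back-of-the-envelope estimate, the standard speculative decoding analysis gives a closed form. The sketch below assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that one draft forward pass costs `cost_ratio` times one target forward pass. These parameter names are ours, and the formula ignores batch size and hardware effects.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per target-model verification pass.

    alpha: per-token acceptance probability (assumed i.i.d.)
    gamma: number of tokens drafted per step
    """
    if alpha >= 1.0:
        return gamma + 1.0  # every draft accepted, plus one bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, gamma, cost_ratio):
    """Estimated wall-clock speedup over plain autoregressive decoding.

    cost_ratio: time of one draft pass divided by time of one target pass
    """
    step_cost = gamma * cost_ratio + 1.0  # gamma draft passes + 1 target pass
    return expected_tokens_per_step(alpha, gamma) / step_cost
```

At `alpha=0.5` and `gamma=3` the expected yield is 1.875 tokens per verification pass, which shows why acceptance rates around 50% translate into only modest speedups once draft cost is included.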
Roadmap for implementation
To validate the recycling concept, we suggest the following structured roadmap:
- Phase 1: Weight Extraction (Checkpoint Auditing). Write a script to extract MTP weights from a co-trained Gemma-4 or DeepSeek-V3-like pre-training checkpoint, validating projection shapes and matrix properties.
- Phase 2: The Medusa Conversion Recipe. Build a minimal repository that converts the extracted weights to a Medusa state, then run a short self-distillation training loop on 1B tokens. Measure the initial acceptance rate against a from-scratch baseline.
- Phase 3: The EAGLE Adaptation Pipeline. Map the sequential extractor layers to an EAGLE draft structure, then fine-tune with an autoregressive hidden-feature regression objective.
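For the acceptance-rate measurement in Phase 2, a minimal sketch under greedy verification is shown below: a drafted token counts as accepted only if it and every drafted token before it match what the target model would have produced. The function names are hypothetical, and sampling-based verification (rejection sampling) would need a different acceptance rule.

```python
def accepted_prefix_length(draft_tokens, target_tokens):
    """Count leading draft tokens that match the target's greedy tokens."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break  # the first mismatch rejects everything after it
        n += 1
    return n


def acceptance_rate(pairs):
    """Fraction of drafted tokens accepted across (draft, target) pairs."""
    drafted = sum(len(draft) for draft, _ in pairs)
    accepted = sum(accepted_prefix_length(d, t) for d, t in pairs)
    return accepted / drafted if drafted else 0.0
```

Running this on the same prompts for the converted heads and for a from-scratch baseline, before any fine-tuning, gives a direct read on how much of the pre-trained prior survives the conversion.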
Downstream fine-tuning duration tip: because the MTP heads have already converged on lookahead representations during pre-training, fine-tuning should require roughly 10–20× fewer steps than from-scratch training, which would make this recipe accessible on commercially available hardware.