Recycling Pre-Trained MTP Modules for Speculative Decoding

01 — The Speculative Dilemma

The friction between data scale and architectural agility

Speculative decoding speeds up inference by having a lightweight draft model propose future tokens, which the larger target model then verifies in parallel. Training a high-quality draft model, however, runs into a well-known bottleneck: the data gap. Draft models trained after pre-training typically see only a small secondary dataset for a limited number of steps, which leads to low draft accuracy (often below 50% acceptance) and modest real-world speedups.
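The propose-then-verify loop can be sketched for the greedy case as follows. This is a minimal illustration, not a production implementation: `verify_greedy` and `toy_target` are hypothetical names, and the target is queried once per position for readability, whereas a real system scores all drafted positions in a single batched forward pass.

```python
from typing import Callable, List


def verify_greedy(
    prefix: List[int],
    draft_tokens: List[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Accept the longest run of draft tokens that matches the target's greedy choice.

    Always returns at least one token: either the target's correction at the
    first mismatch, or a bonus token when every draft token is accepted.
    """
    context = list(prefix)
    emitted: List[int] = []
    for tok in draft_tokens:
        expected = target_next(context)
        if tok != expected:
            emitted.append(expected)  # the target's token replaces the rejected draft
            return emitted
        emitted.append(tok)
        context.append(tok)
    emitted.append(target_next(context))  # all drafts accepted, so one extra token is free
    return emitted


def toy_target(context: List[int]) -> int:
    """Stand-in target model (assumption): always predicts last token + 1, modulo 10."""
    return (context[-1] + 1) % 10


print(verify_greedy([1, 2, 3], [4, 5, 9], toy_target))
```

In this toy run the first two drafted tokens are accepted and the third is replaced by the target's own choice, so one verification step yields three tokens instead of one.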

Multi-Token Prediction (MTP) addresses this by training auxiliary heads during pre-training itself, so they benefit from the same trillions of tokens of diverse data as the backbone. But MTP creates a new problem: architectural lock-in. Because a pre-training run costs millions of dollars, the MTP architecture must be fixed before training starts. This pushes designers toward simple, safe choices (such as linear heads or shallow MLPs) and rules out experimentation with more advanced drafting techniques such as feature-level regression (EAGLE) or tree-based speculation (Medusa).

Figure: pipeline comparison. Training workflows compared by draft-model data scale and architectural flexibility.

Post-Hoc Draft (traditional speculative decoding)
Workflow: Pre-Train Target (10T tokens) → Train Small Draft (50B tokens) → Serve with Target
Draft Data Exposure: Low (<1%)
Inference Acceptance Rate (α): ~45%

Native MTP (pre-trained multi-token prediction heads)
Workflow: Pre-Train Backbone + MTP Heads Jointly (10T tokens) → Serve Native MTP
Draft Data Exposure: Full (100%)
Inference Acceptance Rate (α): ~80% (Locked Arch)

Recycled MTP (proposed: pre-train MTP, then adapt post-hoc)
Workflow: Pre-Train Backbone + Simple MTP → Convert Heads to EAGLE/Medusa → Fine-Tune (10B tokens)
Draft Data Exposure: Full Prior + Fine-tuning
Inference Acceptance Rate (α): ~82% (Optimized)

By pre-training with standard MTP heads and then converting those heads into specialized speculative architectures, we can merge the dense prior representation learned during pre-training with the architectural benefits of advanced speculative drafters. This allows us to escape the pre-training architectural lock-in.

02 — MTP Pre-Training Dynamics

How co-training shapes the hidden representations

During MTP co-training, the model is optimized for predicting multiple future tokens simultaneously. For instance, in DeepSeek-V3, at each token position, the model uses sequential auxiliary modules to predict the next D tokens. This changes the model's training dynamics in two crucial ways:

Denser Gradient Signals: Instead of predicting only the single next token, the backbone receives gradients from D prediction tasks at every step, speeding up representation learning.
Feature-Space Lookahead: The MTP representation extractors learn to map the base model's hidden state $h_t$ to subsequent states $h_{t+d}$. The layers do not just model token probabilities; they also learn the temporal trajectory of hidden representations.
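To make the denser gradient signal concrete, here is a minimal sketch of a combined objective for a single position: the usual next-token loss plus the average of the D auxiliary losses, scaled by a weight. The function names and the default weight `lam=0.3` are illustrative assumptions rather than values taken from this document.

```python
import math
from typing import List, Sequence


def cross_entropy(probs: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the target token under a probability vector."""
    return -math.log(probs[target])


def mtp_loss(
    main_probs: Sequence[float],
    mtp_probs: List[Sequence[float]],
    targets: Sequence[int],
    lam: float = 0.3,  # illustrative weight on the auxiliary losses (assumption)
) -> float:
    """Main next-token loss plus the lam-weighted mean of the D auxiliary losses.

    targets[0] is the next token; targets[d] is the token that the depth-d
    auxiliary module is asked to predict.
    """
    if len(targets) != len(mtp_probs) + 1:
        raise ValueError("need one target for the main head plus one per MTP depth")
    main = cross_entropy(main_probs, targets[0])
    depth_losses = [cross_entropy(p, t) for p, t in zip(mtp_probs, targets[1:])]
    return main + lam * sum(depth_losses) / len(depth_losses)
```

Every term in the sum backpropagates into the shared backbone, which is the sense in which each position contributes D extra prediction tasks.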

Figure: DeepSeek-V3 sequential MTP architecture. Tokens flow through auxiliary layers to predict ahead.

In DeepSeek-V3's sequential MTP architecture, the hidden state $h_t$ is passed through $D$ extractor layers in sequence. At depth $d$, the extractor combines the previous state $h_t^{d-1}$ with the embedding of token $t+d$ (the ground-truth token during training, the drafted token at inference) to predict token $t+d+1$. Because each depth conditions on the one before it, the causal chain is preserved, which gives a structured pathway that can be recycled.
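Written out, one depth of the sequential module looks roughly like this. The symbols follow our reading of the DeepSeek-V3 technical report and should be treated as assumptions here: $M_d$ is a linear projection, $\operatorname{TRM}_d$ a Transformer block, $\operatorname{Emb}$ the shared embedding table, $\operatorname{OutHead}$ the shared output head, and $x_{t+d}$ the token at position $t+d$. Indices use the convention in which the depth-$d$ module consumes the embedding of token $t+d$ and predicts token $t+d+1$.

```latex
\begin{aligned}
h'^{\,d}_t &= M_d \left[ \operatorname{RMSNorm}\!\left(h^{d-1}_t\right) ;\ \operatorname{RMSNorm}\!\left(\operatorname{Emb}(x_{t+d})\right) \right] \\
h^{d}_t &= \operatorname{TRM}_d\!\left(h'^{\,d}_t\right) \\
P_{t+d+1} &= \operatorname{OutHead}\!\left(h^{d}_t\right)
\end{aligned}
```

With $h^0_t$ taken to be the backbone's own hidden state, the recursion makes explicit that each depth is a feature-to-feature map conditioned on one extra token, which is the shape an EAGLE-style drafter also has.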

03 — The Recycling Thesis

Why pre-trained MTP layers are the perfect seed

The core hypothesis is that **pre-trained MTP heads contain a rich representational prior**. Since they have been co-trained with the base model on trillions of tokens, they already encode syntax, vocabulary projections, and lookahead semantics. Instead of discarding these heads or serving them in their original locked architecture, we can map their weights to form the starting state of a more advanced speculator.

Weight Mapping Viability

Let $W^{\text{mtp}}_k$ be the pre-trained weights of MTP head $k$. We map:

$W^{\text{mtp}}_k \to W^{\text{speculator}}_k$

This initializes the speculator with pre-trained lookahead knowledge, which we expect to cut downstream fine-tuning by roughly 95% (e.g. from 50B tokens to about 2B tokens).

This post-hoc weight conversion splits the work differently from either baseline: pre-training builds the core representations and lookahead priors, while post-training builds the final, optimized speculator architecture. We explore two concrete pathways for this conversion: Medusa and EAGLE.
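A weight mapping of this kind might be sketched as below, with tensors modelled as nested Python lists so the example runs without any framework. The key names, the `key_map` argument, and the rule of skipping any tensor whose shape does not match are all assumptions made for illustration; a real conversion would operate on framework tensors and its own naming scheme.

```python
import copy
from typing import Dict, List, Tuple


def shape(tensor) -> Tuple[int, ...]:
    """Shape of a nested-list tensor, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        if not tensor:
            break
        tensor = tensor[0]
    return tuple(dims)


def warm_start(
    speculator_state: Dict[str, list],
    mtp_state: Dict[str, list],
    key_map: Dict[str, str],
) -> Tuple[Dict[str, list], List[str], List[str]]:
    """Copy MTP tensors into the speculator where a mapped key exists and shapes agree.

    Tensors that cannot be mapped keep the speculator's own initialization.
    Returns the new state plus the lists of copied and skipped destination keys.
    """
    new_state = dict(speculator_state)
    copied, skipped = [], []
    for src_key, dst_key in key_map.items():
        src = mtp_state.get(src_key)
        dst = speculator_state.get(dst_key)
        if src is not None and dst is not None and shape(src) == shape(dst):
            new_state[dst_key] = copy.deepcopy(src)
            copied.append(dst_key)
        else:
            skipped.append(dst_key)
    return new_state, copied, skipped
```

Reporting the skipped keys matters in practice: any layer that exists only in the new architecture starts from scratch, and the size of that set is a rough indicator of how much fine-tuning the conversion will still need.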

04 — Converting to Medusa

Transforming parallel heads into a multi-branch speculative tree

Medusa speeds up generation by attaching multiple decoding heads to the final hidden state of a base LLM, creating a tree of candidate sequences. Since pre-trained MTP heads are structurally similar (taking hidden states and projecting them to token probabilities), we can directly map MTP weights to initialize Medusa heads.

During the conversion, the linear projection layers of the $k$-th MTP head are copied into the $k$-th Medusa head. Because the base model has been modified during post-training (e.g., SFT and RL), there is a slight distribution shift. We resolve this by running a brief self-distillation phase where we freeze the base LLM and fine-tune only the converted Medusa heads on a small corpus of 1–2B tokens.

Medusa Tree Construction Walkthrough

Load Pre-trained MTP Checkpoint

Extract the weights of the target base model and the $K$ co-trained parallel MTP heads.

base_state_dict, mtp_state_dicts = load_checkpoint(path)
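The `load_checkpoint` helper above is not defined in this document. As one possible reading of the split it performs, the sketch below separates an already loaded flat state dict into base-model entries and per-head entries. The `mtp_heads.<k>.` key prefix is a hypothetical naming scheme; real checkpoints will differ.

```python
import re
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def split_checkpoint(
    state_dict: Dict[str, Any],
    mtp_pattern: str = r"^mtp_heads\.(\d+)\.(.+)$",  # hypothetical key layout
) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Split a flat state dict into base weights and a list of per-head state dicts.

    Heads are returned in ascending index order, with the prefix stripped from
    their keys so each dict can be loaded into a standalone head module.
    """
    regex = re.compile(mtp_pattern)
    base: Dict[str, Any] = {}
    heads: Dict[int, Dict[str, Any]] = defaultdict(dict)
    for key, tensor in state_dict.items():
        match = regex.match(key)
        if match:
            heads[int(match.group(1))][match.group(2)] = tensor
        else:
            base[key] = tensor
    return base, [heads[k] for k in sorted(heads)]
```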

Initialize Medusa Heads with MTP Weights

Map the linear projection matrices of the MTP heads directly into the new Medusa head layers. This transfers the pre-trained lookahead mapping.

medusa_heads[k].load_state_dict(mtp_state_dicts[k])

Generate Speculative Tree Structure

Use the initialized heads to draft a tree of candidate tokens (e.g. the first head proposes its top-2 tokens and the second head proposes its top-2, giving 2 × 2 = 4 candidate paths).

candidates = generate_speculative_tree(hidden_states, medusa_heads)
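The candidate enumeration in this step can be sketched as a Cartesian product over each head's top-k tokens. `top_k` and `candidate_paths` are hypothetical helpers, and the sketch only lists the paths; it does not build the tree attention mask that lets the target model verify shared prefixes once.

```python
from itertools import product
from typing import List, Sequence


def top_k(probs: Sequence[float], k: int) -> List[int]:
    """Indices of the k most probable tokens, most probable first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def candidate_paths(head_probs: List[Sequence[float]], ks: Sequence[int]) -> List[List[int]]:
    """All root-to-leaf paths when head i contributes its top ks[i] tokens.

    Because each head predicts from the same hidden state, the choices at
    different depths are independent and the paths form a full product.
    """
    per_head = [top_k(p, k) for p, k in zip(head_probs, ks)]
    return [list(path) for path in product(*per_head)]


paths = candidate_paths([[0.1, 0.6, 0.3], [0.5, 0.2, 0.3]], ks=[2, 2])
print(len(paths), paths)
```

The number of paths is the product of the per-head k values, so wider or deeper trees grow quickly and are usually pruned to the most likely branches.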

Self-Distillation Fine-Tuning

Run self-distillation on a small corpus to align the recycled heads with the post-trained base model, pushing acceptance rate from ~55% to >72%.

distill_heads(target_model, medusa_heads, data, steps=1000)
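One plausible form of the objective inside `distill_heads` (which is not defined in this document) is a soft-label cross-entropy against the frozen base model's own predictions. The sketch assumes that `teacher_dists[i]` is the base model's distribution over the token after position `i`, and that a head with a given `offset` at position `i` should match the teacher's distribution at position `i + offset`. Both the alignment and the function names are assumptions.

```python
import math
from typing import List, Sequence


def soft_cross_entropy(teacher: Sequence[float], student: Sequence[float], eps: float = 1e-12) -> float:
    """Cross-entropy of the student distribution against soft teacher labels."""
    return -sum(p * math.log(q + eps) for p, q in zip(teacher, student))


def head_distill_loss(
    teacher_dists: List[Sequence[float]],
    head_dists: List[Sequence[float]],
    offset: int,
) -> float:
    """Mean soft cross-entropy for one head that looks `offset` extra tokens ahead.

    The head's output at position i is paired with the teacher's output at
    position i + offset; trailing positions without a teacher target are dropped.
    """
    pairs = list(zip(teacher_dists[offset:], head_dists))
    if not pairs:
        raise ValueError("sequence is too short for this offset")
    return sum(soft_cross_entropy(t, s) for t, s in pairs) / len(pairs)
```

Only the head parameters would receive gradients from this loss; the base model stays frozen, so its outputs remain exactly what the serving stack will verify against.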


05 — Converting to EAGLE

Recycling sequential modules into autoregressive feature extrapolators

EAGLE is among the strongest speculative decoding approaches; it predicts hidden features autoregressively rather than predicting tokens in parallel. DeepSeek-V3's MTP module is well suited to EAGLE conversion because it is built from sequential extractor blocks that model the transition from $h_t^{d-1}$ to $h_t^d$.

To convert MTP into an EAGLE speculator, we map the pre-trained sequential MTP extractor layers (which contain attention blocks and projection layers) directly into the recurrent layer of the EAGLE draft model. Since the MTP extractor was co-trained to model hidden-feature transformations, it should provide an accurate starting state for the feature regression task and avoid most of the costly from-scratch training that EAGLE typically requires.
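The feature regression task can be sketched as a per-step loss that combines a Smooth L1 distance between the drafted and true hidden features with a down-weighted token cross-entropy. This mirrors the EAGLE recipe as it is commonly described, but the function names and the default `cls_weight=0.1` are assumptions for illustration, not settings taken from this document.

```python
import math
from typing import Sequence


def smooth_l1(pred: Sequence[float], target: Sequence[float], beta: float = 1.0) -> float:
    """Mean Smooth L1 (Huber-style) distance between two feature vectors."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)


def eagle_style_loss(
    pred_feature: Sequence[float],
    target_feature: Sequence[float],
    token_probs: Sequence[float],
    target_token: int,
    cls_weight: float = 0.1,  # illustrative weighting (assumption)
) -> float:
    """Feature regression loss plus a weighted token classification loss."""
    regression = smooth_l1(pred_feature, target_feature)
    classification = -math.log(token_probs[target_token])
    return regression + cls_weight * classification
```

An MTP extractor that already maps one hidden state to the next should start with a small regression term, which is the practical meaning of a good starting state here.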

Figure: feature trajectory space (2D projection), visualizing draft paths against the target path.

In this simulated feature trajectory space, the purple line represents the ground truth path of hidden states computed by the target model. The cyan line is the trajectory proposed by the draft model. When initialized with the pre-trained MTP prior, the draft trajectory closely matches the target's path, leading to higher token acceptance.

06 — Interactivity Lab

Throughput and Speedup Simulator

This interactive simulator estimates the real-world inference speedup of recycled speculative models compared to traditional autoregressive decoding under various hardware parameters, batch sizes, and tuning levels.

Figure: throughput simulator (interactive). Inputs: draft architecture, fine-tuning steps (corpus size), and batch size (concurrent requests). Outputs: acceptance rate (α) and speedup factor. Example setting shown: 3,000 fine-tuning steps, α = 74.5%, speedup 1.76×.
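For readers without the interactive widget, a back-of-the-envelope estimate can be computed from the standard analysis of chain-style speculative decoding: with an independent per-token acceptance probability α and γ drafted tokens per step, the expected number of tokens produced per target pass is (1 − α^(γ+1)) / (1 − α). The sketch below is not the simulator's actual model. `cost_ratio` (draft cost relative to one target pass) is an assumed parameter, and batch size and tree drafting are ignored.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification step, including the target's own token."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if alpha == 1.0:
        return gamma + 1.0  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Tokens per step divided by the relative cost of one draft-and-verify step.

    One step costs gamma draft passes (each cost_ratio of a target pass)
    plus a single target pass.
    """
    return expected_tokens(alpha, gamma) / (gamma * cost_ratio + 1.0)


for alpha in (0.45, 0.80):
    print(alpha, round(estimated_speedup(alpha, gamma=4, cost_ratio=0.05), 2))
```

The estimate shows why acceptance rate dominates: the numerator saturates quickly when α is low, so a longer draft only adds cost.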

07 — Next Steps

Roadmap for implementation

To validate the recycling concept, we suggest the following structured roadmap:

Phase 1: Weight Extraction (Checkpoint Auditing). Write a script to extract MTP weights from a co-trained Gemma-4 or DeepSeek-V3-like pre-training checkpoint, and validate the projection shapes and matrix properties.
Phase 2: The Medusa Conversion Recipe. Build a minimal repository that converts the extracted weights into a Medusa state, then run a short self-distillation training loop on 1B tokens. Measure the initial acceptance rate against a from-scratch baseline.
Phase 3: The EAGLE Adaptation Pipeline. Map the sequential extractor layers onto an EAGLE draft structure, then fine-tune with an autoregressive regression loss on the hidden features.
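For the acceptance-rate comparison in Phase 2, a simple measurement helper might look like the following. It assumes the serving loop logs one `(drafted, accepted)` pair per verification step; that log format and the function name are hypothetical.

```python
from typing import Dict, List, Tuple


def acceptance_stats(steps: List[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize a run from per-step (drafted, accepted) token counts.

    acceptance_rate is accepted / drafted over the whole run.
    tokens_per_step adds one token per step for the target's own
    correction or bonus token.
    """
    drafted = sum(d for d, _ in steps)
    accepted = sum(a for _, a in steps)
    if drafted == 0:
        raise ValueError("no drafted tokens were logged")
    return {
        "acceptance_rate": accepted / drafted,
        "tokens_per_step": (accepted + len(steps)) / len(steps),
    }


print(acceptance_stats([(4, 4), (4, 2), (4, 0)]))
```

Running the same helper on the recycled heads and on a from-scratch baseline, using identical prompts and draft lengths, keeps the comparison like for like.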

Downstream fine-tuning duration tip: because the MTP heads have already converged on lookahead representations during pre-training, fine-tuning should require 10–20× fewer steps than from-scratch training, which would make this recipe practical on commodity hardware configurations.