Multimodal Training Dynamics Interactive

The Modality Regret: How the AI Industry Solved Multimodal Text Degradation

For years, transforming a highly capable text model into a multimodal model acted like a lobotomy, severely degrading its coding and mathematical reasoning. Here is the engineering history of how we diagnosed the "modality tax" and the architectural breakthroughs that finally fixed it.

June 3, 2026 | ~15 min read | Reference Papers: arXiv:2304.08485, arXiv:2405.11073, arXiv:2409.20487

01 — The Modality Tax

The Sudden Collapse of Reasoning

In 2023, the open-source AI community witnessed a gold rush. Following the release of Meta's LLaMA, researchers figured out a remarkably elegant way to build Multimodal Large Language Models (MLLMs). By taking a powerful, pre-trained text model (like Vicuna-7B or Mistral-7B), freezing a vision encoder (like CLIP ViT), and training a small linear adapter to translate image vectors into the LLM's token embedding space, they created models like LLaVA and MiniGPT-4. They could see.
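The adapter recipe above can be illustrated with a minimal sketch. It uses plain Python lists instead of a tensor library, and every name and size in it (`make_linear_adapter`, `project_patches`, the toy widths 6 and 8) is a hypothetical stand-in, not the actual LLaVA or MiniGPT-4 code. The vision encoder is assumed to be frozen and to have already produced one vector per image patch.

```python
import random

def make_linear_adapter(vision_dim, llm_dim, seed=0):
    """Create the only trainable piece in this sketch: a linear projector."""
    rng = random.Random(seed)
    scale = vision_dim ** -0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(vision_dim)]
               for _ in range(llm_dim)]
    bias = [0.0] * llm_dim
    return weights, bias

def project_patches(patch_embeddings, weights, bias):
    """Map each vision-encoder patch vector into the LLM embedding space."""
    return [
        [sum(w * x for w, x in zip(row, patch)) + b
         for row, b in zip(weights, bias)]
        for patch in patch_embeddings
    ]

def build_llm_input(visual_tokens, text_token_embeddings):
    """Prepend the projected visual tokens to the text token embeddings."""
    return visual_tokens + text_token_embeddings

# Toy sizes (assumptions): 4 patches of width 6 from the frozen encoder,
# 3 text tokens, and an LLM embedding width of 8.
rng = random.Random(1)
patches = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(4)]
text = [[0.0] * 8 for _ in range(3)]

weights, bias = make_linear_adapter(vision_dim=6, llm_dim=8)
sequence = build_llm_input(project_patches(patches, weights, bias), text)
print(len(sequence), len(sequence[0]))  # 7 positions, each of width 8
```

The point of the design is that the LLM never sees pixels: it only sees extra rows in its input sequence that happen to have the right width.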

However, an alarming trade-off quickly emerged. When these adapted models were evaluated on pure text-only reasoning tasks, their performance collapsed. A base text model capable of solving roughly 30% of grade-school math word problems (GSM8K) or Python programming problems (HumanEval) could, after visual adaptation, drop to single digits. Learning to handle images had apparently overwritten the weights the model relied on for step-by-step reasoning.

Interactive: The Text Performance Penalty (toggle training recipes to see the text-only benchmark collapse)

| Benchmark | Base LLM (Text-only) | After Vision Adaptation |
| --- | --- | --- |
| Mathematical Reasoning (GSM8K) | 55.0% | 12.0% |
| Code Generation (HumanEval) | 62.0% | 18.0% |
| Core Language Reasoning (MMLU) | 70.0% | 45.0% |

Notice how standard visual instruction tuning (VQA-only) strips away math and code performance, while balanced mixes and native training mitigate this tax.

02 — SFT Gradient Conflict

Why it Happens: Gradient Hijacking

The root cause of this regression is **gradient conflict** and representation shift. During pre-training, an LLM builds highly structured attention representations that carry out the step-by-step causal logic required for code and math. When we attach a vision encoder, we project high-dimensional visual patch embeddings (usually from a frozen CLIP or SigLIP model) directly into the word embedding space.

If we fine-tune the LLM backbone solely on visual question answering (VQA) datasets, the gradients computed on these inputs are entirely oriented toward aligning the newly introduced visual representations. These visual embeddings represent a complete out-of-distribution shock. Because there is no mathematical or logical reasoning data in the visual SFT dataset, the model updates its internal weights to represent visual shapes and spatial relationships, overwriting the narrow attention parameters that enforce coding syntax and logical inference.
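A toy model makes the conflict concrete. The sketch below is an illustration under strong assumptions, not a measurement on a real network: the "weight space" is two-dimensional, each task has a simple quadratic loss, and the optima `TEXT_OPTIMUM` and `VISION_OPTIMUM` are invented values. It checks two things: that the text and vision gradients point in opposing directions near the pre-trained point, and that training on the vision loss alone raises the text loss.

```python
import math

# Hypothetical 2-D "weight space": each task has its own ideal weights.
TEXT_OPTIMUM = [1.0, 0.0]    # stands in for the math/code reasoning basin
VISION_OPTIMUM = [0.0, 1.0]  # stands in for the vision-language alignment basin

def loss(w, optimum):
    """Quadratic toy loss: half the squared distance to the task optimum."""
    return 0.5 * sum((wi - oi) ** 2 for wi, oi in zip(w, optimum))

def grad(w, optimum):
    """Gradient of the quadratic toy loss with respect to w."""
    return [wi - oi for wi, oi in zip(w, optimum)]

def cosine(u, v):
    """Cosine similarity; negative values mean the two gradients conflict."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def train_vision_only(w, steps=20, lr=0.1):
    """Plain gradient descent on the vision loss alone (VQA-only tuning)."""
    for _ in range(steps):
        w = [wi - lr * gi for wi, gi in zip(w, grad(w, VISION_OPTIMUM))]
    return w

w_start = [0.8, 0.1]  # a "pre-trained" point close to the text optimum
conflict = cosine(grad(w_start, TEXT_OPTIMUM), grad(w_start, VISION_OPTIMUM))
w_end = train_vision_only(w_start)

print(f"gradient cosine at start: {conflict:.2f}")
print(f"text loss before: {loss(w_start, TEXT_OPTIMUM):.3f}")
print(f"text loss after:  {loss(w_end, TEXT_OPTIMUM):.3f}")
```

In a real model the same diagnostic would be run on per-task gradients of the backbone parameters; the toy only shows why a negative cosine and a rising text loss go together.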

Interactive: Gradient Update Trajectory (adjust the Backbone Unfrozen Ratio to observe task-vector alignment; default 100%)
Figure: Task Gradient Vectors in Weight Space. Green = Math/Code, Purple = Vision, Cyan = Actual Step.

When the backbone is 100% unfrozen and we only train on vision instructions, the updates (Cyan) map purely onto the Vision-Language alignment vector (Purple), pulling the model's weights completely away from the Math/Code reasoning basin (Green dot).

03 — Balanced Replay

The First Fix: Joint Training Mixtures

The earliest and most straightforward mitigation was the introduction of **balanced replay mixtures**. Instead of fine-tuning the model exclusively on image-text pairs, researchers interleaved high-quality, pure-text instruction datasets (e.g. GSM8K, MBPP, UltraChat) into the visual instruction-tuning batches.

This mixed training acts as a powerful regularizer. The loss gradients computed on the text reasoning batches generate parameter update vectors that pull the model back toward its original reasoning capabilities, while the vision batches align the adapter. Meta's Llama 3.2-Vision targeted the same goal with a stricter variant of the idea: it connects the image encoder through cross-attention adapters and, according to Meta's announcement, leaves the language-model weights untouched during adapter training, which keeps regret on text-only tasks near zero.
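The interleaving step can be sketched as a small batch sampler. This is a minimal illustration, assuming examples are already loaded into two in-memory lists; the function name `mixed_batches` and its parameters are hypothetical and do not come from any particular training framework.

```python
import random

def mixed_batches(vision_examples, text_examples, text_ratio,
                  batch_size, num_batches, seed=0):
    """Yield batches in which roughly `text_ratio` of the examples are pure text."""
    if not 0.0 <= text_ratio <= 1.0:
        raise ValueError("text_ratio must be in [0, 1]")
    rng = random.Random(seed)
    n_text = round(batch_size * text_ratio)
    for _ in range(num_batches):
        # Sample with replacement so small replay sets can be reused freely.
        batch = (rng.choices(text_examples, k=n_text)
                 + rng.choices(vision_examples, k=batch_size - n_text))
        rng.shuffle(batch)  # avoid a fixed text-then-vision ordering
        yield batch

# Placeholder data: tuples tagged by modality instead of real samples.
vision = [("vision", i) for i in range(100)]
text = [("text", i) for i in range(100)]

batches = list(mixed_batches(vision, text, text_ratio=0.5,
                             batch_size=8, num_batches=3))
for batch in batches:
    print(sum(1 for kind, _ in batch if kind == "text"), "text examples of", len(batch))
```

Fixing the count per batch, rather than flipping a coin per example, keeps every optimizer step exposed to both kinds of gradient.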

Simulation: Parameter Drift & Settle (set the text data mix percentage and press Play to simulate training; default 20%)
Figure: Active Parameter Path.

By keeping the text share of the mix at 50% or above, the parameter trajectory stays within the intersection zone (Cyan overlap), achieving stable vision understanding without degrading text reasoning.
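One way to read the mix percentage is as a weight in the training objective. The notation here is an assumption introduced for illustration: $\lambda$ is the fraction of training examples drawn from text-only data, and $\mathcal{L}_{\text{text}}$ and $\mathcal{L}_{\text{vision}}$ are the average per-example losses on each source. If every example is drawn from the text pool with probability $\lambda$, the expected objective and its gradient are:

```latex
\mathcal{L}_{\text{mix}}(\theta)
  = \lambda\,\mathcal{L}_{\text{text}}(\theta)
  + (1-\lambda)\,\mathcal{L}_{\text{vision}}(\theta),
\qquad
\nabla_\theta \mathcal{L}_{\text{mix}}
  = \lambda\,\nabla_\theta \mathcal{L}_{\text{text}}
  + (1-\lambda)\,\nabla_\theta \mathcal{L}_{\text{vision}}
```

A 20% text mix corresponds to $\lambda = 0.2$, so the vision gradient dominates each expected step; at $\lambda \ge 0.5$ the text gradient contributes at least as much weight as the vision gradient.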

04 — Native Pre-training

Breakthrough: Early-Fusion Architectures

While balanced mixtures solved the SFT forgetting problem, late-fusion adapter architectures still suffered from a fundamental bottleneck: vision was still an afterthought bolted onto a finished language model. In late 2023 and 2024, researchers introduced native multimodal models trained **from scratch** as unified architectures, such as Google's **Gemini** and Meta's **Chameleon**.

Instead of grafting an adapter onto a pre-trained LLM, early-fusion models feed text and images through one shared token sequence. An image is split into patches, which are either mapped to discrete visual tokens with a VQ-GAN-style tokenizer (so text and image tokens share a single vocabulary) or projected into continuous embedding vectors. The entire model is then pre-trained from day one on interleaved web documents (text-only, image-only, and mixed text-image sequences). Because text and vision co-evolve inside the transformer layers, the model learns a unified representation, which removes the modality-grafting shock and maintains text benchmark parity.
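For the discrete-token variant, the shared vocabulary can be sketched by giving image codes their own ID range after the text IDs. The sizes and names below (`TEXT_VOCAB_SIZE`, `IMAGE_CODEBOOK_SIZE`, the `BOI`/`EOI` marker tokens) are illustrative assumptions, not the published configuration of any specific model, and the image codes are assumed to come from an already-trained VQ tokenizer.

```python
TEXT_VOCAB_SIZE = 32_000     # assumed number of text token IDs
IMAGE_CODEBOOK_SIZE = 8_192  # assumed number of VQ codebook entries

# Hypothetical special tokens that mark where an image starts and ends.
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE
EOI = BOI + 1

def image_codes_to_tokens(codes):
    """Shift VQ codebook indices into their own range of the shared vocabulary."""
    for code in codes:
        if not 0 <= code < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"code {code} is outside the codebook")
    return [BOI] + [TEXT_VOCAB_SIZE + code for code in codes] + [EOI]

def interleave(segments):
    """Flatten ("text", ids) and ("image", codes) segments into one sequence."""
    tokens = []
    for kind, values in segments:
        if kind == "text":
            tokens.extend(values)
        elif kind == "image":
            tokens.extend(image_codes_to_tokens(values))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return tokens

document = [("text", [5, 17]), ("image", [0, 8191]), ("text", [42])]
print(interleave(document))
# [5, 17, 40192, 32000, 40191, 40193, 42]
```

Once everything is a single list of integer IDs, the transformer and its next-token loss need no modality-specific branches.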

Architecture Contrast (toggle between Adapter Late-Fusion and Native Early-Fusion)
Late-fusion view: Image Input -> Vision Encoder -> Linear Adapter -> Pre-trained LLM Backbone (Grafted) <- Text Prompt Input

In **Early Fusion**, there are no separately trained modality pathways or grafted adapters. Images and text flow through the same layers as token sequences from day one, allowing the model to develop joint representations.

05 — Test-Time Compute

Extracting Dense Visual Information

Even with unified pre-training, a core practical limitation remained: compression. Standard vision encoders compress a large 1024x1024 pixel image into a small, fixed grid of visual tokens (often 256 or 576 tokens). This compression acts as a low-pass filter, wiping out dense small text, math symbols, and small charts.

To solve this, 2025 reasoning-focused multimodal models (like **Qwen2.5-VL** and **Gemini 2.0 Flash Thinking**) shifted from one-shot generation to **test-time compute (inference scaling)**. Instead of instantly spitting out an answer, the model generates an internal Chain of Thought (CoT), allocates compute to examine sub-regions of the image at native resolution, and dynamically "zooms in" on details of interest before executing final logical steps. This mirrors human perception, allowing the model to ground its textual logic directly in high-fidelity visual evidence.
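The benefit of zooming can be quantified with simple arithmetic on token density. The sketch below assumes a fixed budget of 576 visual tokens per encoder pass (one of the grid sizes mentioned above) and a hypothetical 256-pixel square crop; the helper names `pixels_per_token` and `crop_box` are made up for this example and do not correspond to any model's API.

```python
def pixels_per_token(width, height, num_tokens):
    """Average image area, in pixels, that one visual token has to summarize."""
    return (width * height) / num_tokens

def crop_box(center_x, center_y, crop_size, width, height):
    """Return (left, top, right, bottom) for a square crop clamped to the image."""
    half = crop_size // 2
    left = min(max(center_x - half, 0), width - crop_size)
    top = min(max(center_y - half, 0), height - crop_size)
    return left, top, left + crop_size, top + crop_size

IMAGE_W, IMAGE_H = 1024, 1024
TOKENS_PER_PASS = 576  # assumed fixed token budget per encoder pass
CROP_SIZE = 256        # assumed side length of the zoomed region

# Pass 1: the whole image squeezed into the token budget.
global_density = pixels_per_token(IMAGE_W, IMAGE_H, TOKENS_PER_PASS)

# Pass 2: a region of interest near the bottom-right corner, same budget.
box = crop_box(900, 900, CROP_SIZE, IMAGE_W, IMAGE_H)
zoom_density = pixels_per_token(CROP_SIZE, CROP_SIZE, TOKENS_PER_PASS)

print("crop box:", box)
print("detail gain:", global_density / zoom_density)
```

Under these assumptions each token in the zoomed pass covers 16 times less image area than in the global pass, which is why small labels become legible only after the crop.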

Inference Step Navigator (step through the visual reasoning steps to observe confidence score scaling)

Step 1: Initial Visual Scan (Global View)
The model takes a low-resolution scan of the image. It registers a geometry problem consisting of a square containing a circle, and a prompt: "Compute shaded area."
Visual tokens: 256 | Target identified: [Square, Circle]

Step 2: Active Region Crop & Zoom
The model allocates compute to zoom into the bottom-right corner where a small label resides. It reads: "Radius = 7cm" (a detail lost in the low-res global view).
Visual tokens: 1024 (native patch) | Metric: Radius = 7

Step 3: Chain-of-Thought Equation Formulation
The model generates reasoning steps: "If radius = 7, square side = 14. Area(Square) = 14*14 = 196. Area(Circle) = pi * 7^2 ≈ 154."
Thinking tokens: 48 | Eq: Shaded = 196 - 154

Step 4: Math Execution & Self-Correction
The model performs the subtraction: "196 - 154 = 42. Double check if any units are specified: cm^2. Final answer ≈ 42 cm^2."
Confidence: 99.2% | Result: 42 cm^2

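The arithmetic in the navigator can be checked in closed form. This assumes the circle is inscribed in the square, which is what the step "square side = 14" implies for a radius of 7; the symbols $r$ and $A_{\text{shaded}}$ are introduced here for the derivation.

```latex
A_{\text{shaded}} = (2r)^2 - \pi r^2 = r^2\,(4 - \pi),
\qquad
r = 7\ \text{cm} \;\Rightarrow\;
A_{\text{shaded}} = 49\,(4 - \pi) \approx 42.06\ \text{cm}^2
```

The rounded figures 154 and 42 follow from approximating $\pi$ as $22/7$, which gives $196 - 154 = 42$ exactly.
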
06 — State of the Art

The Modality Tax is Dead

Today, the "modality tax" is effectively zero for well-designed recipes. Modern multimodal models are trained using native early fusion or carefully regularized cross-attention adapters paired with vast, balanced pre-training mixtures. When we compare Llama 3.2 11B Vision against its text-only progenitor Llama 3.1 8B, or Qwen2-VL 7B against Qwen2 7B, their text benchmark scores are on par or marginally higher.

Here is how modern multimodal models stand compared to their text-only counterparts, proving that adding vision no longer degrades the core reasoning engines of AI:

| Model | Modality | GSM8K (Math) | HumanEval (Code) | MMLU (Reasoning) |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B Instruct | Text Only | 84.5% | 72.6% | 73.0% |
| Llama 3.2 11B Vision | Vision + Text | 84.8% | 73.1% | 73.5% |
| Qwen2 7B Instruct | Text Only | 82.3% | 78.4% | 70.5% |
| Qwen2-VL 7B Instruct | Vision + Text | 83.1% | 79.0% | 71.2% |

By shifting from naive fine-tuning to balanced training mixtures, native early fusion, and test-time visual crop search, AI models have transitioned from "blind text calculators" to grounded, spatial, and mathematical thinkers without giving up text-only benchmark performance.