The Sudden Collapse of Reasoning
In 2023, the open-source AI community witnessed a gold rush. Following the release of Meta's LLaMA, researchers found a remarkably elegant way to build Multimodal Large Language Models (MLLMs): take a strong pre-trained text model (like Vicuna-7B or Mistral-7B), freeze a vision encoder (like CLIP ViT), and train a small adapter, often just a linear projection, that translates image features into the LLM's token embedding space. The result was models like LLaVA and MiniGPT-4. They could see.
However, an alarming trade-off quickly emerged. When these adapted models were evaluated on text-only reasoning tasks, their performance collapsed. A base text model that solved around 30% of grade-school math word problems (GSM8K) or Python programming problems (HumanEval) could, after visual adaptation, drop to single digits. Learning to read images had overwritten weights the model relied on for step-by-step reasoning.
Notice how standard visual instruction tuning (VQA-only) strips away math and code performance, while balanced mixtures or native multimodal training mitigate this tax.
Why it Happens: Gradient Hijacking
The root cause of this regression is **gradient conflict** combined with representation shift. During pre-training, an LLM learns internal representations and attention patterns that support the multi-step, causal reasoning required for code and math. When we attach a vision encoder, we project high-dimensional visual patch embeddings (usually from a frozen CLIP or SigLIP model) directly into the word embedding space.
If we fine-tune the LLM backbone solely on visual question answering (VQA) datasets, every gradient is computed on inputs that contain these newly introduced visual embeddings, which lie far outside the distribution of token embeddings the model saw during pre-training. Because the visual SFT dataset contains no mathematical or logical reasoning data, nothing in the loss penalizes drift away from those skills. The model updates its weights to represent shapes and spatial relationships, and in doing so overwrites parameters that previously supported coding syntax and logical inference. This is a form of catastrophic forgetting.
When the backbone is fully unfrozen and trained only on vision instructions, the updates (cyan) follow the vision-language alignment direction (purple), pulling the model's weights away from the math/code reasoning basin (green dot).
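To make the projection step concrete, here is a minimal pure-Python sketch of a linear adapter. The function names (`make_projector`, `project_patches`, `build_input_sequence`) and the toy dimensions are illustrative assumptions, not code from LLaVA or any other model; real adapters are trained, whereas these weights are simply random.

```python
import random

def make_projector(vision_dim, llm_dim, seed=0):
    """Return a (vision_dim x llm_dim) weight matrix and an llm_dim bias."""
    rng = random.Random(seed)
    scale = vision_dim ** -0.5
    weight = [[rng.gauss(0.0, scale) for _ in range(llm_dim)]
              for _ in range(vision_dim)]
    bias = [0.0] * llm_dim
    return weight, bias

def project_patches(patches, weight, bias):
    """Map each patch embedding (length vision_dim) to the LLM embedding width."""
    projected = []
    for patch in patches:
        row = [
            sum(p * weight[i][j] for i, p in enumerate(patch)) + bias[j]
            for j in range(len(bias))
        ]
        projected.append(row)
    return projected

def build_input_sequence(projected_patches, text_embeddings):
    """Prepend the projected image tokens to the text token embeddings."""
    return projected_patches + text_embeddings

# Toy sizes chosen for readability (hypothetical values).
VISION_DIM, LLM_DIM = 8, 16
weight, bias = make_projector(VISION_DIM, LLM_DIM)
patches = [[0.1 * (i + j) for j in range(VISION_DIM)] for i in range(4)]  # 4 image patches
text = [[0.0] * LLM_DIM for _ in range(3)]                                # 3 text tokens
sequence = build_input_sequence(project_patches(patches, weight, bias), text)
print(len(sequence), len(sequence[0]))  # 7 16
```

From the LLM's point of view, the four projected patches are just four extra "tokens" at the front of the sequence, which is why nothing in the backbone architecture has to change.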
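The conflict can be illustrated with a deliberately tiny toy problem. The sketch below assumes two quadratic losses over a two-parameter "model", with hypothetical optima `TEXT_OPT` and `VISION_OPT`; it is an analogy for the geometry described above, not a simulation of a real LLM.

```python
import math

TEXT_OPT = (0.0, 0.0)    # hypothetical optimum for text reasoning
VISION_OPT = (3.0, 4.0)  # hypothetical optimum for vision alignment

def loss(w, target):
    """Quadratic loss: half the squared distance to the task optimum."""
    return 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, target))

def grad(w, target):
    """Gradient of the quadratic loss with respect to w."""
    return tuple(wi - ti for wi, ti in zip(w, target))

def cosine(a, b):
    """Cosine similarity between two gradient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def train_vision_only(w, steps=20, lr=0.2):
    """Plain gradient descent using only the vision loss."""
    for _ in range(steps):
        g = grad(w, VISION_OPT)
        w = tuple(wi - lr * gi for wi, gi in zip(w, g))
    return w

w0 = (0.5, 0.0)  # start close to the text optimum, like a pre-trained LLM
conflict = cosine(grad(w0, TEXT_OPT), grad(w0, VISION_OPT))
w_final = train_vision_only(w0)
print(f"gradient cosine at start: {conflict:.2f}")
print(f"text loss before: {loss(w0, TEXT_OPT):.3f}, after: {loss(w_final, TEXT_OPT):.3f}")
```

A negative cosine means that a step which lowers the vision loss raises the text loss. With no text term in the objective, the toy model simply walks to the vision optimum and its text loss grows along the way.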
The First Fix: Joint Training Mixtures
The earliest and most straightforward mitigation was the introduction of **balanced replay mixtures**. Instead of fine-tuning the model exclusively on image-text pairs, researchers interleaved high-quality, pure-text instruction datasets (e.g. GSM8K, MBPP, UltraChat) into the visual instruction-tuning batches.
This mixed training acts as a powerful regularizer. The gradients computed on the text reasoning batches pull the model back toward its original reasoning capabilities, while the vision batches align the adapter. Freezing is the other way to get the same guarantee. When Meta trained Llama 3.2-Vision, they connected the image encoder to the language model through cross-attention adapter layers and, according to Meta's release notes, did not update the language model's own parameters during adapter training, so the text-only behavior of the underlying Llama 3.1 model is preserved by construction.
By keeping the text share of the mix at 50% or above, the parameter trajectory stays within the intersection zone (cyan overlap), achieving stable vision understanding without degrading text reasoning.
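A replay mixture is mostly a data-loading decision. The sketch below shows one simple way to build batches with a fixed share of text-only examples; `mixed_batches` and its parameters are hypothetical names for illustration, and real training pipelines typically sample by token count and dataset weight rather than by example count.

```python
import random

def mixed_batches(vision_examples, text_examples, batch_size,
                  text_fraction, num_batches, seed=0):
    """Yield batches containing a fixed share of text-only replay examples."""
    if not 0.0 <= text_fraction <= 1.0:
        raise ValueError("text_fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_text = round(batch_size * text_fraction)
    n_vision = batch_size - n_text
    for _ in range(num_batches):
        # Sample with replacement from each pool, then shuffle within the batch.
        batch = (rng.choices(text_examples, k=n_text)
                 + rng.choices(vision_examples, k=n_vision))
        rng.shuffle(batch)
        yield batch

# Placeholder datasets: each example is just a (kind, index) tuple.
vision = [("vqa", i) for i in range(100)]
text = [("text", i) for i in range(100)]
batches = list(mixed_batches(vision, text, batch_size=8,
                             text_fraction=0.5, num_batches=3))
for batch in batches:
    n_text_in_batch = sum(1 for kind, _ in batch if kind == "text")
    print(n_text_in_batch, "text examples of", len(batch))  # 4 text examples of 8
```

Fixing the count per batch, rather than sampling the share at random, guarantees that every optimizer step sees both kinds of gradient.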
Breakthrough: Early-Fusion Architectures
While balanced mixtures addressed forgetting during SFT, late-fusion adapter architectures still had a more fundamental limitation: vision remained an afterthought bolted onto a language model. Starting in late 2023, labs introduced natively multimodal models pre-trained **from scratch** as unified architectures, such as Google's **Gemini** and, in 2024, Meta's **Chameleon**.
Instead of grafting an adapter onto a pre-trained LLM, early-fusion models represent text and images in a single token sequence. An image is split into patches, which are either mapped to discrete visual tokens by a learned image tokenizer such as a VQ-GAN (so that text and image tokens share one vocabulary, as in Chameleon) or projected into continuous embedding vectors. The entire model is then pre-trained from day one on interleaved web data: text, images, and mixed text-image sequences. Because text and vision co-evolve inside the same transformer layers, the model learns a unified representation, avoiding the modality-grafting shock and keeping text benchmarks on par with text-only models.
In **Early Fusion**, there is no separately trained adapter bridging two pre-trained models. Images and text flow through the same layers as token sequences from day one, allowing the model to develop joint representations.
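The shared-vocabulary idea can be sketched in a few lines. Everything below is a toy assumption: a four-entry codebook of 2-D codes, a text vocabulary of 1,000 IDs, and hypothetical begin/end-of-image markers. A real image tokenizer learns its codebook and operates on much larger patch features.

```python
TEXT_VOCAB_SIZE = 1000  # hypothetical text vocabulary size
CODEBOOK = [            # hypothetical 4-entry visual codebook of 2-D codes
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
]
BOI = TEXT_VOCAB_SIZE + len(CODEBOOK)  # begin-of-image marker ID
EOI = BOI + 1                          # end-of-image marker ID

def quantize_patch(patch):
    """Return the index of the nearest codebook entry (squared L2 distance)."""
    return min(
        range(len(CODEBOOK)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(patch, CODEBOOK[k])),
    )

def image_to_tokens(patches):
    """Map patch vectors to IDs in the shared vocabulary, wrapped in markers."""
    return [BOI] + [TEXT_VOCAB_SIZE + quantize_patch(p) for p in patches] + [EOI]

def interleave(segments):
    """Flatten a document of ('text', ids) and ('image', patches) segments."""
    tokens = []
    for kind, payload in segments:
        tokens.extend(payload if kind == "text" else image_to_tokens(payload))
    return tokens

doc = [
    ("text", [12, 57, 3]),
    ("image", [(0.1, 0.9), (0.8, 0.2)]),
    ("text", [44]),
]
print(interleave(doc))  # [12, 57, 3, 1004, 1001, 1002, 1005, 44]
```

Once images are offset into their own ID range, the transformer sees one flat sequence of integers and can be trained with the same next-token objective for both modalities.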
Extracting Dense Visual Information
Even with unified pre-training, a core limitation remained: compression. Standard vision encoders downsample a large 1024x1024 pixel image to a low input resolution and encode it as a small, fixed grid of visual tokens (often 256 or 576). This acts like a low-pass filter, wiping out dense small text, math symbols, and fine chart details.
To address this, multimodal models from around 2025 attacked the problem from two sides. Models such as **Qwen2.5-VL** encode images at native, dynamic resolution rather than a fixed grid, and reasoning-focused models such as **Gemini 2.0 Flash Thinking** moved from one-shot generation to **test-time compute (inference scaling)**. Instead of answering immediately, a model of this kind generates an internal Chain of Thought (CoT), and the approach extends naturally to spending compute on sub-regions of the image at native resolution, dynamically "zooming in" on details of interest before executing the final logical steps. This mirrors human perception, allowing the model to ground its textual logic in high-fidelity visual evidence.
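The token counts quoted above fall out of simple arithmetic. The sketch below assumes a ViT-style encoder with 14-pixel patches and square inputs of 224 or 336 pixels, which are common configurations; the helper names are illustrative.

```python
def visual_token_count(input_size, patch_size):
    """Number of patch tokens a ViT-style encoder produces for a square input."""
    per_side = input_size // patch_size
    return per_side * per_side

def source_pixels_per_token(original_size, input_size, patch_size):
    """Original-image pixels summarized by one token after resizing to input_size."""
    per_side = input_size // patch_size
    return (original_size / per_side) ** 2

print(visual_token_count(224, 14))                     # 256
print(visual_token_count(336, 14))                     # 576
print(round(source_pixels_per_token(1024, 336, 14)))   # 1820
```

Under these assumptions, each of the 576 tokens has to summarize roughly a 43x43 pixel region of the original 1024x1024 image, which is larger than many glyphs in a dense document or chart label.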
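Why zooming helps can be shown with the same kind of arithmetic. The sketch below is an illustration of the general crop-and-re-encode idea, not the mechanism of any particular model; `crop_box`, `pixels_per_token_side`, and the 24x24 token grid are assumptions made for the example.

```python
def crop_box(image_w, image_h, region):
    """Convert a normalized (x0, y0, x1, y1) region into a clamped pixel box."""
    x0, y0, x1, y1 = region
    left = max(0, int(x0 * image_w))
    top = max(0, int(y0 * image_h))
    right = min(image_w, int(x1 * image_w))
    bottom = min(image_h, int(y1 * image_h))
    if right <= left or bottom <= top:
        raise ValueError("empty region")
    return left, top, right, bottom

def pixels_per_token_side(width, height, tokens_per_side=24):
    """Source pixels covered by one token along each axis for a fixed token grid."""
    return width / tokens_per_side, height / tokens_per_side

# Full image: 1024x1024 squeezed into a 24x24 grid of tokens.
full = pixels_per_token_side(1024, 1024)

# The model requests the region it wants to inspect (normalized coordinates).
box = crop_box(1024, 1024, (0.25, 0.5, 0.5, 0.75))
zoomed = pixels_per_token_side(box[2] - box[0], box[3] - box[1])

print(box)                   # (256, 512, 512, 768)
print(full[0] / zoomed[0])   # 4.0
```

Re-encoding the 256x256 crop with the same token budget gives each token a region four times smaller per side, so details that were blurred away in the full view become legible.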
The Modality Tax is Dead
Today, the "modality tax" is close to zero for well-trained models. Modern multimodal models are trained using native early fusion, or using cross-attention adapters and balanced training mixtures that protect the language backbone. When we compare Llama 3.2 11B Vision against its text-only progenitor Llama 3.1 8B, or Qwen2-VL 7B against Qwen2 7B, their text benchmarks are on par or even slightly improved.
Here is how modern multimodal models compare with their text-only counterparts, suggesting that adding vision no longer has to degrade the core reasoning ability of a model:
| Model | Modality | GSM8K (Math) | HumanEval (Code) | MMLU (Reasoning) |
|---|---|---|---|---|
| Llama 3.1 8B Instruct | Text Only | 84.5% | 72.6% | 73.0% |
| Llama 3.2 11B Vision | Vision + Text | 84.8% | 73.1% | 73.5% |
| Qwen2 7B Instruct | Text Only | 82.3% | 78.4% | 70.5% |
| Qwen2-VL 7B Instruct | Vision + Text | 83.1% | 79.0% | 71.2% |
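The "tax" can be read off the table as a simple difference. The sketch below copies the figures exactly as listed in the table above and defines the tax as the text-only score minus the multimodal score; the `modality_tax` helper is a hypothetical name for illustration.

```python
# Scores copied from the table above (percent).
SCORES = {
    "Llama 3.1 8B Instruct": {"GSM8K": 84.5, "HumanEval": 72.6, "MMLU": 73.0},
    "Llama 3.2 11B Vision":  {"GSM8K": 84.8, "HumanEval": 73.1, "MMLU": 73.5},
    "Qwen2 7B Instruct":     {"GSM8K": 82.3, "HumanEval": 78.4, "MMLU": 70.5},
    "Qwen2-VL 7B Instruct":  {"GSM8K": 83.1, "HumanEval": 79.0, "MMLU": 71.2},
}
PAIRS = [
    ("Llama 3.1 8B Instruct", "Llama 3.2 11B Vision"),
    ("Qwen2 7B Instruct", "Qwen2-VL 7B Instruct"),
]

def modality_tax(text_model, vision_model):
    """Text-only score minus multimodal score; positive means vision cost points."""
    return {
        bench: round(SCORES[text_model][bench] - SCORES[vision_model][bench], 1)
        for bench in SCORES[text_model]
    }

for text_model, vision_model in PAIRS:
    print(vision_model, modality_tax(text_model, vision_model))
```

Every difference comes out at or below zero for these rows, which is the sense in which the tax has disappeared. Differences of a few tenths of a point are within the range that prompt format and evaluation settings can move a benchmark score.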
By shifting from naive fine-tuning to balanced mixtures, native early fusion, and test-time visual crop search, AI models have moved from "blind text calculators" to grounded, spatial, and mathematical thinkers with little or no loss in text-only performance.