Three Models Where One Should Suffice
Consider a home robot instructed to clear a dinner table. Under the current paradigm, the robot needs three separate systems: a vision-language model to locate the dishes and generate a plan, a vision-language-action model to convert that plan into motor commands, and a world model to simulate future states and evaluate whether the planned action sequence is safe. These three systems are trained on different datasets, maintain different internal representations, and communicate through awkward intermediate formats.
This fragmentation is not just inelegant — it's fundamentally limiting. A world model that doesn't share weights with the VLM can't leverage the rich semantic understanding the VLM has built. A VLA that doesn't share weights with the world model can't simulate the visual consequences of its own actions. The systems are strung together like incompatible adapters.
Cosmos 3 argues that understanding and generation depend on each other. Understanding requires reasoning about how the world will evolve and what actions will cause, and those capabilities live in generative models. Generation relies on compact, structured world representations, and those live in understanding models. The two are inseparable, and forcing them into different model classes wastes both compute and potential.
The solution: a single omnimodal world model trained jointly on language, image, video, audio, and action data, using a unified Mixture-of-Transformers (MoT) backbone that dedicates separate parameter sets to reasoning and generation while sharing attention across both.
Everything Is Tokens — But Not All Tokens Are Equal
The first step to unifying five modalities is encoding them into a common representation space. Cosmos 3 uses modality-specific encoders for this:
- Vision (understanding): A ViT encoder with 16×16 patches, followed by a 2×2 token merge MLP. Used for reasoning tasks.
- Vision (generation): A frozen VAE from Wan2.2, which compresses video temporally by 4× and spatially by 32×32. Used for video generation.
- Audio: A frozen audio VAE producing 25 tokens per second of stereo 48 kHz audio.
- Action: Domain-aware linear projections that map robot joints, camera poses, gripper states, and ego-vehicle trajectories into the shared latent space. Each embodiment (single-arm robot, humanoid, autonomous vehicle, camera) gets its own projection matrix but shares the MoT backbone.
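To get a feel for how many tokens each encoder contributes, here is a back-of-the-envelope sketch. It assumes "32×32" means a stride of 32 along each spatial axis, that the temporal stride simply rounds up, and that input sizes divide evenly; the function names and example resolutions are hypothetical, not taken from Cosmos 3.

```python
import math

def vit_tokens(height, width, patch=16, merge=2):
    """Tokens left after 16x16 patching followed by a 2x2 token merge."""
    patches_h = height // patch
    patches_w = width // patch
    return (patches_h // merge) * (patches_w // merge)

def video_latent_tokens(frames, height, width, t_stride=4, s_stride=32):
    """Latent tokens after 4x temporal and 32x32 spatial compression.

    Assumption: the temporal axis is rounded up; a real causal VAE may
    treat the first frame differently.
    """
    return math.ceil(frames / t_stride) * (height // s_stride) * (width // s_stride)

def audio_tokens(seconds, tokens_per_second=25):
    """Audio latents at the 25 tokens-per-second rate quoted above."""
    return round(seconds * tokens_per_second)

print(vit_tokens(448, 448))               # 196
print(video_latent_tokens(24, 480, 832))  # 2340
print(audio_tokens(2.0))                  # 50
```

Even in this rough form, the numbers show why the generation path needs aggressive compression: one second of modest-resolution video still costs far more tokens than a full image on the understanding path.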
All tokens flow into a single transformer — but they're split into two fundamentally different subsequences that drive different kinds of computation.
The autoregressive (AR) subsequence carries text tokens and ViT-encoded vision tokens. It processes information causally, like a standard language model. The diffusion (DM) subsequence carries VAE-encoded video tokens, audio tokens, and action tokens. During training, the target tokens in this subsequence are corrupted with noise and the model learns to denoise them under a flow-matching objective; during inference, they are denoised iteratively.
The critical rule for the DM tokens: clean conditioning tokens always come before noisy target tokens. So for Image-to-Video, the clean first frame sits in front of the noisy frames to generate. For Forward Dynamics, the clean action tokens sit before the noisy future video frames. This single layout rule enables all six generation modes from one unified format.
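The layout rule is easy to express in code. The sketch below is illustrative only: `Segment`, `build_dm_layout`, and the segment lengths are hypothetical names and numbers, and only the two modes named above are shown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    name: str
    length: int   # number of tokens in this segment
    noisy: bool   # True if the segment is a generation target

def build_dm_layout(conditioning, targets):
    """Place all clean conditioning segments before all noisy targets."""
    layout = [Segment(name, length, False) for name, length in conditioning]
    layout += [Segment(name, length, True) for name, length in targets]
    return layout

def noise_mask(layout):
    """Per-token flags: False for clean context, True for tokens to denoise."""
    mask = []
    for segment in layout:
        mask.extend([segment.noisy] * segment.length)
    return mask

# Image-to-Video: clean first frame, then noisy future frames.
image_to_video = build_dm_layout([("first_frame", 2)], [("future_frames", 4)])
# Forward Dynamics: clean actions, then noisy future frames.
forward_dynamics = build_dm_layout([("actions", 3)], [("future_frames", 4)])

print(noise_mask(image_to_video))    # [False, False, True, True, True, True]
print(noise_mask(forward_dynamics))  # [False, False, False, True, True, True, True]
```

Because every mode reduces to "some clean segments, then some noisy segments", the same builder covers both examples without any task-specific branching.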
Two Towers, One Attention Layer
The MoT architecture's central insight is that reasoning and generation require different computations, but they benefit enormously from attending to each other. The solution: each transformer layer has two independent parameter sets — the Reasoner tower for AR tokens and the Generator tower for DM tokens — but they share a single attention operation.
The attention asymmetry is the key mechanism: the AR tower uses causal (triangular) self-attention — it can only see past tokens, like a standard autoregressive LM. The DM tower uses full bidirectional attention, with the keys and values from both AR and DM subsequences concatenated together. This means the generator can freely attend to the entire text prompt and all conditioning tokens, but the reasoner remains causally self-contained — the AR output is never contaminated by information from the noisy DM tokens.
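One way to picture the asymmetry is as a single boolean mask over the concatenated sequence. This is a minimal NumPy sketch, assuming the sequence is ordered as AR tokens followed by DM tokens; `mot_attention_mask` is a hypothetical helper, not a function from the Cosmos 3 codebase.

```python
import numpy as np

def mot_attention_mask(n_ar, n_dm):
    """Return mask[q, k] = True where query q may attend to key k.

    Rows and columns are ordered as [AR tokens, DM tokens].
    """
    n = n_ar + n_dm
    mask = np.zeros((n, n), dtype=bool)
    # AR queries: causal attention over AR keys only.
    mask[:n_ar, :n_ar] = np.tril(np.ones((n_ar, n_ar), dtype=bool))
    # DM queries: full bidirectional attention over AR and DM keys.
    mask[n_ar:, :] = True
    return mask

mask = mot_attention_mask(n_ar=3, n_dm=2)
print(mask.astype(int))
```

The top-right block of the mask is all zeros, which is the "reasoner never sees noisy tokens" property; the bottom rows are all ones, which is the "generator sees everything" property.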
Wait — when does contamination actually happen?
You might wonder: if the DM tokens are only noisy during the iterative denoising process, and the AR reasoning is done by the time we start generating, then where does the contamination come from? The answer is that the asymmetric attention matters during training, not inference. During training, every forward pass contains both subsequences simultaneously — the AR tokens and the noisy DM tokens sit in the same sequence and flow through the same transformer layers together. Without the separation, the AR tower's queries would attend to the noisy DM keys and values, learning to incorporate corrupted signals into its representations.
But doesn't causal attention already see backwards?
Here's the sharper version of the question: if the AR tokens use causal attention (each token attends to everything before it), and some DM tokens appeared before an AR token in the sequence, wouldn't that AR token causally attend to those noisy tokens? Yes — and Cosmos 3 prevents this through two independent mechanisms that reinforce each other:
- Layout constraint: The sequence format always places all AR tokens before all DM tokens. The layout is [text, ViT, EOS, BOG | cond., noisy targets]. Since DM tokens only appear after the AR subsequence, a causal mask naturally can't reach them: they're in the future. This is why the EOS/BOG delimiters exist: they mark the hard boundary.
- Architectural key-value separation: Even beyond ordering, the two towers produce separate projection outputs. Look at the formulas in the diagram above: Q_AR → causal(K_AR, V_AR). The AR queries are only ever multiplied against AR keys and values. The DM keys and values (K_DM, V_DM) are never concatenated into the AR attention pool, so they simply don't exist in the AR tower's computation. This is not just a positional mask; it's a hard architectural wall at the projection level.
So the answer is: both. The layout ensures that causal attention alone would be sufficient, because AR tokens never have DM tokens earlier in the sequence. And the key-value separation provides a belt-and-suspenders guarantee: even if the layout were scrambled, the AR tower would still never see DM representations because they're in a completely separate key-value pool.
At inference time, the situation is even simpler: the AR tokens are processed first and their key-value pairs are cached. Then the DM tokens are iteratively denoised over multiple steps, attending to the frozen AR cache at each step. The AR outputs are never recomputed, so there's no contamination path at all. But it's the training-time separation — both layout and architectural — that makes joint multi-task training work without degrading the reasoner.
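The inference flow can be sketched with a toy model. Everything here is a stand-in: `ToyModel` is hypothetical, its "cache" is just an array, and its velocity function is an oracle that already knows the clean target. The sketch also assumes the convention that t = 1 is pure noise and t = 0 is clean data, integrated with plain Euler steps. The only point is the control flow: one AR pass, then many denoising steps against a frozen cache.

```python
import numpy as np

class ToyModel:
    def __init__(self):
        self.ar_passes = 0

    def encode_ar(self, prompt):
        """Run the Reasoner once and return a frozen cache (toy: an array)."""
        self.ar_passes += 1
        return {"kv": np.asarray(prompt, dtype=float)}

    def velocity(self, x_t, t, cache):
        """Toy oracle: treat the cached prompt vector as the clean target.

        With x_t = (1 - t) * x0 + t * noise, the velocity noise - x0
        equals (x_t - x0) / t.
        """
        return (x_t - cache["kv"]) / t

def generate(model, prompt, num_steps=8, seed=0):
    cache = model.encode_ar(prompt)             # AR pass happens exactly once
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(cache["kv"].shape)  # start from pure noise at t = 1
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        # Euler step toward t = 0; the cache is read but never updated.
        x = x + (t_next - t) * model.velocity(x, t, cache)
    return x

model = ToyModel()
sample = generate(model, [1.0, -2.0, 0.5])
print(sample, model.ar_passes)
```

In a real system the velocity would come from the Generator tower attending to cached AR keys and values, but the loop structure would look much the same.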
Both towers are initialized from the same pre-trained Qwen3-VL weights (8B for Nano, 32B for Super). This means Cosmos 3 doesn't start from scratch — it inherits strong language and visual reasoning from the VLM, then extends those capabilities to generation.
Change What's Noisy, Change the Task
The most elegant aspect of Cosmos 3 is how the same model does six completely different things. The trick: which tokens are noisy determines what the model generates. During training, the model learns to denoise noisy tokens given clean context. During inference, you inject noise into whichever tokens you want to generate, and the model denoises them.
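A minimal sketch of how a noise mask could define the task during training, assuming the common linear flow-matching interpolation x_t = (1 - t)·x + t·ε with velocity target ε - x. The function names, shapes, and segment sizes are hypothetical, and the exact schedule used by Cosmos 3 may differ.

```python
import numpy as np

def make_training_example(clean, noisy_mask, t, rng):
    """Noise only the masked tokens; leave conditioning tokens clean.

    clean:      (tokens, dim) array of clean latents
    noisy_mask: (tokens,) boolean array, True for generation targets
    """
    eps = rng.standard_normal(clean.shape)
    mask = noisy_mask[:, None]
    x_t = np.where(mask, (1.0 - t) * clean + t * eps, clean)
    target_velocity = eps - clean
    return x_t, target_velocity

def masked_loss(pred, target, noisy_mask):
    """Mean squared error computed over the noisy tokens only."""
    per_token = ((pred - target) ** 2).mean(axis=-1)
    return per_token[noisy_mask].mean()

rng = np.random.default_rng(0)
clean = np.ones((5, 2))  # hypothetical: 2 action tokens, then 3 video tokens
forward_dynamics = np.array([False, False, True, True, True])
x_t, v = make_training_example(clean, forward_dynamics, t=0.5, rng=rng)
print(x_t[:2])  # the action tokens are still clean
```

Flipping the mask so that the action tokens are noisy and the video tokens are clean would ask the same network for actions instead of frames, with no change to the code path.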
For action modeling specifically, Cosmos 3 encodes actions as pseudo-actions: relative SE(3) poses between consecutive frames, using the 6D rotation representation, so each pose is 9D (3D translation plus 6D rotation). A single-arm robot maps to a 9D ego pose + 9D end-effector pose + 1D gripper = 19D vector. A humanoid with two arms maps to 9D ego + 9D×2 wrists + 15D×2 finger positions = 57D. Each embodiment type gets its own input/output projection matrices while the MoT backbone weights are shared across embodiments, enabling zero-shot transfer across robot morphologies.
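The dimension counting can be checked with a short sketch. It assumes poses are 4×4 homogeneous transforms, that the relative pose is inv(T_t) · T_{t+1}, and that the 6D rotation is the first two columns of the rotation matrix; the ordering of components inside the vector and all function names are hypothetical.

```python
import numpy as np

def relative_pose(T_a, T_b):
    """Pose of frame b expressed in frame a (both 4x4 homogeneous matrices)."""
    return np.linalg.inv(T_a) @ T_b

def pose_to_9d(T):
    """3D translation followed by the first two rotation columns (6D)."""
    translation = T[:3, 3]
    rot6d = T[:3, :2].T.reshape(-1)
    return np.concatenate([translation, rot6d])

def single_arm_action(ego_a, ego_b, ee_a, ee_b, gripper):
    """9D ego + 9D end-effector + 1D gripper = 19D pseudo-action."""
    return np.concatenate([
        pose_to_9d(relative_pose(ego_a, ego_b)),
        pose_to_9d(relative_pose(ee_a, ee_b)),
        [gripper],
    ])

HUMANOID_DIM = 9 + 2 * 9 + 2 * 15  # ego + two wrists + two hands = 57

identity = np.eye(4)
moved = np.eye(4)
moved[:3, 3] = [0.1, 0.0, 0.0]  # end effector shifts 10 cm along x

action = single_arm_action(identity, identity, identity, moved, gripper=1.0)
print(action.shape, HUMANOID_DIM)  # (19,) 57
```

A stationary pose encodes as zero translation plus the first two columns of the identity, so "no motion" is a fixed, non-degenerate vector rather than all zeros.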
When One Position Unit Must Mean One Physical Second
Video at 24 FPS and video at 30 FPS are both "video", but their tokens arrive at different rates. Audio produces 25 tokens per second. Action trajectories might sample at 10 Hz for a robot or 30 Hz for a vehicle. If the position embedding just counts token indices, a 30 FPS video would be temporally "stretched" compared to a 24 FPS video of the same scene, and audio tokens would land at completely wrong temporal positions relative to the video events they're supposed to accompany.
Cosmos 3 solves this with FPS modulation: instead of advancing the temporal position by 1 per token, it advances by δt = TPSbase / TPS, where TPSbase = 6 (24 FPS ÷ 4× compression) and TPS is the actual temporal sampling rate of each modality.
TPSvideo = FPS ÷ 4 ← 4× temporal compression by VAE
TPSaudio = 48000 ÷ 1920 = 25 ← 48 kHz, hop=1920
TPSbase = 24 ÷ 4 = 6 ← 24 FPS is the canonical frame rate
The result: at any frame rate, one second of real-world time always spans exactly 6 position units. A thunder clap at t=1.2s in the audio stream lands at position 7.2, and the video frame at t=1.2s also lands at position 7.2 — regardless of whether the video is at 16 or 30 FPS. This alignment is what lets the model learn that a lightning bolt in frame N causes thunder at audio token N+k, and then reproduce that timing in generation.
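The alignment claim is simple enough to verify numerically. The sketch below implements the δt = TPSbase / TPS rule with the constants quoted above; the function names are hypothetical.

```python
TPS_BASE = 24 / 4  # 6 position units per second of real time

def video_tps(fps, temporal_compression=4):
    """Video latent tokens per second after 4x temporal compression."""
    return fps / temporal_compression

def audio_tps(sample_rate=48_000, hop=1920):
    """Audio latent tokens per second."""
    return sample_rate / hop

def temporal_positions(num_tokens, tps):
    """Position of each token when advancing by TPS_BASE / tps per token."""
    delta_t = TPS_BASE / tps
    return [i * delta_t for i in range(num_tokens)]

# An event at t = 1.2 s is audio token 30 (at 25 tokens/s) ...
audio_pos = temporal_positions(31, audio_tps())[30]
# ... and video token 9 in a 30 FPS clip (7.5 tokens/s).
video_pos = temporal_positions(10, video_tps(30))[9]
print(audio_pos, video_pos)  # both approximately 7.2
```

Both streams land on the same position because the per-token step shrinks exactly as the token rate grows, so position depends only on elapsed time.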
State of the Art Across Five Task Families
Cosmos 3 is evaluated on an unusually broad suite — 48 reasoning benchmarks, plus image generation, video generation, audio synchronization, physical dynamics, and robot policy. The key question is whether the unified architecture loses ground to specialists. The answer, consistently, is no.
A few highlights: On Human World Bench, the most demanding egocentric manipulation benchmark, where annotators judge instruction-following and physical plausibility frame by frame, Cosmos3-Super achieves 71.9, beating the closed-source Veo-3.1 (67.8) by 4.1 points. On Physics-IQ, which tests whether generated physics actually matches real-world outcomes (not just perceptual quality), Cosmos3-Super at 43.8 beats Sora2 at 42.3. On robot policy, the Cosmos3-Nano-Policy-DROID model ranked #1 on RoboArena, the real-world crowdsourced robot benchmark, overtaking π0.5 and all other open models.
The key insight from the ablations: models initialized from the mid-trained checkpoint (MT-init, which includes action data from diverse embodiments) consistently outperform those starting from just the pre-trained checkpoint (PT-init). After 500 iterations of fine-tuning on a new robot embodiment (LIBERO), MT-init reaches 24.6% success while PT-init stays at 0.0%. This is the benefit of the unified world prior: the model has already learned that actions cause visual changes, and it transfers this knowledge to new robots much faster.