
Cosmos 3: One World Model to Perceive, Reason, Generate, and Act

Building a physical AI agent today means stitching together a VLM, a video generator, and a robot policy model — three separate systems that don't share representations. NVIDIA's Cosmos 3 collapses all of this into a single Mixture-of-Transformers backbone: the same 64B weights can answer questions about a video, synthesize its continuation, generate the soundtrack, and predict the robot's next action — by simply changing which tokens are noisy.

June 2, 2026 ~14 min read Technical Report: Cosmos 3
01 — The Fragmentation Problem

Three Models Where One Should Suffice

Consider a home robot instructed to clear a dinner table. Under the current paradigm, the robot needs three separate systems: a vision-language model to locate the dishes and generate a plan, a vision-language-action model to convert that plan into motor commands, and a world model to simulate future states and evaluate whether the planned action sequence is safe. These three systems are trained on different datasets, maintain different internal representations, and communicate through awkward intermediate formats.

This fragmentation is not just inelegant — it's fundamentally limiting. A world model that doesn't share weights with the VLM can't leverage the rich semantic understanding the VLM has built. A VLA that doesn't share weights with the world model can't simulate the visual consequences of its own actions. The systems are strung together like incompatible adapters.

[Figure: current paradigm vs. Cosmos 3 (fragmented pipeline → unified model). Before: a VLM (reasoning), a VLA (action policy), and a world model (simulation) feed the robot separately, with no shared representations, 3× inference cost, and mismatched training data. After: a single Cosmos 3 omnimodal world model reasons, generates, and acts, with shared representations, 1× inference cost, and joint multi-task training.]
[Figure: capability overlap in the fragmented pipeline, showing where duplication lives, with a duplication cost summary.]

Cosmos 3 argues that understanding and generation cannot be cleanly separated. Understanding requires reasoning about future world evolution and action consequences, capabilities that live in generative models. Generation relies on compact, structured world representations, capabilities that live in understanding models. Forcing the two into different model classes wastes both compute and potential.

The solution: a single omnimodal world model trained jointly on language, image, video, audio, and action data, using a unified Mixture-of-Transformers (MoT) backbone that dedicates separate parameter sets to reasoning and generation while sharing attention across both.

02 — The Omnimodal Token Language

Everything Is Tokens — But Not All Tokens Are Equal

The first step to unifying five modalities is encoding them into a common representation space. Cosmos 3 uses modality-specific encoders for this:

  • Vision (understanding): A ViT encoder with 16×16 patches, followed by a 2×2 token merge MLP. Used for reasoning tasks.
  • Vision (generation): A frozen VAE from Wan2.2, which compresses video temporally by 4× and spatially by 32×32. Used for video generation.
  • Audio: A frozen audio VAE producing 25 tokens per second of stereo 48 kHz audio.
  • Action: Domain-aware linear projections that map robot joints, camera poses, gripper states, and ego-vehicle trajectories into the shared latent space. Each embodiment (single-arm robot, humanoid, autonomous vehicle, camera) gets its own projection matrix but shares the MoT backbone.
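To get a feel for the sequence lengths these encoders imply, here is a rough token-count sketch. The compression factors come from the list above; the 480×640 resolution, the function names, and the use of plain ceiling division are assumptions for illustration (a real VAE may, for example, treat the first frame specially).

```python
import math

def vit_tokens(height, width, patch=16, merge=2):
    """Tokens for one image after 16x16 patching and a 2x2 token merge."""
    grid_h = math.ceil(height / patch)
    grid_w = math.ceil(width / patch)
    return math.ceil(grid_h / merge) * math.ceil(grid_w / merge)

def vae_video_tokens(frames, height, width, t_stride=4, s_stride=32):
    """Latent tokens for a clip under 4x temporal and 32x32 spatial compression."""
    latent_frames = math.ceil(frames / t_stride)
    return latent_frames * math.ceil(height / s_stride) * math.ceil(width / s_stride)

def audio_tokens(seconds, tokens_per_second=25):
    """Audio latents at 25 tokens per second."""
    return round(seconds * tokens_per_second)

print(vit_tokens(480, 640))            # 30x40 patches merged 2x2 -> 300
print(vae_video_tokens(24, 480, 640))  # 6 latent frames x 15 x 20 -> 1800
print(audio_tokens(1.0))               # 25
```

Under these assumptions, one second of 24 FPS video at 480×640 costs 1,800 generation tokens, against 25 audio tokens for the same second.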

All tokens flow into a single transformer — but they're split into two fundamentally different subsequences that drive different kinds of computation.

[Figure: token sequence layout per mode, AR tokens (reasoning) vs. DM tokens (generation). Legend: text tokens (AR); special tokens (EOS/BOG); clean conditioning; noisy target (to denoise).]

The autoregressive (AR) subsequence carries text tokens and ViT-encoded vision tokens. It processes information causally, like a standard language model. The diffusion (DM) subsequence carries VAE-encoded video tokens, audio tokens, and action tokens. These are noisy during training and denoised iteratively during inference using a flow-matching objective.

The critical rule for the DM tokens: clean conditioning tokens always come before noisy target tokens. So for Image-to-Video, the clean first frame sits in front of the noisy frames to generate. For Forward Dynamics, the clean action tokens sit before the noisy future video frames. This single layout rule enables all six generation modes from one unified format.

03 — Mixture of Transformers

Two Towers, One Attention Layer

The MoT architecture's central insight is that reasoning and generation require different computations, but they benefit enormously from attending to each other. The solution: each transformer layer has two independent parameter sets — the Reasoner tower for AR tokens and the Generator tower for DM tokens — but they share a single attention operation.

[Figure: Mixture-of-Transformers architecture. Dual towers share attention; DM sees all, AR stays causal.
Input: the AR subsequence (text / ViT, EOS, BOG) followed by the DM subsequence (conditioning, noisy targets).
Each of the L transformer layers contains:
  • Reasoner tower: LayerNorm (AR), attention projections (AR) producing Q_AR, K_AR, V_AR, and FFN (AR); initialized from VLM weights.
  • Generator tower: LayerNorm (DM), attention projections (DM) producing Q_DM, K_DM, V_DM, and FFN (DM); initialized from VLM weights.
  • Shared attention: Q_AR → causal(K_AR, V_AR); Q_DM → full([K_AR; K_DM], [V_AR; V_DM]). DM attends to AR keys; AR is never updated by DM.
Output: next text token (AR out) and denoised token (DM out).]

The attention asymmetry is the key mechanism: the AR tower uses causal (triangular) self-attention — it can only see past tokens, like a standard autoregressive LM. The DM tower uses full bidirectional attention, with the keys and values from both AR and DM subsequences concatenated together. This means the generator can freely attend to the entire text prompt and all conditioning tokens, but the reasoner remains causally self-contained — the AR output is never contaminated by information from the noisy DM tokens.
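The asymmetry can be written down as a single boolean mask. The sketch below is illustrative only: it flattens the two-tower scheme into one matrix over an [AR | DM] sequence, and the names `build_mot_mask` and `render` are made up for this example rather than taken from the report.

```python
def build_mot_mask(n_ar, n_dm):
    """Return mask[q][k] == True where query q may attend to key k.

    Positions 0..n_ar-1 are AR tokens; the next n_dm positions are DM tokens.
    """
    n = n_ar + n_dm
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < n_ar:
                # AR query: causal attention, which only ever reaches AR keys.
                mask[q][k] = k <= q
            else:
                # DM query: full attention over every AR and DM key.
                mask[q][k] = True
    return mask

def render(mask):
    """Draw the mask with 'x' for allowed and '.' for blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)

print(render(build_mot_mask(n_ar=3, n_dm=2)))
# x....
# xx...
# xxx..
# xxxxx
# xxxxx
```

The top-right block stays empty: no AR row ever has an `x` in a DM column, which is the "reasoner remains causally self-contained" property in matrix form.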

Wait — when does contamination actually happen?

You might wonder: if the DM tokens are only noisy during the iterative denoising process, and the AR reasoning is done by the time we start generating, then where does the contamination come from? The answer is that the asymmetric attention matters during training, not inference. During training, every forward pass contains both subsequences simultaneously — the AR tokens and the noisy DM tokens sit in the same sequence and flow through the same transformer layers together. Without the separation, the AR tower's queries would attend to the noisy DM keys and values, learning to incorporate corrupted signals into its representations.

But doesn't causal attention already see backwards?

Here's the sharper version of the question: if the AR tokens use causal attention (each token attends to everything before it), and some DM tokens appeared before an AR token in the sequence, wouldn't that AR token causally attend to those noisy tokens? Yes — and Cosmos 3 prevents this through two independent mechanisms that reinforce each other:

  • Layout constraint: The sequence format always places all AR tokens before all DM tokens. The layout is [text, ViT, EOS, BOG | cond., noisy targets]. Since DM tokens only appear after the AR subsequence, a causal mask naturally can't reach them — they're in the future. This is why the EOS/BOG delimiters exist: they mark the hard boundary.
  • Architectural key-value separation: Even beyond ordering, the two towers produce separate projection outputs. Look at the formulas in the diagram above: Q_AR → causal(K_AR, V_AR). The AR queries are only ever multiplied against AR keys and values. The DM keys and values (K_DM, V_DM) are never concatenated into the AR attention pool — they simply don't exist in the AR tower's computation. This is not just a positional mask; it's a hard architectural wall at the projection level.

So the answer is: both. The layout ensures that causal attention alone would be sufficient, because no DM token ever precedes an AR token in the sequence. The key-value separation adds a belt-and-suspenders guarantee: even if the layout were scrambled, the AR tower would still never see DM representations, because they live in a completely separate key-value pool.

At inference time, the situation is even simpler: the AR tokens are processed first and their key-value pairs are cached. Then the DM tokens are iteratively denoised over multiple steps, attending to the frozen AR cache at each step. The AR outputs are never recomputed, so there's no contamination path at all. But it's the training-time separation — both layout and architectural — that makes joint multi-task training work without degrading the reasoner.
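The two-phase inference flow can be sketched with a toy stand-in for the model. Everything here is hypothetical: `ToyModel`, its `encode_ar` and `velocity` methods, the Euler integrator, and the oracle velocity are assumptions chosen so the example runs; the report only says that the DM tokens are denoised iteratively against a cached AR context.

```python
import random

class ToyModel:
    """Hypothetical stand-in for the two towers; not the Cosmos 3 API."""

    def __init__(self):
        self.ar_passes = 0
        self.dm_passes = 0

    def encode_ar(self, prompt):
        # Reasoner pass: runs once and returns a frozen "KV cache" (plain floats here).
        self.ar_passes += 1
        return tuple(float(len(word)) for word in prompt.split())

    def velocity(self, x, t, ar_cache):
        # Generator pass: reads the cache but never writes to it.
        # Toy oracle: the "clean" value of every token is the cache mean.
        self.dm_passes += 1
        target = sum(ar_cache) / len(ar_cache)
        return [(target - xi) / (1.0 - t) for xi in x]

def generate(model, prompt, n_tokens=4, steps=10, seed=0):
    rng = random.Random(seed)
    ar_cache = model.encode_ar(prompt)                   # phase 1: AR, computed once
    x = [rng.gauss(0.0, 1.0) for _ in range(n_tokens)]   # DM tokens start as noise
    dt = 1.0 / steps
    for i in range(steps):                               # phase 2: iterative denoising
        v = model.velocity(x, i * dt, ar_cache)
        x = [xi + dt * vi for xi, vi in zip(x, v)]       # one Euler step
    return x

model = ToyModel()
out = generate(model, "clear the dinner table")
print(model.ar_passes, model.dm_passes)  # 1 10
```

The counters make the point: the reasoner runs once, while the generator runs once per denoising step against the same immutable cache.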

[Figure: attention mask matrix, showing who attends to whom in each task mode. Legend: AR → AR (causal); DM → DM (full); DM → AR (full cross); AR → DM (blocked).]

Both towers are initialized from the same pre-trained Qwen3-VL weights (8B for Nano, 32B for Super). This means Cosmos 3 doesn't start from scratch — it inherits strong language and visual reasoning from the VLM, then extends those capabilities to generation.

04 — Six Tasks, One Model

Change What's Noisy, Change the Task

The most elegant aspect of Cosmos 3 is how the same model does six completely different things. The trick: which tokens are noisy determines what the model generates. During training, the model learns to denoise noisy tokens given clean context. During inference, you inject noise into whichever tokens you want to generate, and the model denoises them.
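As a rough illustration of "noise only what you want generated", the sketch below builds a flow-matching training example on toy scalar latents. It assumes a straight-line path from noise to data, so the target velocity is `clean - noise`; the report may use a different time convention or schedule, and the helper names are invented for this example.

```python
import random

def flow_matching_pair(clean, t, rng):
    """One training example under a straight-line noise-to-data path.

    x_t = (1 - t) * noise + t * clean, so the target velocity is clean - noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    x_t = [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]
    velocity = [c - n for n, c in zip(noise, clean)]
    return x_t, velocity

def noise_selected(tokens, noisy_mask, t, rng):
    """Noise only the flagged positions; the rest stay as clean context."""
    x_t, velocity = flow_matching_pair(tokens, t, rng)
    inputs = [xt if noisy else tok for tok, xt, noisy in zip(tokens, x_t, noisy_mask)]
    # Clean positions get no regression target (None).
    targets = [v if noisy else None for v, noisy in zip(velocity, noisy_mask)]
    return inputs, targets

rng = random.Random(0)
tokens = [0.5, -1.0, 2.0, 0.25]  # toy scalar "latents"
inputs, targets = noise_selected(tokens, [False, False, True, True], t=0.3, rng=rng)
print(inputs[:2], targets[:2])   # [0.5, -1.0] [None, None]
```

Changing the boolean mask is the whole task switch in this toy: the same function produces a different denoising problem depending on which positions are flagged.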

Generation modes: same weights, different clean/noisy assignment. S_AR denotes the AR subsequence (text and ViT tokens followed by EOS/BOG); a tilde marks noisy tokens.

  1. Vision-Language Model (VLM). Standard autoregressive text/image understanding. No diffusion subsequence is active. The model answers questions about images and videos, generates captions, and reasons about scenes, exactly like Qwen3-VL.
     S = [l₁ … lₙ] → next token prediction
  2. Text-to-Image. The text prompt goes in AR and noisy image tokens in DM. The model denoises the image tokens conditioned on the language context, using a flow-matching objective that predicts the velocity field from noise to clean image.
     S = [l₁…lₙ, EOS, BOG, ṽ₁] → denoise ṽ₁
  3. Text-to-Video (+ optional Audio). Text goes in AR; noisy video frames and, optionally, noisy audio tokens go in DM. The model jointly denoises video and audio, learning to generate synchronized content. Audio tokens are appended after the video tokens.
     S = [S_AR, ṽ₁:N, s̃] → denoise video + audio
  4. Image-to-Video (Conditioning). The first frame is encoded clean by the VAE and placed before the noisy frames as conditioning. Taking P as the number of clean conditioning frames, P=1 gives I2V; P>1 gives video-to-video continuation. The model learns physical continuation of the world from the visual seed.
     S = [S_AR, v₁, ṽ₂:N] → denoise frames 2…N
  5. Forward Dynamics (World Simulator). Clean video context plus clean action tokens predict noisy future video. Given what the robot did (clean actions), the model simulates what the world would look like. This is the forward model: action → future visual state.
     S = [S_AR, v_ctx, a_clean, ṽ_future] → denoise future
  6. Robot Policy (Video + Action Co-generation). Clean video context is followed by jointly noisy future video and noisy action tokens. The model co-generates the action sequence and its visual consequence simultaneously: a world-action model that reasons about what to do and what will happen.
     S = [S_AR, v_ctx, ṽ_future, ã_future] → denoise both
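The six layouts above can be captured in a small lookup table, which also makes the "clean before noisy" rule checkable. This is a sketch of the bookkeeping only; the mode keys and segment names are labels invented for this example, not identifiers from the report.

```python
# DM subsequence per mode, in order; each entry is (segment, state).
MODES = {
    "vlm": [],
    "text_to_image": [("image", "noisy")],
    "text_to_video_audio": [("video", "noisy"), ("audio", "noisy")],
    "image_to_video": [("first_frame", "clean"), ("video", "noisy")],
    "forward_dynamics": [("video_ctx", "clean"), ("action", "clean"),
                         ("video_future", "noisy")],
    "policy": [("video_ctx", "clean"), ("video_future", "noisy"),
               ("action_future", "noisy")],
}

def build_sequence(mode):
    """AR tokens first, then EOS/BOG and the DM segments if the mode generates."""
    sequence = [("text", "ar")]
    dm = MODES[mode]
    if dm:
        sequence += [("EOS", "special"), ("BOG", "special")]
    return sequence + list(dm)

def clean_before_noisy(mode):
    """True if every clean DM segment precedes every noisy one."""
    states = [state for _, state in MODES[mode]]
    first_noisy = states.index("noisy") if "noisy" in states else len(states)
    return (all(s == "clean" for s in states[:first_noisy])
            and all(s == "noisy" for s in states[first_noisy:]))

for mode in MODES:
    assert clean_before_noisy(mode)
print(build_sequence("forward_dynamics"))
```

Note that forward dynamics and robot policy differ only in whether the action segment is clean or noisy, and in where it sits relative to the future video.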

For action modeling specifically, Cosmos 3 encodes actions as pseudo-actions: relative SE(3) poses between consecutive frames, using the 6D rotation representation (so each pose is 9D: a 3D translation plus a 6D rotation). A single-arm robot maps to a 9D ego pose + 9D end-effector pose + 1D gripper = 19D vector. A humanoid with two arms maps to 9D ego + 9D×2 wrists + 15D×2 finger positions = 57D. Each embodiment type gets its own input/output projection matrices while the MoT backbone weights are shared across embodiments, enabling zero-shot transfer across robot morphologies.
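A minimal sketch of how such a pseudo-action vector could be assembled, in plain Python. It assumes the common 6D rotation convention (the first two columns of the rotation matrix) and a translation-then-rotation ordering inside each 9D pose; the report's exact ordering and frame conventions are not given here, and all function names are illustrative.

```python
import math

def rot_z(theta):
    """Rotation matrix for a rotation of theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def relative_pose_9d(r_prev, t_prev, r_next, t_next):
    """Pose of frame k+1 expressed in frame k: 3D translation + 6D rotation."""
    r_prev_inv = transpose(r_prev)  # the inverse of a rotation is its transpose
    r_rel = matmul(r_prev_inv, r_next)
    t_rel = matvec(r_prev_inv, [b - a for a, b in zip(t_prev, t_next)])
    # 6D rotation (assumed convention): first two columns of the rotation matrix.
    rot6d = [r_rel[i][0] for i in range(3)] + [r_rel[i][1] for i in range(3)]
    return t_rel + rot6d

identity = rot_z(0.0)
origin = [0.0, 0.0, 0.0]
ego = relative_pose_9d(identity, origin, identity, origin)               # no ego motion
ee = relative_pose_9d(identity, origin, rot_z(math.pi / 2), [0.1, 0.0, 0.0])
action = ego + ee + [1.0]           # single arm: ego + end effector + gripper
humanoid_dim = 9 + 9 * 2 + 15 * 2   # ego + two wrists + two sets of fingers

print(len(action), humanoid_dim)  # 19 57
```

An embodiment-specific linear projection would then map this 19D (or 57D) vector into the shared latent width; that projection is omitted here.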

05 — Aligning All Modalities in Time

When One Position Unit Must Mean One Physical Second

Video at 24 FPS and video at 30 FPS are both "video", but their tokens arrive at different rates. Audio produces 25 tokens per second. Action trajectories might sample at 10 Hz for a robot or 30 Hz for a vehicle. If the position embedding just counts token indices, a 30 FPS video would be temporally "stretched" compared to a 24 FPS video of the same scene, and audio tokens would land at completely wrong temporal positions relative to the video events they're supposed to accompany.

Cosmos 3 solves this with FPS modulation: instead of advancing the temporal position by 1 per token, it advances by δt = TPS_base / TPS, where TPS_base = 6 (24 FPS ÷ 4× compression) and TPS is the actual temporal sampling rate of each modality.

FPS modulation: temporal position increment
  δt        = TPS_base / TPS       (temporal step size per token)
  TPS_video = FPS ÷ 4              (4× temporal compression by the VAE)
  TPS_audio = 48000 ÷ 1920 = 25    (48 kHz, hop = 1920)
  TPS_base  = 24 ÷ 4 = 6           (24 FPS is the canonical frame rate)

[Figure: temporal alignment. Tokens at different frame rates map to the same physical timeline.]

The result: at any frame rate, one second of real-world time always spans exactly 6 position units. A thunderclap at t=1.2s in the audio stream lands at position 7.2, and the video frame at t=1.2s also lands at position 7.2, whether the video runs at 16 or 30 FPS. This alignment is what lets the model learn that a lightning bolt in frame N causes thunder at audio token N+k, and then reproduce that timing in generation.
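The 7.2 figure is easy to check numerically. The sketch below applies the increment rule directly; the function names are illustrative, and it treats the token index as a continuous quantity (at 16 FPS, t=1.2s falls between two latent frames).

```python
TPS_BASE = 24 / 4          # 6 position units per second
AUDIO_TPS = 48000 / 1920   # 25 audio tokens per second

def video_tps(fps, temporal_compression=4):
    """Latent video tokens per second after 4x temporal compression."""
    return fps / temporal_compression

def position(token_index, tps):
    """Temporal position of a token: index times the step TPS_base / TPS."""
    return token_index * (TPS_BASE / tps)

def position_at_time(seconds, tps):
    """Position of whatever token sits at a given physical time."""
    return position(seconds * tps, tps)

streams = [
    ("audio", AUDIO_TPS),
    ("video @ 16 FPS", video_tps(16)),
    ("video @ 24 FPS", video_tps(24)),
    ("video @ 30 FPS", video_tps(30)),
]
for name, tps in streams:
    print(name, round(position_at_time(1.2, tps), 6))  # 7.2 for every stream
```

Because the step size is inversely proportional to the token rate, the rate cancels out and position depends only on physical time: position = 6 × seconds.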

06 — Results

State of the Art Across Five Task Families

Cosmos 3 is evaluated on an unusually broad suite — 48 reasoning benchmarks, plus image generation, video generation, audio synchronization, physical dynamics, and robot policy. The key question is whether the unified architecture loses ground to specialists. The answer, consistently, is no.

[Figure: Cosmos 3 vs. best open-source and closed models. Scores by benchmark, higher is better.]

T2V Video Generation: PAIBench-G Overall Score
  Cosmos3-Super (ours)          80.0
  Veo-3.1 (closed†)             79.1
  Wan2.2-A14B                   78.0
  Cosmos3-Nano (ours)           79.4

Physical Plausibility: Physics-IQ I2V Score
  Cosmos3-Super (ours)          43.8
  Sora2 (closed†)               42.3
  Wan2.2-A14B                   38.3

Human Egocentric I2V: Human World Bench Score
  Cosmos3-Super (ours)          71.9
  Veo-3.1 (closed†)             67.8
  Wan2.2-A14B                   60.7

Image Generation: UniGenBench-All Score
  Cosmos3-Super-T2I (ours)      91.36
  Gemini 3 Pro Image†           90.69
  FLUX.2-dev                    87.60

Robot Policy: RoboLab Success Rate, Specific Instructions
  Cosmos3-Nano-Policy (ours)    39.7%
  π0.5 (Physical Intel.)        28.1%
  DreamZero                     25.2%

Three results stand out. On Human World Bench, the most demanding egocentric manipulation benchmark, where annotators judge instruction-following and physical plausibility frame by frame, Cosmos3-Super scores 71.9, 4.1 points ahead of the closed-source Veo-3.1 (67.8). On Physics-IQ, which tests whether generated physics matches real-world outcomes rather than just perceptual quality, Cosmos3-Super (43.8) edges out Sora2 (42.3). On robot policy, the Cosmos3-Nano-Policy-DROID model ranked #1 on RoboArena, the real-world crowdsourced robot benchmark, overtaking π0.5 and all other open models.

The key insight from the ablations: models initialized from the mid-trained checkpoint (MT-init), which includes action data from diverse embodiments, consistently outperform those initialized from the pre-trained checkpoint alone (PT-init). After 500 iterations of fine-tuning on a new robot embodiment (LIBERO), MT-init reaches 24.6% success while PT-init stays at 0.0%. This is the benefit of the unified world prior: the model has already learned that actions cause visual changes, and it transfers this knowledge to new robots much faster.