Two failed strategies for multi-task RL
RL has supercharged single-task diffusion fine-tuning. Feed in a reward signal — aesthetic score, OCR correctness, compositional faithfulness — and the model learns to maximize it. The problem: users want all three at once. A single generated image should look beautiful and render text correctly and place the pizza correctly to the right of the suitcase.
Two natural strategies both fail in practice:
- Joint RL — optimize all rewards simultaneously. Gradient directions conflict: making images more aesthetic can degrade OCR accuracy. Harder tasks are swamped by easier ones. Convergence slows dramatically.
- Cascade RL — optimize tasks sequentially, one stage at a time. Works per-stage, but the model catastrophically forgets earlier tasks as it trains on later ones. Also the slowest strategy (148+ GPU hours for three tasks).
The core tension is easiest to see in vector form. In Joint RL, the aesthetic gradient and the OCR gradient pull the model in different directions, so the net update lands somewhere suboptimal for both. In Cascade RL, the model specializes sequentially but forgets as it goes. DiffusionOPD sidesteps both problems: each teacher trains cleanly in its own domain, then the student learns from all of them simultaneously via its own denoising rollouts.
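To make the gradient-conflict picture concrete, here is a toy sketch in plain Python. The two 2-D gradient vectors are invented for illustration and are not measured from any model; the point is only that a negative cosine between task gradients leaves the summed update partially misaligned with each task.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical per-task gradients in a 2-D parameter space (illustrative values).
g_aesthetic = [1.0, 0.2]
g_ocr = [-0.6, 1.0]

# Joint RL follows the sum of the per-task gradients.
g_joint = [a + b for a, b in zip(g_aesthetic, g_ocr)]

conflict = cosine(g_aesthetic, g_ocr)           # negative: the tasks disagree
align_aesthetic = cosine(g_joint, g_aesthetic)  # below 1: joint step is off-axis
align_ocr = cosine(g_joint, g_ocr)

print(f"cos(aesthetic, ocr)   = {conflict:+.3f}")
print(f"cos(joint, aesthetic) = {align_aesthetic:+.3f}")
print(f"cos(joint, ocr)       = {align_ocr:+.3f}")
```

Neither task gets a step along its own steepest direction, which is the "suboptimal for both" outcome described above.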
Real numbers on the failure modes
A specialist teacher trained only for GenEval reaches a GenEval score of 0.96, but its OCR score drops to 0.40 and its aesthetic score barely moves. A specialist OCR teacher inverts the picture. You can't just pick one.
| Model | GenEval | OCR | Aesthetic | Average |
|---|---|---|---|---|
| GenEval Teacher (specialist) | 0.96 | 0.40 | 5.24 | 0.473 |
| OCR Teacher (specialist) | 0.65 | 0.93 | 5.26 | 0.550 |
| Aesthetics Teacher (specialist) | 0.49 | 0.59 | 6.22 | 0.698 |
| DiffusionOPD (unified) | 0.96 | 0.94 | 6.15 | 0.929 |
What is OPD and why does it work for LLMs?
On-Policy Distillation (OPD) was developed for language models to address exactly the same tension: you want a single student model that can do multiple things well, but joint RL from scratch is chaotic. The insight is clean: decouple exploration from integration.
OPD's recipe in three moves:
The key property is that the KL between discrete token distributions has a closed form — it's just a sum over the vocabulary. This lets the gradient flow directly through the loss without any Monte Carlo sampling. That's the baseline that DiffusionOPD needs to lift to continuous denoising processes.
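The closed-form token-level KL can be sketched in a few lines. This is a minimal illustration assuming the loss is KL(student ‖ teacher) at a single token position; the direction of the KL and the function names are assumptions, not details taken from the OPD literature.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position: an exact sum over the vocabulary."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

# Toy 4-token vocabulary (illustrative logits).
student_logits = [2.0, 0.5, -1.0, 0.0]
teacher_logits = [1.0, 1.5, -0.5, 0.0]
kl_value = token_kl(student_logits, teacher_logits)
kl_self = token_kl(student_logits, student_logits)
print(kl_value, kl_self)
```

Because every vocabulary entry is enumerated, no token needs to be sampled to evaluate the loss at a given position.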
Train specialists, then distill in parallel
DiffusionOPD adapts OPD's recipe to the diffusion setting with a two-stage training paradigm that cleanly separates single-task exploration from multi-task integration.
Stage 1: Independent specialists
Each teacher is trained with the best-performing RL algorithm for its task. GenEval uses DiffusionNFT (fast convergence on compositional rewards). OCR and Aesthetics use GRPO-Guard. Critically, the teachers never see each other's gradients — there's no cross-task interference at all in Stage 1.
Stage 2: On-policy distillation
The student is initialized from the same pretrained SD3.5-Medium checkpoint. For each training round, it rolls out its own full 40-step denoising trajectory for a batch of prompts. All three frozen teachers evaluate this trajectory and provide per-step supervision via the closed-form KL objective. A single backward pass after accumulating all three task losses keeps each gradient update balanced.
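The shape of one Stage 2 round can be sketched with a deliberately tiny toy. Everything here is an assumption made for illustration: the "models" are scalar velocity coefficients, each task's rollout is scored by that task's teacher, states are held fixed when differentiating, and the learning rate is arbitrary. Only the 40-step rollout and the accumulate-then-update pattern come from the text.

```python
NUM_STEPS = 40            # denoising steps per rollout, as in the text
DT = 1.0 / NUM_STEPS
LR = 0.05                 # hypothetical learning rate

# Frozen toy teachers: one hypothetical velocity coefficient per task.
teachers = {"geneval": -1.0, "ocr": -0.5, "aesthetic": -2.0}
# Toy student: one coefficient per task condition, all from the same init.
student = {task: 0.0 for task in teachers}

def velocity(coef, x):
    return coef * x

def distill_round(student, teachers, x0=1.0):
    grads = {task: 0.0 for task in student}
    total_loss = 0.0
    for task, teacher_coef in teachers.items():
        x = x0
        for _ in range(NUM_STEPS):
            v_s = velocity(student[task], x)
            v_t = velocity(teacher_coef, x)   # teacher scores the student's own state
            diff = v_s - v_t
            total_loss += diff * diff         # per-step squared velocity distance
            grads[task] += 2.0 * diff * x     # gradient w.r.t. the coefficient, state fixed
            x = x + v_s * DT                  # on-policy: follow the student, not the teacher
    # One update, applied only after all three task losses are accumulated.
    for task in student:
        student[task] -= LR * grads[task] / NUM_STEPS
    return total_loss

losses = [distill_round(student, teachers) for _ in range(200)]
print(losses[0], losses[-1], student)
```

In this toy the student coefficients drift toward the teacher coefficients while the rollouts are always generated by the student itself, which is the on-policy property the stage relies on.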
Why diffusion KLs are just mean-matching
In discrete token generation, the per-step KL is a sum over vocabulary — trivially differentiable. For continuous diffusion denoising, this isn't obvious. DiffusionOPD's key theoretical result: the per-step KL between student and teacher reduces to a simple squared distance between their predicted velocities.
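The derivation proceeds in two steps. First, both student and teacher define Gaussian transition kernels at each denoising step: the next state is a Gaussian centered on a deterministic mean. Second, both Gaussians share exactly the same covariance, because it depends only on the noise scheduler and not on the model's output. For two Gaussians with the same covariance Σ, the KL divergence collapses to a scaled squared distance between the means:

$$D_{\mathrm{KL}}\big(\mathcal{N}(\mu_S, \Sigma)\,\|\,\mathcal{N}(\mu_T, \Sigma)\big) = \tfrac{1}{2}\,(\mu_S - \mu_T)^\top \Sigma^{-1} (\mu_S - \mu_T)$$

With an isotropic covariance Σ = σ̄² I, this is ‖µS − µT‖² / (2σ̄²).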
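The shared-covariance case is easy to check numerically. The sketch below compares the general KL between diagonal Gaussians with the mean-matching shortcut; the vectors and the value of sigma are arbitrary test values, and the function names are made up for this example.

```python
import math

def gaussian_kl_diag(mu_p, var_p, mu_q, var_q):
    """KL(N(mu_p, diag(var_p)) || N(mu_q, diag(var_q))) for diagonal covariances."""
    return 0.5 * sum(
        vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )

def mean_matching(mu_s, mu_t, sigma):
    """Squared distance between means, scaled by 1 / (2 * sigma^2)."""
    return sum((s - t) ** 2 for s, t in zip(mu_s, mu_t)) / (2.0 * sigma ** 2)

mu_student = [0.3, -1.2, 0.8]
mu_teacher = [0.1, -1.0, 1.5]
sigma = 0.7
shared_var = [sigma ** 2] * len(mu_student)

kl_general = gaussian_kl_diag(mu_student, shared_var, mu_teacher, shared_var)
kl_shortcut = mean_matching(mu_student, mu_teacher, sigma)
print(kl_general, kl_shortcut)
```

When the variances match, the trace and log-determinant terms cancel and only the mean term survives, so the two numbers agree.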
Why not use PPO here?
An alternative would be to treat the teacher as a per-step process reward model and optimize a PPO-style surrogate. The authors show that the expected gradient of the PPO objective equals the closed-form gradient, but PPO adds an extra stochastic term proportional to the sampled Gaussian noise. For a Gaussian transition aⱼ = µS + σ̄ εⱼ, the score function is ∇ log p(aⱼ) = (εⱼ/σ̄) ∇µS, so every sampled εⱼ enters the gradient estimate directly. The estimator is unbiased, but this term has nonzero variance, and it is absent entirely from the closed-form approach.
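The bias and variance claim can be illustrated with a one-dimensional toy. The sketch below uses a plain score-function (REINFORCE-style) estimator of the KL gradient as a stand-in for the PPO surrogate; it is not the authors' PPO setup, and the scalar "policy" with mean theta is an assumption made for the example.

```python
import random

def closed_form_grad(theta, mu_t, sigma):
    # d/dtheta of (theta - mu_t)^2 / (2 sigma^2)
    return (theta - mu_t) / sigma ** 2

def score_function_grads(theta, mu_t, sigma, n, seed=0):
    rng = random.Random(seed)
    grads = []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        a = theta + sigma * eps                       # sampled transition
        # log p_student(a) - log p_teacher(a), same sigma for both
        log_ratio = ((a - mu_t) ** 2 - (a - theta) ** 2) / (2.0 * sigma ** 2)
        grads.append(log_ratio * eps / sigma)         # score term is eps / sigma
    return grads

theta, mu_t, sigma = 1.0, 0.0, 0.5
exact = closed_form_grad(theta, mu_t, sigma)
samples = score_function_grads(theta, mu_t, sigma, n=100_000)
mc_mean = sum(samples) / len(samples)
mc_var = sum((g - mc_mean) ** 2 for g in samples) / len(samples)
print(f"closed form: {exact:.3f}  sampled mean: {mc_mean:.3f}  sampled variance: {mc_var:.1f}")
```

The sampled mean matches the closed-form gradient up to Monte Carlo error, while the per-sample variance is large; the closed-form gradient has no such sampling term at all.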
There's a second advantage: the closed-form loss works equally well for deterministic ODE samplers (noise level = 0), where PPO has no valid policy density to compute log probabilities from. DiffusionOPD unifies both SDE and ODE sampling under one training objective.
Lower noise = faster convergence
During Stage 2 distillation, the student rolls out trajectories using a stochastic SDE sampler with a global noise level a. This controls how much randomness is injected at each denoising step. The authors run a careful ablation across three noise levels, measuring how quickly the student's reward score improves during training.
The pattern is stark: ODE (noise=0) is up to 5× more sample-efficient than SDE (noise=0.7). The intuition is that high noise during rollout makes the student's trajectory more erratic — the closed-form KL loss is still exact, but the variance of the trajectory estimates increases, slowing convergence. Reducing noise to zero (pure ODE) makes each rollout fully deterministic, giving the cleanest possible supervision signal.
This connects naturally to the theoretical analysis: the closed-form KL objective is valid for any noise level, but the gradient variance due to stochastic rollouts is minimized when the trajectory itself is deterministic.
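A toy rollout shows how the noise level controls trajectory spread. The sampler form below (Euler step plus `noise * sqrt(dt) * eps`), the linear velocity field, and the intermediate level 0.3 are all assumptions chosen for illustration; only the endpoints 0 and 0.7 appear in the text.

```python
import math
import random
import statistics

def rollout(noise, rng, steps=40, x0=1.0):
    dt = 1.0 / steps
    x = x0
    for _ in range(steps):
        v = -x  # hypothetical velocity field
        x = x + v * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x

def endpoint_spread(noise, n_rollouts=200, seed=0):
    rng = random.Random(seed)
    finals = [rollout(noise, rng) for _ in range(n_rollouts)]
    return statistics.pstdev(finals)

# 0.0 and 0.7 are from the text; 0.3 is a made-up intermediate level.
spreads = {noise: endpoint_spread(noise) for noise in (0.0, 0.3, 0.7)}
for noise, spread in spreads.items():
    print(f"noise={noise}: endpoint std = {spread:.4f}")
```

With noise set to zero every rollout from the same start is identical, so any remaining gradient variance comes only from the choice of prompts and initial latents.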
Best multi-task average across the benchmarks
DiffusionOPD achieves the best multi-task average score (0.929) across all evaluated methods while requiring substantially less training compute than the nearest baseline: Cascade NFT reaches a 0.851 average in 148 GPU hours, versus ~97 GPU hours for DiffusionOPD.
Full benchmark breakdown
| Model | GPU hrs | GenEval | OCR | Aesthetic | ImgRwd | Average |
|---|---|---|---|---|---|---|
| SD3.5-M + CFG | — | 0.63 | 0.59 | 5.36 | 0.85 | 0.484 |
| Multi-Task NFT | 128 | 0.95 | 0.96 | 5.41 | 1.08 | 0.715 |
| Multi-Task GRPO-Guard | 130 | 0.89 | 0.94 | 5.61 | 1.31 | 0.763 |
| Cascade NFT | 148 | 0.94 | 0.91 | 6.01 | 1.49 | 0.851 |
| DiffusionOPD (Ours) | ~97 | 0.96 | 0.94 | 6.15 | 1.50 | 0.929 |
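The compute comparison against Cascade NFT is simple arithmetic on the table values. The snippet below treats the approximate "~97" GPU hours as exactly 97, which is an assumption.

```python
# Values copied from the table; 97 stands in for the approximate "~97".
gpu_hours = {"Cascade NFT": 148, "DiffusionOPD": 97}
average = {"Cascade NFT": 0.851, "DiffusionOPD": 0.929}

compute_saving = 1 - gpu_hours["DiffusionOPD"] / gpu_hours["Cascade NFT"]
average_gain = average["DiffusionOPD"] - average["Cascade NFT"]

print(f"GPU-hour reduction: {compute_saving:.1%}")
print(f"Average score gain: {average_gain:+.3f}")
```

Under that assumption, DiffusionOPD uses roughly a third fewer GPU hours than Cascade NFT while scoring 0.078 higher on the average.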
Distillation method comparison
DiffusionOPD also outperforms alternative ways to distill from the same set of specialist teachers. The comparison is controlled: all methods receive identical teacher models and training data, and all use on-policy student rollouts except SFT (which trains on teacher-generated images).
The closed-form OPD objective consistently reaches the highest final performance, and reaches it fastest. SFT lags behind because it trains on teacher-generated images (off-policy), so the student never learns from the states it would actually visit. DMD and TDM use on-policy rollouts, but their objectives introduce additional variance compared to direct mean-matching.