DiffusionOPD: Multi-Task RL for Diffusion via On-Policy Distillation

01 — The Multi-Task Trap

Two failed strategies for multi-task RL

RL has supercharged single-task diffusion fine-tuning. Feed in a reward signal — aesthetic score, OCR correctness, compositional faithfulness — and the model learns to maximize it. The problem: users want all three at once. A single generated image should look beautiful and render text correctly and place the pizza correctly to the right of the suitcase.

Two natural strategies both fail in practice:

Joint RL — optimize all rewards simultaneously. Gradient directions conflict: making images more aesthetic can degrade OCR accuracy. Harder tasks are swamped by easier ones. Convergence slows dramatically.
Cascade RL — optimize tasks sequentially, one stage at a time. Works per-stage, but the model catastrophically forgets earlier tasks as it trains on later ones. Also the slowest strategy (148+ GPU hours for three tasks).

Figure (gradient_conflict.canvas): the failure mode of each strategy.

The canvas above shows the core tension in vector form. In Joint RL, the aesthetic gradient and OCR gradient pull the model in different directions — the net update lands somewhere suboptimal for both. In Cascade RL, the model specializes sequentially but forgets as it goes. DiffusionOPD sidesteps both problems: each teacher trains cleanly in its own domain, then the student learns from all of them simultaneously via its own denoising rollouts.
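The gradient conflict described above can be made concrete with a toy calculation. This is only an illustrative sketch: the two 2-D gradient vectors are made-up values, not measurements from any model. A negative cosine similarity means the two rewards pull apart, and the cosine between each task gradient and the summed (joint) gradient shows how little of its own best-case first-order progress each task gets.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical reward gradients (ascent directions); illustrative values only.
g_aesthetic = [1.0, 0.2]
g_ocr = [-0.8, 0.6]

# Joint RL follows the sum of the per-task gradients.
g_joint = [a + b for a, b in zip(g_aesthetic, g_ocr)]

# Negative cosine: improving one reward tends to hurt the other.
conflict = cosine(g_aesthetic, g_ocr)

# Alignment of each task's gradient with the joint update (1.0 would be ideal).
align_aesthetic = cosine(g_aesthetic, g_joint)
align_ocr = cosine(g_ocr, g_joint)

print(f"conflict={conflict:.2f}, "
      f"aesthetic alignment={align_aesthetic:.2f}, ocr alignment={align_ocr:.2f}")
```

With these particular vectors, both alignments fall well below 1, which is the "suboptimal for both" outcome in miniature.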

Real numbers on the failure modes

A specialist teacher trained only for GenEval reaches 0.96 GenEval score — but its OCR score drops to 0.40 and aesthetics barely moves. A specialist OCR teacher inverts the picture. You can't just pick one.

Model	GenEval	OCR	Aesthetic	Average
GenEval Teacher (specialist)	0.96	0.40	5.24	0.473
OCR Teacher (specialist)	0.65	0.93	5.26	0.550
Aesthetics Teacher (specialist)	0.49	0.59	6.22	0.698
DiffusionOPD (unified)	0.96	0.94	6.15	0.929

02 — On-Policy Distillation

What is OPD and why does it work for LLMs?

On-Policy Distillation (OPD) was developed for language models to address exactly the same tension: you want a single student model that can do multiple things well, but joint RL from scratch is chaotic. The insight is clean: decouple exploration from integration.

OPD's recipe in three moves:

Student generates its own sequence

The student model autoregressively samples a full response from its current policy — not a replay buffer, not a reference model's outputs. On-policy means the student visits the states it would actually encounter at test time.

x = ["The", "cat", "sat", ...] # student-generated

Teacher scores every prefix

At each token position t, the frozen teacher model sees the same prefix x<t and outputs its full next-token distribution. The student gets dense supervision at every decoding step — not just a final reward.

KL(student(· | "The") ‖ teacher(· | "The"))

Minimize the sum of per-step KLs

The total objective is the sum of per-step reverse-KL divergences along the student's own trajectory. This is analytically differentiable — no REINFORCE, no high-variance policy gradient estimates.

L = Σ KL(πθ(·|x<t) ‖ π*(·|x<t))


The key property is that the KL between discrete token distributions has a closed form: it's just a sum over the vocabulary. The per-step divergence can therefore be differentiated exactly, with no Monte Carlo estimate of the KL itself; the only sampling is the student's own rollout. That's the baseline DiffusionOPD needs to lift to continuous denoising processes.

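OPD Objective (LLM)

$$\mathcal{L}_{\mathrm{OPD}}(\theta) = \mathbb{E}_{x \sim \pi_\theta}\left[\sum_t \mathrm{KL}\big(\pi_\theta(\cdot \mid x_{<t}) \,\|\, \pi^*(\cdot \mid x_{<t})\big)\right]$$

The KL between discrete distributions has a closed form, so the per-step term carries no variance.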
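As a minimal sketch of the objective for a single sampled sequence, the per-step reverse KL can be computed directly from the two next-token distributions and summed over positions. The three-token vocabulary and the probabilities below are made up for illustration, and the helper names are hypothetical.

```python
import math

def reverse_kl(p_student, p_teacher):
    """KL(student || teacher) over a shared vocabulary.

    Assumes the teacher assigns nonzero probability wherever the student does.
    """
    return sum(ps * math.log(ps / pt)
               for ps, pt in zip(p_student, p_teacher) if ps > 0.0)

def opd_loss(student_dists, teacher_dists):
    """Sum of per-step reverse KLs along one student-generated sequence."""
    return sum(reverse_kl(ps, pt) for ps, pt in zip(student_dists, teacher_dists))

# One next-token distribution per decoding step (toy numbers).
student_dists = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
teacher_dists = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]

loss = opd_loss(student_dists, teacher_dists)
print(f"OPD loss for this sequence: {loss:.4f}")
```

Only the first step contributes here, because the student already matches the teacher at the second step. In a real setup the distributions would come from model logits and the loss would be averaged over many student rollouts.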

03 — Two-Stage Architecture

Train specialists, then distill in parallel

DiffusionOPD adapts OPD's recipe to the diffusion setting with a two-stage training paradigm that cleanly separates single-task exploration from multi-task integration.

Figure (diffusion_opd_architecture.svg): Stage 1 trains specialists; Stage 2 distills them in parallel.

Stage 1: Independent specialists

Each teacher is trained with the best-performing RL algorithm for its task. GenEval uses DiffusionNFT (fast convergence on compositional rewards). OCR and Aesthetics use GRPO-Guard. Critically, the teachers never see each other's gradients — there's no cross-task interference at all in Stage 1.

Stage 2: On-policy distillation

The student is initialized from the same pretrained SD3.5-Medium checkpoint. For each training round, it rolls out its own full 40-step denoising trajectory for a batch of prompts. All three frozen teachers evaluate this trajectory and provide per-step supervision via the closed-form KL objective. A single backward pass after accumulating all three task losses keeps each gradient update balanced.
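The shape of one Stage 2 round can be sketched with a deliberately tiny stand-in. Everything here is an assumption made for illustration: the "models" are 1-D linear velocity fields with hypothetical weights, the rollout is a plain Euler ODE, each task's teacher scores the student's rollouts on that task's prompts, visited states are treated as constants when differentiating, and the per-step loss is the squared velocity gap with constant scale factors dropped. Only the 40-step rollout and the "accumulate all task losses, then update once" structure come from the text.

```python
STEPS = 40              # denoising steps per rollout, as in the text
DT = 1.0 / STEPS
LR = 0.2                # hypothetical learning rate

# Frozen toy teachers: velocity = w * x (hypothetical weights).
TEACHER_W = {"geneval": -1.0, "ocr": -0.5, "aesthetic": -2.0}

def rollout(w, x0):
    """Euler rollout of the student's own velocity field; returns visited states."""
    states, x = [], x0
    for _ in range(STEPS):
        states.append(x)
        x = x + w * x * DT
    return states

def distill_step(student_w, prompts):
    """Accumulate every task's loss and gradient, then apply a single update."""
    grads = {task: 0.0 for task in student_w}
    total_loss = 0.0
    for task, x0 in prompts:
        for x in rollout(student_w[task], x0):   # on-policy states, held constant
            diff = (student_w[task] - TEACHER_W[task]) * x   # v_student - v_teacher
            total_loss += diff * diff
            grads[task] += 2.0 * diff * x
    for task in student_w:
        student_w[task] -= LR * grads[task] / STEPS
    return total_loss

student_w = {task: 0.0 for task in TEACHER_W}                 # shared starting point
prompts = [("geneval", 1.0), ("ocr", -1.0), ("aesthetic", 0.5)]  # (task, initial state)
losses = [distill_step(student_w, prompts) for _ in range(200)]
print(f"loss: {losses[0]:.2f} -> {losses[-1]:.6f}")
```

In this toy the per-task weights do not interact, so it shows only the control flow; a real student shares parameters across tasks, which is where accumulating all task losses before the backward pass matters.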

04 — Closed-Form KL Trick

Why diffusion KLs are just mean-matching

In discrete token generation, the per-step KL is a sum over vocabulary — trivially differentiable. For continuous diffusion denoising, this isn't obvious. DiffusionOPD's key theoretical result: the per-step KL between student and teacher reduces to a simple squared distance between their predicted velocities.

The derivation proceeds in two steps. First: both student and teacher define Gaussian transition kernels at each denoising step — the next state is a Gaussian centered on a deterministic mean. Second: both Gaussians share exactly the same covariance (it depends only on the noise scheduler, not on the model's output). For two Gaussians with the same covariance:

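KL between same-covariance Gaussians

$$\mathrm{KL}\big(\mathcal{N}(\mu_1, \sigma^2 I) \,\|\, \mathcal{N}(\mu_2, \sigma^2 I)\big) = \frac{\|\mu_1 - \mu_2\|^2}{2\sigma^2}$$

The covariance terms cancel, so the KL is just the squared distance between the means, divided by 2σ².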
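For readers who want the missing step, the same-covariance identity follows from the standard KL formula for two d-dimensional Gaussians. The last line is a sketch under an extra assumption not stated in the text: that each step's mean is affine in the predicted velocity, with hypothetical scheduler-dependent coefficients a_t and b_t shared by student and teacher.

```latex
\begin{aligned}
\mathrm{KL}\big(\mathcal{N}(\mu_1,\Sigma_1)\,\|\,\mathcal{N}(\mu_2,\Sigma_2)\big)
  &= \tfrac12\Big[\operatorname{tr}\!\big(\Sigma_2^{-1}\Sigma_1\big) - d
     + (\mu_2-\mu_1)^{\top}\Sigma_2^{-1}(\mu_2-\mu_1)
     + \ln\frac{\det\Sigma_2}{\det\Sigma_1}\Big] \\
\text{with } \Sigma_1=\Sigma_2=\sigma^2 I:\quad
  &= \tfrac12\Big[d - d + \frac{\|\mu_1-\mu_2\|^2}{\sigma^2} + 0\Big]
   = \frac{\|\mu_1-\mu_2\|^2}{2\sigma^2} \\
\text{assuming } \mu = a_t x_t + b_t v:\quad
  \mu_S - \mu_T &= b_t\,(v_S - v_T)
  \;\Longrightarrow\;
  \mathrm{KL} = \frac{b_t^2}{2\sigma^2}\,\|v_S - v_T\|^2
\end{aligned}
```

Under that assumption the per-step KL is a scheduler-dependent constant times the squared gap between the student's and teacher's predicted velocities, evaluated at the same state.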

Figure (gaussian_kl.canvas): student Gaussian with mean µS and teacher Gaussian with mean µT; KL = ‖µS − µT‖² / 2σ².

Why not use PPO here?

An alternative would be to treat the teacher as a per-step process reward model and optimize a PPO-style surrogate. The authors show that the expected gradient of the PPO objective equals the closed-form gradient, but PPO adds an extra stochastic term proportional to the sampled Gaussian noise. For a Gaussian transition aⱼ = µS + σ̄ⱼ εⱼ, the score function is (εⱼ/σ̄ⱼ) · ∇θµS, so every sampled εⱼ enters the gradient estimate. The extra term is zero in expectation, so the estimator stays unbiased, but it contributes variance that the closed-form approach never incurs.

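PPO gradient = closed-form gradient + noise term

$$\nabla \mathcal{L}_{\mathrm{PPO}} = \nabla \mathcal{L}_{\text{closed-form}} + \Delta_j(\theta) \cdot \frac{\varepsilon_j}{\bar{\sigma}_j} \cdot \nabla_\theta \mu_S$$

The ε noise term is unbiased but adds variance; the closed-form objective drops it entirely.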

There's a second advantage: the closed-form loss works equally well for deterministic ODE samplers (noise level = 0), where PPO has no valid policy density to compute log probabilities from. DiffusionOPD unifies both SDE and ODE sampling under one training objective.
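The variance argument can be checked numerically on a 1-D toy. The sketch below is not the authors' PPO surrogate: it uses a generic single-sample score-function (REINFORCE-style) estimator of the same KL gradient as a stand-in, with the student mean equal to a scalar parameter theta and made-up values for the teacher mean and sigma. Its sample mean should agree with the closed-form gradient while its variance stays clearly above zero.

```python
import random

def closed_form_grad(theta, mu_teacher, sigma):
    """d/dtheta of (theta - mu_teacher)^2 / (2 sigma^2); exact, no sampling."""
    return (theta - mu_teacher) / sigma**2

def score_function_grad(theta, mu_teacher, sigma, rng):
    """Single-sample score-function estimate of the same gradient."""
    eps = rng.gauss(0.0, 1.0)
    a = theta + sigma * eps                      # sampled Gaussian transition
    # log p_student(a) - log p_teacher(a); the normalizers cancel
    log_ratio = ((a - mu_teacher) ** 2 - (a - theta) ** 2) / (2 * sigma**2)
    score = eps / sigma                          # d/dtheta of log p_student(a)
    return log_ratio * score

rng = random.Random(0)
theta, mu_teacher, sigma = 1.0, 0.0, 1.0         # toy values
exact = closed_form_grad(theta, mu_teacher, sigma)

samples = [score_function_grad(theta, mu_teacher, sigma, rng) for _ in range(200_000)]
mc_mean = sum(samples) / len(samples)
mc_var = sum((s - mc_mean) ** 2 for s in samples) / len(samples)

print(f"closed form: {exact:.3f}  sampled mean: {mc_mean:.3f}  sampled variance: {mc_var:.2f}")
```

The closed-form gradient is a single deterministic number for a given state, so the comparison isolates the cost of routing the same quantity through sampled noise.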

05 — Noise Level Ablation

Lower noise = faster convergence

During Stage 2 distillation, the student rolls out trajectories using a stochastic SDE sampler with a global noise level a. This controls how much randomness is injected at each denoising step. The authors run a careful ablation across three noise levels, measuring how quickly the student's reward score improves during training.

Figure (noise_ablation.canvas): convergence curves for each SDE noise level a.

The pattern is stark: ODE (noise=0) is up to 5× more sample-efficient than SDE (noise=0.7). The intuition is that high noise during rollout makes the student's trajectory more erratic — the closed-form KL loss is still exact, but the variance of the trajectory estimates increases, slowing convergence. Reducing noise to zero (pure ODE) makes each rollout fully deterministic, giving the cleanest possible supervision signal.

This connects naturally to the theoretical analysis: the closed-form KL objective is valid for any noise level, but the gradient variance due to stochastic rollouts is minimized when the trajectory itself is deterministic.
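A toy simulation illustrates the deterministic-rollout point. It is a sketch under strong assumptions: 1-D linear velocity fields with hypothetical coefficients, a plain Euler-Maruyama update standing in for the actual SDE sampler, a fixed starting state, and the spread of the summed per-step loss across rollouts used as a rough proxy for gradient variance. Only the noise levels 0 and 0.7 and the 40-step rollout are taken from the text.

```python
import math
import random

STEPS = 40
DT = 1.0 / STEPS

def student_v(x):
    return -1.0 * x        # hypothetical student velocity field

def teacher_v(x):
    return -1.5 * x        # hypothetical teacher velocity field

def rollout_loss(x0, noise_level, rng):
    """Sum of per-step squared velocity gaps along one student rollout."""
    x, loss = x0, 0.0
    for _ in range(STEPS):
        loss += (student_v(x) - teacher_v(x)) ** 2
        x = x + student_v(x) * DT + noise_level * math.sqrt(DT) * rng.gauss(0.0, 1.0)
    return loss

def loss_variance(noise_level, n_rollouts=2000, seed=0):
    """Variance of the rollout loss across repeated rollouts from the same start."""
    rng = random.Random(seed)
    losses = [rollout_loss(1.0, noise_level, rng) for _ in range(n_rollouts)]
    mean = sum(losses) / len(losses)
    return sum((l - mean) ** 2 for l in losses) / len(losses)

var_ode = loss_variance(0.0)   # deterministic rollout
var_sde = loss_variance(0.7)   # stochastic rollout
print(f"loss variance at a=0: {var_ode:.3g}, at a=0.7: {var_sde:.3g}")
```

With a=0 every rollout from the same starting state visits the same states, so the supervision signal is identical each time; with a=0.7 the visited states, and therefore the loss, vary from rollout to rollout.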

06 — Results

Best multi-task average of all evaluated methods

DiffusionOPD achieves the best multi-task average score (0.929) across all evaluated methods — while requiring substantially less training compute than the nearest baseline, Cascade NFT (0.851 average, 148 GPU hours vs ~97 GPU hours for DiffusionOPD).

Method	Average
SD3.5-M baseline	0.484
Multi-Task NFT	0.715
Multi-Task GRPO-Guard	0.763
Cascade NFT (148 h)	0.851
DiffusionOPD (~97 h)	0.929

Full benchmark breakdown

Model	GPU hrs	GenEval	OCR	Aesthetic	ImgRwd	Average
SD3.5-M + CFG	—	0.63	0.59	5.36	0.85	0.484
Multi-Task NFT	128	0.95	0.96	5.41	1.08	0.715
Multi-Task GRPO-Guard	130	0.89	0.94	5.61	1.31	0.763
Cascade NFT	148	0.94	0.91	6.01	1.49	0.851
DiffusionOPD (Ours)	~97	0.96	0.94	6.15	1.50	0.929
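The comparison with Cascade NFT can be turned into relative numbers with a few lines of arithmetic. The inputs are taken from the table above; the DiffusionOPD figure is approximate (~97 GPU hours), so the saving is approximate too.

```python
cascade_hours, opd_hours = 148, 97       # GPU hours from the table (97 is approximate)
cascade_avg, opd_avg = 0.851, 0.929      # Average column from the table

compute_saving = 1 - opd_hours / cascade_hours   # fraction of GPU hours saved
avg_gain = opd_avg - cascade_avg                 # absolute gain in average score

print(f"GPU-hour saving: {compute_saving:.1%}")    # GPU-hour saving: 34.5%
print(f"Average-score gain: {avg_gain:+.3f}")      # Average-score gain: +0.078
```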

Distillation method comparison

DiffusionOPD also outperforms alternative ways to distill from the same set of specialist teachers. The comparison is controlled: all methods receive identical teacher models and training data, and all use on-policy student rollouts except SFT (which trains on teacher-generated images).

Method	Relative performance
SFT (teacher images)	low
DMD (distribution match)	mid
TDM (trajectory match)	good
DiffusionOPD (closed-form KL)	best

The closed-form OPD objective consistently reaches the highest performance ceiling, and gets there fastest. SFT lags behind because it trains on teacher-generated images (off-policy), so the student never learns from the states it would actually visit. DMD and TDM use on-policy rollouts, but their objectives introduce additional variance compared to direct mean-matching.