Deep dives into research papers, model architectures, and open-source implementations — with interactive visualizations and annotated source code.
A deep dive into the Gemma4 Assistant architecture: KV sharing between the backbone and draft model, centroid-based vocabulary prediction, and bidirectional attention masks — with animated visualizations and annotated source code.
Why did Stable Diffusion 3, Flux.1, and NVIDIA Cosmos ditch traditional diffusion in favor of Flow Matching and Rectified Flow? An interactive deep dive into simulation-free training, Optimal Transport paths, and the mathematics of straight-line vector fields that generate high-quality text, images, and speech in 10 steps.
Building high-quality draft models is bottlenecked by data scarcity, while native MTP co-training is locked into fixed architectures. Here is how we can recycle pre-trained MTP layers post-hoc into advanced speculative architectures like EAGLE or Medusa, gaining the best of both worlds.
For years, transforming a highly capable text model into a multimodal model acted like a lobotomy, severely degrading its coding and mathematical reasoning. Here is the engineering history of how we diagnosed the "modality tax" and the architectural breakthroughs that finally fixed it.
NVIDIA's Cosmos 3 collapses VLM + video generator + robot policy into a single 64B Mixture-of-Transformers: the same weights answer questions, synthesize video, generate audio, and predict actions — by simply changing which tokens are noisy. #1 open-source on T2I, T2V, I2V, and robot policy leaderboards.
Before training a single draft model, you can predict the optimal size: it should be ~200× smaller than the target. SDSL derives this analytically: the acceptance rate α is an affine function of draft/target perplexity (R²=0.98), and threading pre-training scaling laws through the throughput formula yields N* = 2.71×10⁻³ · M + 87M for a target with M parameters, validated on the OPT, Qwen, and LLaMA families.
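As a rough illustration of the sizing rule quoted above, the formula can be evaluated directly. This is a minimal sketch, assuming M is the target's parameter count and that the "87M" offset means 87 million parameters; the function name is hypothetical and does not come from the SDSL post.

```python
def optimal_draft_params(target_params: float) -> float:
    """Evaluate N* = 2.71e-3 * M + 87e6 for a target with M parameters."""
    slope = 2.71e-3   # coefficient quoted in the summary
    offset = 87e6     # assumption: "87M" is read as 87 million parameters
    return slope * target_params + offset


if __name__ == "__main__":
    target = 70e9  # hypothetical 70B-parameter target model
    draft = optimal_draft_params(target)
    print(f"draft ~ {draft / 1e6:.1f}M params, target/draft ~ {target / draft:.0f}x")
```

For a hypothetical 70B target this gives a draft of roughly 277M parameters, about 250× smaller, which is in the same range as the ~200× rule of thumb.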
Multi-head Latent Attention compresses keys and values into a 576-float latent vector — 57× smaller than standard MHA — using a low-rank projection and a matrix-absorption trick that eliminates key materialization entirely. The result: 5.76× faster generation and better-than-MHA quality in DeepSeek-V2.
Model FLOPs Utilization answers how much of your GPU's 312 TFLOPS you actually use during training. Learn how to calculate it from first principles, what kills it (memory bandwidth, pipeline bubbles, communication), and why 30–70% is the real-world range — with an interactive calculator and benchmarks from PaLM, Megatron-LM, and FlashAttention-2.
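A first-principles MFU estimate might look like the sketch below. It assumes the common approximation of about 6 FLOPs per parameter per training token (forward plus backward, ignoring attention FLOPs), which is not stated in the summary above; the function name and the example throughput are hypothetical.

```python
def model_flops_utilization(
    n_params: float,
    tokens_per_sec_per_gpu: float,
    peak_flops_per_gpu: float = 312e12,  # the 312 TFLOPS peak quoted above
) -> float:
    """Return MFU as achieved training FLOPs/s divided by peak FLOPs/s."""
    # Assumption: ~6 FLOPs per parameter per token (forward + backward pass).
    achieved_flops_per_sec = 6.0 * n_params * tokens_per_sec_per_gpu
    return achieved_flops_per_sec / peak_flops_per_gpu


if __name__ == "__main__":
    # Hypothetical run: 7B-parameter model at 3,000 tokens/s per GPU.
    mfu = model_flops_utilization(7e9, 3000.0)
    print(f"MFU ~ {mfu:.1%}")
```

With these made-up numbers the estimate lands near 40%, inside the 30–70% range the post describes.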
When training a Multi-Token Prediction module alongside your LLM, the L₂ norm of final_layernorm.weight starts at √d_model (e.g. 55.4 for d=3072) and tells you whether representations are healthy, over-specialized, or silently corrupting MTP gradient flow. This post shows what that number means, why it drifts, and what to do when it does.
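The √d_model starting value follows from the usual all-ones initialization of a normalization layer's weight vector, which is assumed here. The sketch below reproduces the 55.4 figure in plain Python; the function name is hypothetical, and a real check would read `final_layernorm.weight` from a checkpoint instead.

```python
import math


def layernorm_weight_l2_norm_at_init(d_model: int) -> float:
    """L2 norm of a weight vector of d_model ones, i.e. sqrt(d_model)."""
    weight = [1.0] * d_model  # assumption: weights start at 1.0
    return math.sqrt(sum(w * w for w in weight))


if __name__ == "__main__":
    norm = layernorm_weight_l2_norm_at_init(3072)
    print(f"initial L2 norm for d=3072: {norm:.1f}")
```

Drift away from this baseline during training is the signal the post goes on to interpret.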
Autoregressive drafters deliver quality but pay a sequential tax; parallel drafters are fast but miss intra-block causal dependencies. Domino decouples these: a GRU head adds lightweight causal correction on top of a block-parallel drafter, gaining 16.6% in acceptance length and 12.3% in speedup for only 2.8% extra latency, and reaching 7.92× on GSM8K.
On-policy distillation breaks when long student rollouts poison the teacher signal — accuracy falls from 65% to 51% after 300 tokens. ESR fixes it with a single line: stop at N=100 tokens, recover 24× efficiency, and often surpass the teacher entirely.
On-policy distillation locks onto the correct update direction within 10% of training—far earlier than RL. This foresight manifests as functional redundancy avoidance (OPD ignores useless layers) and early low-rank lock-in (dominant subspaces stabilize fast). EffOPD exploits both for a 3× training speedup with no extra modules.
A 1B-parameter hierarchical recurrent model trained from scratch on 40 billion tokens for $1,500 reaches the benchmark neighborhood of 2–7B models trained on 4–36 trillion tokens — using 96–432× less compute. Three co-designed ingredients: dual-timescale HRM recurrence, MagicNorm stabilization, and a response-only PrefixLM training objective.
Standard on-policy distillation compares just one sampled token per update — imbalanced, drift-prone, and corrupted by tokenizer mismatches. Teacher top-K local support matching fixes all three by comparing distributions over a teacher-selected support set, delivering +19.8% math reasoning gain with no architectural changes.
Full-attention LLMs are already intrinsically sparse. RTPurbo identifies the 15% of heads that do real long-range retrieval, routes them through a 16-dimensional indexer, and uses dynamic top-p to adapt token budgets per query — reaching near-lossless accuracy with 9.36× prefill speedup at 1M context in just ~600 training steps.
A minimal attention-only transformer implements a principled two-stage empirical Bayes denoiser — depth (Stage 1) refines a particle approximation to the unknown prior via reverse diffusion, while a long-range skip connection (Stage 2) queries it for Bayes-optimal posterior averaging. No noise schedule needed: a fixed bandwidth and T* = σ²/2 layers suffice, with convergence proved for Gaussian priors.
Train specialist teachers for aesthetics, OCR, and composition independently, then distill all three into one unified diffusion model along the student's own denoising trajectories. A closed-form KL objective replaces PPO's noisy estimator, achieving 0.929 average score vs 0.851 for Cascade NFT — in 35% less GPU time.
NVIDIA trained a 12B Mamba-Transformer on 10 trillion tokens in 4-bit NVFP4 precision—matching FP8 accuracy—using four targeted fixes: mixed precision layers, Hadamard outlier smoothing, 2D weight scaling, and stochastic gradient rounding. On Blackwell GB300 this unlocks 6× the BF16 arithmetic throughput.
Cross-architecture distillation compresses a 16B MoE diffusion LLM into a 0.6B student with 22× less memory and a 16-point HumanEval advantage over same-size AR models. Three synergistic components — TIDAL for noise-adaptive scheduling, CompDemo for richer teacher context, and Reverse CALM for cross-tokenizer alignment — each address a unique barrier that AR distillation never had to solve.
Continuous diffusion was written off after Plaid reported a 64× compute gap vs AR. RePlaid narrows it to 20× by aligning the architecture with modern discrete DLMs, and reaches 22.1 PPL on OpenWebText, better than MDLM's 23.1. Interactive visualizations show the structured token embedding geometry, the self-correcting noise schedule, and the unified scaling laws.
BitLM replaces the vocabulary softmax with an 18-bit binary code and a diffusion head that jointly denoises a block of 4 tokens at once — making parallel block-level generation native to the model's output interface, not a post-hoc trick. Scales smoothly to 8B on FineWeb-350BT with interactive denoising and binary hypercube visualizations.