Personal Research Blog

Exploring the frontier of AI & ML

Deep dives into research papers, model architectures, and open-source implementations — with interactive visualizations and annotated source code.

Featured

Gemma 4 · Multi-Token Prediction · Speculative Decoding · Interactive

How Gemma 4's Multi-Token Prediction Works

A deep dive into the Gemma4 Assistant architecture: KV sharing between the backbone and draft model, centroid-based vocabulary prediction, and bidirectional attention masks — with animated visualizations and annotated source code.

Latest Posts

Flow Matching · Rectified Flow · Multimodal · Interactive

Flow Matching: The Straight-Line Engine of Modern Multimodal GenAI

Why did Stable Diffusion 3, Flux.1, and NVIDIA Cosmos ditch traditional diffusion in favor of Flow Matching and Rectified Flow? An interactive deep dive into simulation-free training, Optimal Transport paths, and the mathematics of straight-line vector fields that generate high-quality text, images, and speech in 10 steps.
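
The straight-line idea in this teaser can be sketched in a few lines. This is a minimal illustration, assuming the usual rectified-flow setup: a noise sample `x0`, a data sample `x1`, the interpolant `x_t = (1 - t) * x0 + t * x1`, and the constant regression target `x1 - x0`. The function names and the toy 2-D data are hypothetical, and the sampler uses the oracle velocity of each pair rather than a trained network.

```python
import numpy as np

def straight_line_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, plus its velocity target."""
    t = np.asarray(t, dtype=np.float64).reshape(-1, 1)  # one t per sample, broadcast over features
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0  # constant along the whole path
    return x_t, v_target

def euler_sample(x0, velocity_fn, n_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 2))        # noise samples
x1 = rng.standard_normal((4, 2)) + 3.0  # stand-in for data samples
x_t, v_target = straight_line_pair(x0, x1, rng.uniform(size=4))

# With the oracle velocity the path is exactly straight, so 10 Euler steps land on x1.
x_end = euler_sample(x0, lambda x, t: x1 - x0, n_steps=10)
```

A trained model only approximates the average velocity over many pairs, so real samplers will not be exact in the way this toy is.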

Speculative Decoding · Multi-Token Prediction · Interactive

Recycling MTP: Converting Pre-Trained Multi-Token Prediction Modules into Advanced Speculative Drafters

Building high-quality draft models is bottlenecked by data scarcity, while native MTP co-training is locked into fixed architectures. Here is how we can recycle pre-trained MTP layers post-hoc into advanced speculative architectures like EAGLE or Medusa, gaining the best of both worlds.

Multimodal · Training Dynamics · Interactive

The Modality Regret: How the AI Industry Solved Multimodal Text Degradation

For years, transforming a highly capable text model into a multimodal model acted like a lobotomy, severely degrading its coding and mathematical reasoning. Here is the engineering history of how we diagnosed the "modality tax" and the architectural breakthroughs that finally fixed it.

World Models · Physical AI · Multimodal · Interactive

Cosmos 3: One World Model to Perceive, Reason, Generate, and Act

NVIDIA's Cosmos 3 collapses VLM + video generator + robot policy into a single 64B Mixture-of-Transformers: the same weights answer questions, synthesize video, generate audio, and predict actions, simply by changing which tokens are noisy. It ranks #1 among open-source models on the T2I, T2V, I2V, and robot policy leaderboards.

Speculative Decoding · Scaling Laws · LLM Inference · Interactive

Speculative Decoding Scaling Laws: The 200× Rule

Before training a single draft model, you can predict its optimal size: it should be ~200× smaller than the target. SDSL derives this analytically: the acceptance rate α is an affine function of draft/target perplexity (R²=0.98), and threading pre-training scaling laws through the throughput formula yields N* = 2.71×10⁻³ · M + 87M, validated on the OPT, Qwen, and LLaMA families.
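
The closed-form rule quoted here is easy to evaluate. The sketch below assumes that `M` in the slope term is the target model's parameter count and that the trailing "87M" means 87 million parameters. Both are readings of the teaser rather than confirmed definitions, and `optimal_draft_params` is a hypothetical helper name.

```python
SLOPE = 2.71e-3   # coefficient quoted in the teaser
INTERCEPT = 87e6  # assumption: "87M" read as 87 million parameters

def optimal_draft_params(target_params: float) -> float:
    """Draft size N* predicted by the affine rule, for a target with `target_params` parameters."""
    return SLOPE * target_params + INTERCEPT

for name, m in [("7B", 7e9), ("70B", 70e9)]:
    n_star = optimal_draft_params(m)
    print(f"{name} target: draft of about {n_star / 1e6:.0f}M params, "
          f"target/draft ratio of about {m / n_star:.0f}x")
```

Under this reading the constant term dominates for small targets, so the target-to-draft ratio grows with target size rather than sitting at a single fixed value.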

Attention · LLM Inference · KV Cache · Interactive

MLA: How DeepSeek Shrinks the KV Cache by 93% Without Losing Quality

Multi-head Latent Attention compresses keys and values into a 576-float latent vector — 57× smaller than standard MHA — using a low-rank projection and a matrix-absorption trick that eliminates key materialization entirely. The result: 5.76× faster generation and better-than-MHA quality in DeepSeek-V2.
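
The 576-float and 57× figures can be checked with simple arithmetic. This sketch assumes the DeepSeek-V2 configuration of 128 heads of dimension 128, a 512-dimensional compressed KV latent, and a 64-dimensional decoupled RoPE key. Those dimensions are not stated in the teaser, and the counts are per token, per layer.

```python
def mha_kv_floats(n_heads: int, head_dim: int) -> int:
    """Floats cached per token per layer by standard MHA: one key and one value per head."""
    return 2 * n_heads * head_dim

def mla_kv_floats(latent_dim: int, rope_dim: int) -> int:
    """Floats cached per token per layer by MLA: the shared latent plus the RoPE key."""
    return latent_dim + rope_dim

mha = mha_kv_floats(n_heads=128, head_dim=128)   # 32768
mla = mla_kv_floats(latent_dim=512, rope_dim=64)  # 576
ratio = mha / mla

print(f"MHA: {mha} floats, MLA: {mla} floats, ratio of about {ratio:.1f}x")
```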

LLM Training · Performance · Infrastructure · Interactive

MFU: The Number That Tells You How Hard Your GPUs Are Actually Working

Model FLOPs Utilization answers how much of your GPU's peak throughput (312 TFLOPS of dense BF16 on an A100) you actually use during training. Learn how to calculate it from first principles, what kills it (memory bandwidth, pipeline bubbles, communication), and why 30–70% is the real-world range, with an interactive calculator and benchmarks from PaLM, Megatron-LM, and FlashAttention-2.
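
A first-principles MFU estimate fits in one function. The sketch below uses the common approximation of 6 FLOPs per parameter per training token (forward plus backward) and ignores the attention term. The example model size, GPU count, and token throughput are made-up inputs, and `mfu` is a hypothetical helper name.

```python
def mfu(n_params: float, tokens_per_second: float, n_gpus: int,
        peak_flops_per_gpu: float = 312e12) -> float:
    """Model FLOPs Utilization: achieved model FLOP/s divided by aggregate peak FLOP/s."""
    achieved = 6.0 * n_params * tokens_per_second  # approx. forward + backward cost
    return achieved / (n_gpus * peak_flops_per_gpu)

# Hypothetical run: a 7B-parameter model on 8 GPUs processing 25,000 tokens/s in total.
example = mfu(n_params=7e9, tokens_per_second=25_000, n_gpus=8)
print(f"MFU of about {example:.1%}")
```

The default of 312e12 matches the peak quoted in the teaser; pass a different `peak_flops_per_gpu` for other hardware or precisions.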

MTP Training · LayerNorm · Training Diagnostics · Interactive

One Number Tells You Everything: Diagnosing MTP Training with LayerNorm Param-Norms

When training a Multi-Token Prediction module alongside your LLM, the L₂ norm of final_layernorm.weight starts at √d_model (e.g. 55.4 for d=3072) and tells you whether representations are healthy, over-specialized, or silently corrupting MTP gradient flow. This post shows what that number means, why it drifts, and what to do when it does.
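
The starting value is easy to reproduce. This sketch assumes the LayerNorm weight is initialized to all ones, which is the usual default, so its L₂ norm is exactly √d_model. The `relative_drift` helper is a hypothetical way to track movement away from that baseline, not a metric taken from the post.

```python
import numpy as np

def weight_l2_norm(weight: np.ndarray) -> float:
    """L2 norm of a LayerNorm weight vector."""
    return float(np.linalg.norm(weight))

def relative_drift(weight: np.ndarray) -> float:
    """Fractional change of the norm relative to its all-ones initial value sqrt(d_model)."""
    return weight_l2_norm(weight) / float(np.sqrt(weight.size)) - 1.0

d_model = 3072
init_weight = np.ones(d_model)          # assumed initialization
init_norm = weight_l2_norm(init_weight)  # equals sqrt(3072)

print(f"initial norm: {init_norm:.1f}, drift: {relative_drift(init_weight):+.3f}")
```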

Speculative Decoding · LLM Inference · Interactive

Domino: Causal Quality at Parallel Speed

Autoregressive drafters have quality but pay a sequential tax; parallel drafters are fast but miss intra-block causal dependencies. Domino decouples the two: a GRU head adds lightweight causal correction on top of a block-parallel drafter, gaining 16.6% in acceptance length and 12.3% more speedup with only 2.8% extra latency, and reaching a 7.92× speedup on GSM8K.

Knowledge Distillation · LLM Training · Interactive

Early Stopping Rollout: One Line That Fixes On-Policy Distillation

On-policy distillation breaks when long student rollouts poison the teacher signal — accuracy falls from 65% to 51% after 300 tokens. ESR fixes it with a single line: stop at N=100 tokens, recover 24× efficiency, and often surpass the teacher entirely.

On-Policy Distillation · LLM Post-Training · Mechanistic Analysis · Interactive

OPD Has Foresight: Why Distillation Beats RL and How to Exploit It 3×

On-policy distillation locks onto the correct update direction within 10% of training—far earlier than RL. This foresight manifests as functional redundancy avoidance (OPD ignores useless layers) and early low-rank lock-in (dominant subspaces stabilize fast). EffOPD exploits both for a 3× training speedup with no extra modules.

Efficient Pretraining · Recurrent Architecture · From Scratch · Interactive

HRM-Text: $1,500 to Match Trillion-Token Baselines

A 1B-parameter hierarchical recurrent model trained from scratch on 40 billion tokens for $1,500 reaches the benchmark neighborhood of 2–7B models trained on 4–36 trillion tokens — using 96–432× less compute. Three co-designed ingredients: dual-timescale HRM recurrence, MagicNorm stabilization, and a response-only PrefixLM training objective.

On-Policy Distillation · LLM Post-Training · Interactive

On-Policy Distillation Done Right: Three Failure Modes and One Fix

Standard on-policy distillation compares just one sampled token per update — imbalanced, drift-prone, and corrupted by tokenizer mismatches. Teacher top-K local support matching fixes all three by comparing distributions over a teacher-selected support set, delivering +19.8% math reasoning gain with no architectural changes.

Sparse Attention · Long Context · KV Cache · Interactive

Full Attention Strikes Back: 9.36× Faster in 600 Steps

Full-attention LLMs are already intrinsically sparse. RTPurbo identifies the 15% of heads that do real long-range retrieval, routes them through a 16-dimensional indexer, and uses dynamic top-p to adapt token budgets per query — reaching near-lossless accuracy with 9.36× prefill speedup at 1M context in just ~600 training steps.
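
The "dynamic top-p" idea can be illustrated with generic nucleus-style selection: keep the fewest tokens whose attention mass reaches `p`, so peaked queries get small budgets and diffuse queries get large ones. This is a sketch of that general principle only. It is not RTPurbo's indexer, and the function name and the threshold `p = 0.9` are assumptions.

```python
import numpy as np

def top_p_token_budget(attn_probs: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Indices of the smallest set of tokens whose attention probabilities sum to at least p."""
    order = np.argsort(attn_probs)[::-1]         # token indices, highest probability first
    cumulative = np.cumsum(attn_probs[order])
    k = int(np.searchsorted(cumulative, p)) + 1  # first prefix whose mass reaches p
    k = min(k, attn_probs.size)                  # guard against rounding when p is near 1
    return np.sort(order[:k])

peaked = np.array([0.85, 0.10, 0.03, 0.02])  # one query attending mostly to two tokens
flat = np.array([0.25, 0.25, 0.25, 0.25])    # one query attending evenly

print(len(top_p_token_budget(peaked)), len(top_p_token_budget(flat)))
```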

Theory · Attention Mechanisms · Empirical Bayes · Interactive

Attention as In-Context Empirical Bayes

A minimal attention-only transformer implements a principled two-stage empirical Bayes denoiser — depth (Stage 1) refines a particle approximation to the unknown prior via reverse diffusion, while a long-range skip connection (Stage 2) queries it for Bayes-optimal posterior averaging. No noise schedule needed: a fixed bandwidth and T* = σ²/2 layers suffice, with convergence proved for Gaussian priors.

Diffusion RL · Multi-Task Learning · Interactive

DiffusionOPD: Multi-Task Alignment via On-Policy Distillation

Train specialist teachers for aesthetics, OCR, and composition independently, then distill all three into one unified diffusion model along the student's own denoising trajectories. A closed-form KL objective replaces PPO's noisy estimator, achieving 0.929 average score vs 0.851 for Cascade NFT — in 35% less GPU time.

Low-Precision Training · LLM Pretraining · Interactive

FP4 at Scale: NVFP4 Pretraining for LLMs

NVIDIA trained a 12B Mamba-Transformer on 10 trillion tokens in 4-bit NVFP4 precision—matching FP8 accuracy—using four targeted fixes: mixed precision layers, Hadamard outlier smoothing, 2D weight scaling, and stochastic gradient rounding. On Blackwell GB300 this unlocks 6× the BF16 arithmetic throughput.
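
Of the four fixes, stochastic rounding is the simplest to show in isolation. The sketch below rounds to the integer grid rather than to the NVFP4 value grid, which is a deliberate simplification. It only demonstrates the property that matters for gradients: the rounded value is unbiased in expectation. `stochastic_round` is a hypothetical helper name.

```python
import numpy as np

def stochastic_round(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Round each element down or up to an integer, rounding up with probability equal to its fractional part."""
    x = np.asarray(x, dtype=np.float64)
    lower = np.floor(x)
    frac = x - lower
    round_up = rng.random(x.shape) < frac  # True with probability frac
    return lower + round_up

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(100_000, 0.3), rng)

# Round-to-nearest would map every 0.3 to 0.0; the stochastic average stays near 0.3.
print(f"mean of rounded values: {samples.mean():.3f}")
```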

Diffusion LLMs · Knowledge Distillation · Cross-Architecture · Interactive

TIDE: Shrinking Diffusion LLMs 22× Without Losing the Code Superpower

Cross-architecture distillation compresses a 16B MoE diffusion LLM into a 0.6B student with 22× less memory and a 16-point HumanEval advantage over same-size AR models. Three synergistic components — TIDAL for noise-adaptive scheduling, CompDemo for richer teacher context, and Reverse CALM for cross-tokenizer alignment — each address a unique barrier that AR distillation never had to solve.

Diffusion LMs · Scaling Laws · Continuous Diffusion · Interactive

RePlaid: Continuous Diffusion Scales Competitively with Discrete

Continuous diffusion was written off after Plaid reported a 64× compute gap vs AR. RePlaid closes it to 20× by aligning the architecture with modern discrete DLMs — and achieves 22.1 PPL on OpenWebText, better than MDLM's 23.1. Interactive visualizations show the structured token embedding geometry, the self-correcting noise schedule, and the unified scaling laws.

Diffusion LM · Tokenization · Architecture · Interactive

BitLM: Generating Phrases, Not Tokens, with Binary Diffusion

BitLM replaces the vocabulary softmax with an 18-bit binary code and a diffusion head that jointly denoises a block of 4 tokens at once — making parallel block-level generation native to the model's output interface, not a post-hoc trick. Scales smoothly to 8B on FineWeb-350BT with interactive denoising and binary hypercube visualizations.
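
An 18-bit code can address 2¹⁸ = 262,144 distinct tokens. The sketch below assumes the plainest possible mapping, the big-endian binary expansion of the token id. BitLM's actual code assignment may be different, and both helper names are hypothetical.

```python
N_BITS = 18  # code width quoted in the teaser

def token_to_bits(token_id: int) -> list[int]:
    """Big-endian 18-bit binary code for a token id (assumed mapping)."""
    if not 0 <= token_id < 2 ** N_BITS:
        raise ValueError(f"token id must be in [0, {2 ** N_BITS})")
    return [(token_id >> shift) & 1 for shift in reversed(range(N_BITS))]

def bits_to_token(bits: list[int]) -> int:
    """Inverse of token_to_bits."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

# A block of 4 tokens becomes a 4 x 18 grid of bits for the diffusion head to denoise.
block = [token_to_bits(t) for t in (0, 1, 5, 2 ** N_BITS - 1)]
print(2 ** N_BITS, len(block), len(block[0]))
```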