NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6× Tokens Per Forward Over Qwen3-8B
NVIDIA researchers have released Nemotron-Labs-Diffusion, a language model family that unifies three decoding modes in one architecture: autoregressive (AR) decoding, diffusion-based parallel decoding, and self-speculation decoding. It is available in 3B, 8B, and 14B parameter sizes with base, instruct, and vision-language variants. Self-speculation mode achieves up to 6× tokens per forward over Qwen3-8B while maintaining competitive accuracy. The model is open-source and supports flexible deployment across different concurrency scenarios.
Article intelligence
Key points
- Nemotron-Labs-Diffusion integrates AR, diffusion, and self-speculation decoding in a single model with no architectural changes. Switching modes is done at inference time by changing attention patterns.
- At 8B scale, linear self-speculation delivers 5.99× tokens per forward with 62.81% accuracy, outperforming Qwen3-8B in throughput and accuracy.
- Training uses a joint AR-diffusion objective with α = 0.3 and a two-stage schedule over 1.3T tokens (1T AR-only, then 300B with the joint objective). Models are initialized from Ministral3 and trained on 256 NVIDIA H100 GPUs.
- Open-source release on Hugging Face with simple API for all three modes; requires transformers≥5.0.0 and trust_remote_code=True.
NVIDIA researchers have released Nemotron-Labs-Diffusion, a language model family that unifies three decoding modes in one architecture. The model supports autoregressive (AR) decoding, diffusion-based parallel decoding, and self-speculation decoding. It is available in 3B, 8B, and 14B parameter sizes. The family includes base, instruct, and vision-language variants.
Sequential Decoding Limits Throughput
Standard autoregressive (AR) language models generate text one token at a time, left to right. Each token depends on all previous tokens. This sequential dependency limits GPU parallelism per generation step. The result is low hardware utilization at low batch sizes — the typical setting for single-user or edge deployment.
Diffusion language models (LMs) offer a different approach. Instead of generating tokens sequentially, they denoise multiple tokens in parallel per forward pass. This enables higher throughput. The tradeoff has been accuracy: diffusion LMs have consistently lagged behind AR models on benchmarks, requiring substantially more data to reach comparable performance. A key reason is that diffusion training treats all token permutations uniformly, rather than leveraging the strong left-to-right prior inherent in natural language.
https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_Diffusion_Tech_Report_v1.pdf?VersionId=db8_EMO8B.vmU26.jr7Le9pN3MqcUDNL
What Is a Tri-Mode Language Model?
Nemotron-Labs-Diffusion is trained on a joint AR-diffusion objective. At inference time, it operates in three modes depending on the deployment context. There are no mode-specific architectural modifications — the same weights serve all three modes.
AR mode is standard left-to-right autoregressive decoding using causal attention. This mode is best suited for high-concurrency cloud serving.
Diffusion mode denoises multiple tokens in parallel within a fixed-length block. The sequence is partitioned into contiguous blocks. Within each block, tokens attend bidirectionally. Across blocks, attention remains causal, so prior blocks can reuse their KV cache. A lightweight trained sampler predicts, per masked position, whether the model’s top-1 prediction at the current denoising step is correct. Positions predicted as correct are committed in that step. This allows the model to commit multiple tokens per forward pass.
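The block-wise attention pattern described above can be sketched as a boolean mask. This is a minimal illustration in plain Python, not the released modeling code: the function names are hypothetical, and the real implementation may build its masks differently (for example as tensors, or with extra handling during training).

```python
def block_diffusion_mask(seq_len: int, block_length: int) -> list:
    """Return mask[i][j] == True when position i may attend to position j.

    Hypothetical helper: tokens see every earlier block (causal across blocks)
    and every token in their own block (bidirectional within a block).
    """
    if block_length <= 0:
        raise ValueError("block_length must be positive")
    mask = []
    for i in range(seq_len):
        # Exclusive end index of the block that contains position i.
        block_end = min((i // block_length + 1) * block_length, seq_len)
        mask.append([j < block_end for j in range(seq_len)])
    return mask


def causal_mask(seq_len: int) -> list:
    """Standard AR mask: position i attends to positions 0..i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    # Six positions, blocks of three: rows are queries, columns are keys.
    for row in block_diffusion_mask(6, 3):
        print("".join("x" if allowed else "." for allowed in row))
```

With a block length of 1 the block mask collapses to the ordinary causal mask, which is one way to see why the same weights can serve both AR and diffusion decoding.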
Self-speculation mode uses the diffusion pathway to draft candidate tokens and the AR pathway to verify them, within the same single model. No auxiliary draft model or separate prediction head is required. The diffusion pathway generates a block of k candidate tokens in parallel. The AR pathway then runs a second forward pass over those candidates using causal attention and accepts the longest contiguous prefix that matches its own predictions. Each cycle produces between 1 and k+1 verified tokens. This contrasts with Multi-Token Prediction (MTP) methods such as Eagle3, which use small auxiliary draft heads attached to an AR backbone.
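The draft-then-verify cycle can be sketched as follows. This is a simplified illustration that assumes greedy, exact-match verification over token IDs; `accept_draft` is a hypothetical name, and the released code may apply a different acceptance rule.

```python
def accept_draft(draft_tokens, verifier_tokens):
    """Return the tokens committed in one self-speculation cycle.

    draft_tokens:    k tokens proposed by the diffusion pathway.
    verifier_tokens: k + 1 tokens from the AR pass. verifier_tokens[i] is the
                     AR prediction for the slot of draft_tokens[i], given the
                     context plus draft_tokens[:i]; the last entry is the AR
                     prediction that follows the full draft.
    """
    if len(verifier_tokens) != len(draft_tokens) + 1:
        raise ValueError("verifier must supply one more token than the draft")

    accepted = []
    for drafted, verified in zip(draft_tokens, verifier_tokens):
        if drafted != verified:
            # First mismatch: keep the AR token for this slot and stop.
            accepted.append(verified)
            return accepted
        accepted.append(drafted)

    # Every drafted token matched, so the extra AR prediction is also kept.
    accepted.append(verifier_tokens[-1])
    return accepted
```

Under this rule a fully rejected draft still yields one token (the AR prediction for the first slot), and a fully accepted draft yields k + 1, which matches the range stated above.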
Training
The joint training objective combines an AR next-token prediction loss and a block-wise diffusion denoising loss:
ℒ(θ) = ℒ_AR(θ) + α · ℒ_diff(θ)
The coefficient α is set to 0.3 across all training stages. Ablation experiments varying α from 0.1 to 1.0 show that both AR-mode and diffusion-mode accuracy peak at α = 0.3. No value in the range [0.1, 0.5] improves one mode at the expense of the other — the two objectives rise and fall together.
Two-stage training first trains the model purely on the AR objective for 1 trillion tokens, building strong left-to-right linguistic priors. Stage 2 then introduces the joint objective for 300 billion additional tokens. In ablations, two-stage training contributed +5.74% average accuracy. Adding the AR loss contributed the single largest gain at +7.48%. Global loss averaging — treating all tokens across a batch equally rather than averaging per-sequence first — contributed +2.12% by reducing gradient variance from variable diffusion masking ratios. Cumulatively, the full training pipeline improved the baseline by 16.05% average accuracy.
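To make the joint objective and the global-averaging choice concrete, here is a small sketch in plain Python. It assumes per-token losses are already computed and stored as one list per sequence; the function names are hypothetical, and the exact set of tokens each loss is averaged over in the real pipeline is not specified here.

```python
ALPHA = 0.3  # weight on the diffusion loss, as reported for all training stages


def per_sequence_mean(token_losses):
    """Average within each sequence first, then across sequences."""
    seq_means = [sum(seq) / len(seq) for seq in token_losses if seq]
    return sum(seq_means) / len(seq_means)


def global_mean(token_losses):
    """Average over every token in the batch, weighting all tokens equally."""
    flat = [loss for seq in token_losses for loss in seq]
    return sum(flat) / len(flat)


def joint_loss(ar_token_losses, diff_token_losses, alpha=ALPHA):
    """L = L_AR + alpha * L_diff, using global averaging for both terms."""
    return global_mean(ar_token_losses) + alpha * global_mean(diff_token_losses)


if __name__ == "__main__":
    # Sequence A has one masked token, sequence B has three.
    diff_losses = [[4.0], [1.0, 1.0, 1.0]]
    print(per_sequence_mean(diff_losses))  # the single token in A carries half the weight
    print(global_mean(diff_losses))        # every masked token carries equal weight
```

In the example, a sequence with a low masking ratio contributes a single token that dominates the per-sequence average. Global averaging removes that imbalance, which is consistent with the variance-reduction explanation given above.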
All models are initialized from pretrained Ministral3 base models, not trained from scratch. Training was performed on 256 NVIDIA H100 GPUs. Instruct models are trained via supervised fine-tuning (SFT) on 45 billion tokens on top of the base models, using the same joint AR-diffusion objective with α = 0.3. The training and inference pipeline is released through Megatron Bridge.
LoRA-Enhanced Linear Self-Speculation
The base diffusion-to-AR alignment in self-speculation can be improved with a LoRA adapter. This adapter is fine-tuned on the diffusion draft pathway to better align its output with the AR verifier. It targets only the o_proj layer of the attention module (rank 128, LoRA scaling α = 512, approximately 36M trainable parameters, 0.4% of the backbone). This α is the LoRA scaling hyperparameter, not the loss coefficient used in training. LoRA tuning improves tokens per forward (TPF) by 14.4%, 32.5%, and 27.6% at the 3B, 8B, and 14B scales respectively, with negligible accuracy change.
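The adapter size can be checked with simple arithmetic. The sketch below uses the standard LoRA parameter count (two low-rank matrices per adapted layer) and the standard alpha / rank scaling. The layer shape and layer count in the example are hypothetical placeholders chosen for illustration, not figures taken from the release.

```python
def lora_param_count(in_features: int, out_features: int, rank: int, num_layers: int) -> int:
    """Trainable parameters when one linear layer per block gets a LoRA adapter.

    Each adapter adds A (rank x in_features) and B (out_features x rank).
    """
    return rank * (in_features + out_features) * num_layers


def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA scales the low-rank update by alpha / rank."""
    return alpha / rank


if __name__ == "__main__":
    # ASSUMPTION: a 4096 x 4096 o_proj and 34 transformer blocks (hypothetical).
    params = lora_param_count(in_features=4096, out_features=4096, rank=128, num_layers=34)
    print(f"{params / 1e6:.1f}M trainable parameters")
    print(f"update scaling: {lora_scaling(alpha=512, rank=128)}")
```

Under these assumed dimensions the count comes out in the same range as the roughly 36M quoted above, but the actual model shapes may differ.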
Speed-of-Light Analysis
The research team reports a speed-of-light (SOL) analysis — a theoretical upper bound on tokens per forward pass achievable by the diffusion mode, assuming an oracle sampler that correctly identifies all positions that can be safely committed in parallel.
At block length 32, the SOL acceptance rate reaches 7.60× on average, exceeding 10× on coding and multilingual tasks. Current confidence-based sampling achieves approximately 3× TPF at comparable accuracy, leaving a large gap to the SOL ceiling.
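Confidence-based sampling of the kind referred to above can be sketched as a threshold rule over the still-masked positions in a block. This is an illustrative assumption rather than the released sampler: the function name is hypothetical, and the fallback that always commits the single most confident position (so that each step makes progress) is a common design choice that the source does not confirm.

```python
def select_positions_to_commit(confidences: dict, threshold: float = 0.9) -> list:
    """Pick which masked positions to commit in one denoising step.

    confidences maps a masked position index to a score in [0, 1] for the
    model's top-1 prediction at that position.
    """
    if not confidences:
        return []
    chosen = sorted(pos for pos, score in confidences.items() if score >= threshold)
    if not chosen:
        # ASSUMPTION: commit the best position so the block always advances.
        chosen = [max(confidences, key=confidences.get)]
    return chosen
```

Lowering the threshold commits more positions per forward pass and raises TPF, at the risk of committing wrong tokens. An oracle sampler, as in the SOL analysis, would commit exactly the positions whose top-1 prediction is correct.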
Comparing against linear self-speculation: the two reach similar acceptance rates (6.82× for linear self-speculation vs. 7.60× for SOL). The gap in realized tokens per forward pass (TPF) is much larger: 6.02× for SOL versus 3.41× for linear self-speculation, which puts SOL about 76.5% higher. Linear self-speculation requires two forward passes per cycle (one diffusion draft, one AR verify) and accepts only a contiguous prefix. These two constraints cap its real TPF well below SOL, even when drafter and verifier are well aligned.
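The relationship between acceptance length and realized TPF for linear self-speculation can be worked through numerically. The sketch assumes TPF is simply acceptance length divided by the number of forward passes per cycle. That reading is an inference from the figures above, which it reproduces, rather than a formula stated in the source. The SOL figure is used as reported.

```python
def tokens_per_forward(acceptance_length: float, forwards_per_cycle: int) -> float:
    """Tokens committed per forward pass over one draft/verify cycle."""
    return acceptance_length / forwards_per_cycle


# Linear self-speculation: one diffusion draft pass plus one AR verify pass.
linear_tpf = tokens_per_forward(acceptance_length=6.82, forwards_per_cycle=2)

# SOL value as reported; it is not derived here.
sol_tpf = 6.02

# Relative advantage of SOL over linear self-speculation.
relative_gap = sol_tpf / linear_tpf - 1.0

if __name__ == "__main__":
    print(f"linear self-speculation TPF: {linear_tpf:.2f}")
    print(f"SOL advantage: {relative_gap:.1%}")
```

Halving the acceptance length accounts for the two-pass cost, which is why similar acceptance rates translate into very different TPF values.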
Benchmark Results
On the 10-task instruct evaluation (HumanEval, MBPP, LiveCodeBench-CPP, GSM8K, Math500, AIME24, AIME25, GPQA, IFEval, MMLU):
NLD-8B AR mode: 63.61% average accuracy, versus 62.75% for Qwen3-8B and 58.02% for Ministral3-8B-Instruct.
NLD-8B diffusion mode: 63.18% average accuracy with 2.57× TPF.
NLD-8B LoRA-tuned linear self-speculation: 62.81% average accuracy with 5.99× TPF.
NLD-8B quadratic self-speculation: 64.04% average accuracy with 6.38× TPF.
On SPEED-Bench with SGLang on an NVIDIA GB200 GPU, linear self-speculation achieves 4× higher throughput than Qwen3-8B and 3.3× speedup over the NLD-8B AR mode at concurrency 1 (3.97× with an optimized CUDA kernel). Compared to Qwen3-8B-Eagle3, linear self-speculation delivers a 2.4×, 2.3×, and 1.8× speedup at batch size 1 on GB200, RTX Pro 6000, and DGX Spark respectively.
Acceptance length is the underlying reason for this advantage. Across SPEED-Bench categories, NLD achieves average acceptance lengths of 5.46 (native) and 6.82 (with LoRA) tokens per draft step. Eagle3 averages 2.75 and Qwen3-9B-MTP averages 4.24. On the four diffusion-friendly categories — coding, math, reasoning, and multilingual — the gap widens further: 8.69 for NLD-LoRA versus 2.81 for Eagle3.
At 14B scale with LoRA-tuned linear self-speculation, NLD-14B achieves 66.36% average accuracy at 5.96× TPF, outperforming Qwen3-14B at 65.17% accuracy in AR mode.
The vision-language model, Nemotron-Labs-Diffusion-VLM-8B, extends the same framework to multimodal tasks. In linear self-speculation mode, it achieves 3.63× to 7.45× TPF — the higher end for responses over 200 tokens — with a 0.1% average accuracy drop versus AR mode.
Marktechpost’s Visual Explainer
What is Nemotron-Labs-Diffusion?
A single model checkpoint. Three decoding modes. No architecture changes.
Nemotron-Labs-Diffusion is a language model family from NVIDIA that combines autoregressive (AR) decoding, diffusion-based parallel decoding, and self-speculation decoding in one set of weights. You switch modes at inference time by changing the attention pattern — no separate model files needed.
Sizes: 3B · 8B · 14B
Variants: Base · Instruct · VLM
Requires: transformers ≥ 5.0.0
License: NVIDIA Nemotron Open Model
- 5.99×: tokens per forward vs Qwen3-8B (linear self-speculation, 8B)
- 3.3×: throughput over AR mode at concurrency 1 (GB200)
- 2.4×: faster than Qwen3-8B-Eagle3 at batch size 1 (GB200)
- 63.61%: average accuracy, 8B AR mode, vs 62.75% for Qwen3-8B
The Three Decoding Modes
Same weights. Different attention pattern. Pick based on your deployment.
Mode 1
AR Decoding
Standard left-to-right generation using causal attention. One token per forward pass. Compatible with all existing AR serving infrastructure.
Best for: high-concurrency cloud serving where GPU compute is fully saturated by batching.
Mode 2
Diffusion Decoding
Denoises multiple tokens per block in parallel. Adjust the threshold value to trade accuracy for higher throughput. 2.57× TPF at threshold 0.9.
Best for: flexible accuracy–throughput tradeoff from one model.
Mode 3
Self-Speculation
Diffusion drafts k tokens in parallel. AR verifies them in a second pass. Accepts the longest matching prefix. No auxiliary model or extra heads needed.
Best for: low-concurrency or single-user inference where per-user speed matters most.
How mode switching works: You call a different method on the same model object — ar_generate(), generate(), or linear_spec_generate(). The model weights do not change.
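One way to wrap the three entry points behind a single call is a small dispatcher. The method names come from the description above; the mode labels, the helper name, and the pass-through of keyword arguments are hypothetical conveniences, and the keyword arguments each method accepts should be checked against the model card.

```python
# Hypothetical mode labels mapped to the method names mentioned above.
MODE_TO_METHOD = {
    "ar": "ar_generate",
    "diffusion": "generate",
    "self_speculation": "linear_spec_generate",
}


def generate_with_mode(model, mode, prompt_ids, **kwargs):
    """Call the generate method for the requested decoding mode.

    The same model object serves all modes; only the method differs.
    kwargs are forwarded unchanged (ASSUMPTION: the caller supplies arguments
    that the chosen method actually accepts).
    """
    try:
        method_name = MODE_TO_METHOD[mode]
    except KeyError:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_TO_METHOD)}") from None
    return getattr(model, method_name)(prompt_ids, **kwargs)
```

A call such as `generate_with_mode(model, "ar", prompt_ids, max_new_tokens=512)` would then route to `model.ar_generate(...)`, and switching the deployment mode becomes a configuration change rather than a code change.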
Installation
Two pip installs. CUDA-capable GPU required.
Loading the model requires trust_remote_code=True because custom modeling code is bundled with the checkpoint on Hugging Face. Install peft only if you plan to use the LoRA-enhanced self-speculation mode.
Step 1 — core dependencies
pip install "transformers>=5.0.0" torch accelerate
Step 2 — optional: LoRA-enhanced self-speculation
pip install peft
Step 3 — load model (swap model ID for 3B or 14B)
from transformers import AutoModel, AutoTokenizer
import torch

# Available: nvidia/Nemotron-Labs-Diffusion-3B
#            nvidia/Nemotron-Labs-Diffusion-8B
#            nvidia/Nemotron-Labs-Diffusion-14B
repo = "nvidia/Nemotron-Labs-Diffusion-8B"

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)
Basic Usage — All Three Modes
Prepare the prompt once. Choose a generate call.
All three modes share the same tokenization step. Each generate call returns nfe (number of function evaluations) alongside the output IDs, which tells you how many forward passes were used to produce the output.
Shared — build prompt_ids
history = [{"role": "user", "content": "Explain gradient descent."}]
prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
AR Mode — standard autoregressive
out_ids, nfe = model.ar_generate(prompt_ids, max_new_tokens=512)
Diffusion Mode — parallel decoding (threshold adjusts speed vs accuracy)
out_ids, nfe = model.generate(
    prompt_ids,
    max_new_tokens=512,
    block_length=32,
    threshold=0.9,
    eos_token_id=tokenizer.eos_token_id,
)
Decode output — same for all modes
text = tokenizer.batch_decode(
    out_ids[:, prompt_ids.shape[1]:],
    skip_special_tokens=True,
)[0]
print(f"Output: {text}\nNFE: {nfe}")
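Since every call returns nfe, a measured tokens-per-forward figure can be derived for any mode. The helper below is a hypothetical convenience, not part of the model's API. It assumes nfe is a plain integer count and that the output IDs include the prompt, as the slicing in the decode step suggests.

```python
def measured_tpf(prompt_len: int, output_len: int, nfe: int) -> float:
    """Tokens generated per forward pass for a single sequence.

    prompt_len: number of prompt tokens, e.g. prompt_ids.shape[1]
    output_len: total length of the returned sequence, e.g. out_ids.shape[1]
    nfe:        forward passes reported by the generate call
    """
    if nfe <= 0:
        raise ValueError("nfe must be a positive forward-pass count")
    new_tokens = output_len - prompt_len
    if new_tokens < 0:
        raise ValueError("output is shorter than the prompt")
    return new_tokens / nfe


# Example with the variables from the snippets above (not executed here):
# tpf = measured_tpf(prompt_ids.shape[1], out_ids.shape[1], nfe)
```

In AR mode this ratio should sit near 1. Values above 1 in diffusion or self-speculation mode indicate how many tokens each forward pass committed on average. If the output is padded after an end-of-sequence token, the padding should be excluded before computing the ratio.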
Self-Speculation + LoRA Drafter
Highest per-user throughput. Optional LoRA for higher acceptance length.
Without LoRA, average acceptance length is 5.46 tokens per draft step. With LoRA it rises to 6.82, versus 2.75 for Eagle3 and 4.24 for Qwen3-9B-MTP. The LoRA adapter i
[truncated for AI cost control]