NVIDIA Introduces a 4-Bit Pretraining Methodology Using NVFP4, Validated on a 12B Hybrid Mamba-Transformer at 10T Token Horizon
NVIDIA introduces a 4-bit pretraining methodology built around the NVFP4 microscaling format — combining selective BF16 layers, 16×16 Random Hadamard Transforms on Wgrad inputs, 2D weight scaling, and stochastic rounding on gradients — validated on a 12B hybrid Mamba-Transformer trained on 10 trillion tokens, the longest publicly documented 4-bit pretraining run, with downstream accuracy closely tracking the FP8 baseline (62.58% vs 62.62% on MMLU-Pro).
Article intelligence
Key points
- NVFP4 is a 4-bit microscaling format natively supported on Blackwell Tensor Cores, quantizing only linear-layer GEMMs while keeping other components in higher precision.
- Trained a 12B hybrid Mamba-Transformer on 10T tokens, achieving 62.58% on MMLU-Pro vs 62.62% for FP8 baseline.
- Four techniques ensure convergence: selective BF16 layers (~16%), 16×16 Random Hadamard Transforms, 2D weight block scaling, and stochastic rounding on gradients.
- NVFP4 achieves lower loss than MXFP4 at the same token budget (1.5% vs 2.5% relative loss gap at 1T tokens) and offers up to 3× throughput over FP8 on GB300.
Why it matters
This matters because nVFP4 is a 4-bit microscaling format natively supported on Blackwell Tensor Cores, quantizing only linear-layer GEMMs while keeping other components in higher precision.
Technical impact
May affect model selection, inference cost, product capability, and evaluation benchmarks.
Pretraining frontier-scale LLMs in FP8 is now standard practice, but moving to 4-bit floating point has remained an open research problem because narrower formats compress dynamic range and amplify quantization error at long token horizons. A new research from NVIDIA describes a pretraining methodology built around NVFP4, a 4-bit microscaling format supported natively by Blackwell Tensor Cores, and validates it by pretraining a 12-billion-parameter hybrid Mamba-Transformer on 10 trillion tokens. The research team state this is the longest publicly documented training run in 4-bit precision to date. The resulting model attains 62.58% on MMLU-Pro 5-shot versus 62.62% for the FP8 baseline, and is supported in NVIDIA’s Transformer Engine.
What NVFP4 Actually is
To understand why NVFP4 is important, it helps to revisit how microscaling formats work. In a microscaling (MX) format, a contiguous block of low-precision elements shares a single scale factor, which is used to map the block back into a wider numerical range during the matrix multiply. MXFP4 uses 32-element blocks where each element is stored as E2M1 — 1 sign bit, 2 exponent bits, 1 mantissa bit — encoding only the values ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, and ±6. Block scale factors are stored in UE8M0, which restricts them to powers of two.
NVFP4 changes three things. First, the block size drops from 32 to 16 elements, narrowing the dynamic range each scale has to cover. Second, block scale factors are stored in E4M3 rather than UE8M0, trading exponent range for mantissa precision so the per-block amax (absolute maximum) can be mapped much closer to the FP4 maximum representable. Third, NVFP4 adds a second scaling level: an FP32 per-tensor scale that remaps values so the E4M3 block scales themselves stay in range. The result is that at least 6.25% of values in each block — the per-block amax — are represented at near-FP8 precision, while the remainder sit in FP4.
On NVIDIA Blackwell, FP4 GEMMs run at 4× BF16 throughput on GB200 and 6× on GB300, which translates to roughly 2× and 3× speedups over FP8. Operand memory footprint is approximately halved compared to FP8.
https://arxiv.org/pdf/2509.25149
What’s Quantized — and What Isn’t
Only the GEMMs inside linear (fully-connected) layers Fprop, Dgrad, and Wgrad actually run in NVFP4. Embeddings, the output projection head, normalization layers, non-linearities, and all attention components (softmax and the query-key and attention score-value batched GEMMs) stay in BF16 or FP32. Model weights, weight gradients used for accumulation across microbatches and data-parallel replicas, and optimizer states are kept in FP32. Tensor parallel reductions run in BF16.
The Four-Part Training Methodology
Quantizing every linear-layer GEMM to NVFP4 with default settings (1×16 block scaling everywhere, round-to-nearest-even on every tensor, no transforms) diverges early in training. NVIDIA’s approach stabilizes it with four components, and ablation studies on the 12B model show each is necessary.
Selective high precision: Linear layers in the first two and the final eight of the 62 blocks (about 16% of all linear layers) are kept in BF16. Ablations indicated that the final blocks are the sensitive ones because they require more dynamic range than FP4 provides; keeping only the final four blocks in BF16 was also enough for stable convergence.
Random Hadamard Transforms (RHT): Outliers in weight gradients are spread into an approximately Gaussian distribution by multiplying the input tiles with a 16×16 Hadamard matrix combined with a random ±1 sign vector. Because the orthogonal transforms cancel inside the dot-product, no math correction is needed in the GEMM. The d=16 size was chosen empirically: d=4 hurt convergence, d=128 gave similar results. RHT is applied only to the inputs of the weight-gradient (Wgrad) GEMM, and a single random sign vector is shared across all linear layers. Randomization itself was a no-op at the 1.2B scale but measurably improved the 12B run.
Two-dimensional (2D) block scaling for weights: Standard NVFP4 scales 1×16 blocks along the dot-product dimension. Because the backward pass transposes the weight tensor, the forward and backward passes end up with different quantized weights, breaking the chain rule. NVIDIA’s fix is to scale weights in 16×16 blocks so the same quantized representation is used in both passes. Activations and gradients keep 1×16 scaling, since they are less sensitive to this inconsistency.
Stochastic rounding on gradients: Round-to-nearest-even introduces systematic bias when applied to gradient tensors. Stochastic rounding rounds probabilistically based on distance to the two nearest representable values, removing that bias. The research team explicitly notes in research paper that stochastic rounding is detrimental when applied to forward-pass tensors, so it is restricted to gradients.
Results on the 12B Hybrid Mamba-Transformer
The 12B model uses the Nemotron-Nano-12B-v2-Base architecture — 62 blocks (6 Self-Attention, 28 FFN, 28 Mamba-2), hidden dimension 5120, FFN dimension 20480 — trained with a Warmup-Stable-Decay schedule (constant LR through 80% of training, decay over the final 20%), batch size 736, sequence length 8192. The FP8 reference baseline follows the DeepSeek-V3 methodology: E4M3 elements, 128×128 weight blocks, 1×128 activation and gradient blocks, with the first block and last two blocks kept in BF16.
NVFP4 validation loss stays within 1% of the FP8 baseline during the stable phase and widens to slightly above 1.5% during decay. Downstream accuracy is comparable across most benchmarks: MMLU 76.57% vs 77.36%, GSM8K CoT 92.27% vs 89.08%, MATH 81.48% vs 83.32%, AGIEval English CoT 70.31% vs 67.01%. Coding shows the largest gap — HumanEval+ 57.43% vs 59.93%, MBPP+ 55.91% vs 59.11% — which the research team attributes partly to noisy final-checkpoint evaluation. The research team also documents a precision-switching technique: transitioning the forward pass from NVFP4 to BF16 starting at 8.2T tokens (about 18% of the schedule) reduced relative loss error from 1.5% to 0.5%.
NVFP4 vs MXFP4
On a separate 8B hybrid Mamba-Transformer trained on 1T tokens, NVFP4 reached a relative loss error of about 1.5% versus BF16, while MXFP4 stayed near 2.5%. To close the gap, MXFP4 required 1.36T tokens to match the NVFP4 1T-token loss — a 36% token overhead. The research team attributes the difference to NVFP4’s smaller block size and E4M3 scales, which preserve more of the FP4 dynamic range than MXFP4’s power-of-two UE8M0 scales (which can waste up to one binade and the ±4, ±6 samples in the worst case).
Marktechpost’s Visual Explainer
NVFP4 Pretraining Guide
01 / 10
● NVIDIA Technical Report
Pretraining Large Language Models with NVFP4
A 4-bit floating-point training recipe validated on a 12-billion-parameter hybrid Mamba-Transformer trained on 10 trillion tokens — the longest publicly documented 4-bit pretraining run to date.
12B
Parameters
10T
Training Tokens
62.58%
MMLU-Pro (vs 62.62 FP8)
SOURCE — arXiv:2509.25149v2 · NVIDIA · Available in Transformer Engine
01 — Context
Why move from FP8 to 4-bit pretraining
FP8 training is now standard for frontier LLM pretraining. Moving to FP4 promises a 2× to 3× boost in arithmetic throughput over FP8 and approximately half the operand memory — but narrower formats compress dynamic range and amplify quantization error at long token horizons.
The challenge is to preserve training stability and downstream accuracy across multi-trillion-token runs. This report presents a recipe that does both, using NVFP4, a 4-bit microscaling format with native support on NVIDIA Blackwell Tensor Cores.
GB200 Throughput
BF16 baseline 1×
FP8 2×
FP4 (NVFP4) 4×
GB300 Throughput
BF16 baseline 1×
FP8 2×
FP4 (NVFP4) 6×
02 — The Format
What NVFP4 actually stores
Each element is encoded as E2M1 — 1 sign, 2 exponent, 1 mantissa bit — representing one of: ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6.
Every block of 16 contiguous elements shares a single E4M3 scale factor. A second FP32 per-tensor scale sits on top to keep the E4M3 block scales in range. The result: at least 6.25% of values in each block (the per-block amax) sit at near-FP8 precision.
FP8 scale
6
0.5
-2
-4
1
0
3
-1
2
4
-3
0.5
-1
2
0
4
E4M3 block scale Block amax (mapped to FP4 max) 16 FP4 elements
03 — Format Comparison
How NVFP4 differs from MXFP4
NVFP4 makes three design changes to the microscaling approach that meaningfully improve representation fidelity at 4 bits.
MXFP4
Block size 32
Element E2M1
Block scale UE8M0
Scale type Power of 2
Tensor scale None
NVFP4
Block size 16
Element E2M1
Block scale E4M3
Scale type Fractional
Tensor scale FP32
MXFP4’s power-of-two UE8M0 scales can waste up to one binade of dynamic range and lose the ±4 and ±6 FP4 samples after scale rounding. NVFP4’s E4M3 scales map the block amax much closer to the FP4 maximum.
04 — Scope
What runs in NVFP4 — and what doesn’t
Only the three GEMMs inside linear layers — Fprop, Dgrad, and Wgrad — actually run in NVFP4. Everything else stays in higher precision.
In NVFP4
Linear Fprop GEMM
Linear Dgrad GEMM
Linear Wgrad GEMM
In BF16 / FP32
Embeddings · Output head
Normalization layers
Non-linearities
Attention (softmax, QK, score-V)
Master weights · Optimizer states
TP reductions (BF16)
The “FP4 training” label applies to the most compute-heavy GEMMs, not to the full forward and backward graph.
05 — The Recipe
Four techniques required for convergence
Quantizing every linear-layer GEMM to NVFP4 with default settings — 1×16 block scaling everywhere, round-to-nearest-even, no transforms — diverges early in training. The recipe stabilizes it with four components. Ablations show each is necessary.
1
Selective High Precision
Keep ~16% of linear layers in BF16, concentrated in the final blocks. For the 12B model: first 2 + final 8 of 62 blocks.
2
Random Hadamard Transforms (RHT)
16×16 Hadamard matrix + random ±1 sign vector, applied only to Wgrad inputs. d=4 was worse; d=128 was similar to d=16.
3
2D Block Scaling for Weights
16×16 block scales for weights so forward and backward see the same quantized representation. Activations and gradients keep 1×16 scaling.
4
Stochastic Rounding on Gradients
Probabilistic rounding removes systematic gradient bias. Detrimental on forward-pass tensors — restrict to gradients only.
06 — Training Setup
The 12B hybrid Mamba-Transformer
The model uses the Nemotron-Nano-12B-v2-Base architecture: 62 blocks consisting of 6 Self-Attention, 28 FFN, and 28 Mamba-2 blocks.
Architecture
Blocks 62
Hidden dim 5120
FFN dim 20480
Q heads 40
KV heads 8
Mamba state dim 128
Training
Tokens 10T
Batch size 736
Sequence length 8192
Schedule WSD 80/20
Peak LR 4.5e-4
Weight decay 0.1
FP8 reference baseline follows DeepSeek-V3: E4M3 elements, 128×128 weight blocks, 1×128 activation/gradient blocks, with the first block and last two in BF16.
07 — Downstream Results
NVFP4 matches FP8 across most benchmarks
Validation loss stays within 1% of FP8 during the stable phase, widening to slightly above 1.5% during decay. Downstream accuracies tracked below.
BenchmarkFP8NVFP4
MMLU-Pro 5-shot62.6262.58
MMLU77.3676.57
AGIEval English CoT67.0170.31
GSM8K CoT89.0892.27
MATH83.3281.48
MGSM81.8785.53
HumanEval+59.9357.43
MBPP+59.1155.91
ARC Challenge91.8191.81
Coding shows the widest gap. Switching the forward pass to BF16 at 8.2T tokens (last 18%) reduces relative loss error from 1.5% to 0.5%.
08 — Format Efficiency
NVFP4 vs MXFP4 on the same 8B model
On an 8B hybrid Mamba-Transformer trained on the same data, NVFP4 converged to a meaningfully better loss than MXFP4 in the same token budget.
Loss vs BF16 @ 1T tokens
NVFP4 ~1.5% gap
MXFP4 ~2.5% gap
Tokens to match NVFP4 loss
NVFP4 1.00T
MXFP4 1.36T (+36%)
The 36% token overhead trans
[truncated for AI cost control]