NVIDIA Developer Blog · 11 min read

How to Optimize Transformer-Based Models for Low-Precision Training

AI-Generated Summary

  • Low-precision formats such as FP8 and NVFP4, supported by NVIDIA Hopper and NVIDIA Blackwell GPUs, accelerate transformer training by optimizing GEMM operations, but the realized speedup is constrained by quantization overhead and kernel selection intricacies.
  • Microbenchmarking tools convert transformer configs and batch sizes into specific GEMM shapes, enabling empirical benchmarking across BF16, MXFP8, and NVFP4 precisions, with separate profiling for Fprop, Dgrad, and Wgrad to account for aspect ratio effects and kernel dispatch differences.
  • Empirical results on CodonFM 5B demonstrate that while large GEMMs (e.g., MLP Down) achieve up to 1.66x speedup for NVFP4 over MXFP8 in autocast mode, smaller GEMMs like attention output benefit minimally, and the theoretical hardware speedup (up to 3.48x in prequantized mode) is reduced in practice by quantization, block scaling, and stochastic rounding overheads unique to NVFP4.

AI-generated content may summarize information incompletely. Verify important information.

Transformer architectures are the backbone of many modern large language and generative AI models. As these models grow in size, training runs consume more GPU hours and more engineering iteration time. Accelerating transformers is therefore not just a performance optimization: it directly affects how quickly teams can experiment and how large a model they can afford to train. NVIDIA Hopper and NVIDIA Blackwell GPUs help address this problem with hardware support for low-precision formats, including FP8 and, on Blackwell, NVFP4.

Transformers spend much of their training time in GEMMs, and low-precision formats speed up training mainly by making those matrix multiplications faster and cheaper. However, your transformer config does not tell you which GEMMs are actually running in your model. If you want to understand where training time goes, you need to turn your transformer config and batch size into the exact M×K×N matrix shapes your model executes, then benchmark those shapes across precisions. This will help you determine the optimal precision for your architecture before committing to a more expensive training run.

NVIDIA Transformer Engine (TE) handles quantization and kernel dispatch, which is what makes low-precision formats usable in practice. This post shows you how to move from high-level model settings to concrete GEMM workloads, profile them with a microbenchmark, and estimate where lower precision will actually translate into speedups for your transformer-based models. The use case features CodonFM, a language model for biology focused on RNA.

Model configuration and training inputs

Suppose you’re working with a 5B-parameter model such as CodonFM 5B. It will have a config such as:

hidden_size: 4096
intermediate_size: 16384
num_attention_heads: 32
num_hidden_layers: 24

Your training configuration is:

micro_batch_size: 31
sequence_length: 512

The benchmark tool takes these hyperparameters directly. A single command derives the GEMM shapes, benchmarks them across precisions, and computes the full speedup analysis:

python benchmark.py \
  --hidden_size 4096 \
  --intermediate_size 16384 \
  --num_attention_heads 32 \
  --num_hidden_layers 24 \
  --micro_batch_size 31 \
  --sequence_length 512 \
  -o ./images/b300_model_config_speedup.png

Note: To skip the Blackwell-specific precisions, add --no-fp8 --no-fp4. With both flags set, the tool benchmarks BF16 plus the three tensor-wise FP8 recipes that work on Hopper.

  • --no-fp8 disables MXFP8 
  • --no-fp4 disables NVFP4

Using autocast mode versus prequantizing

By default, the tool runs in autocast mode, which is what TE does during training: inputs are dynamically quantized to the target precision before each GEMM, so the measured time includes both the quantization cost and the GEMM kernel itself. This provides you with the realistic per-GEMM picture during a training step.

The tool computes M = 31 × 512 = 15,872 tokens, derives all 12 GEMM shapes, benchmarks each across enabled precisions, and prints the full results. Fprop, Dgrad, and Wgrad shapes are all benchmarked separately to capture the impact of different matrix aspect ratios on kernel selection. 
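
The shape derivation described above can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the function name `derive_gemm_shapes`, the layer labels, the fused QKV projection (N = 3 × hidden_size), and the non-gated MLP are all assumptions. The Dgrad and Wgrad shapes follow the usual linear-layer backward pass. Note that `num_attention_heads` does not change any of these projection shapes.

```python
from typing import Dict, Tuple

Shape = Tuple[int, int, int]  # (M, K, N): an (M x K) matrix times a (K x N) matrix


def derive_gemm_shapes(hidden_size: int, intermediate_size: int,
                       micro_batch_size: int, sequence_length: int
                       ) -> Dict[str, Dict[str, Shape]]:
    """Return Fprop, Dgrad, and Wgrad shapes for the four linear layers in a block."""
    m = micro_batch_size * sequence_length  # tokens per micro-batch

    # (K, N) = (input features, output features) of each forward GEMM.
    # Assumptions: fused QKV projection, non-gated MLP.
    layers = {
        "qkv_proj": (hidden_size, 3 * hidden_size),
        "attn_out": (hidden_size, hidden_size),
        "mlp_up": (hidden_size, intermediate_size),
        "mlp_down": (intermediate_size, hidden_size),
    }

    shapes: Dict[str, Dict[str, Shape]] = {}
    for name, (k, n) in layers.items():
        shapes[name] = {
            "fprop": (m, k, n),  # Y = X @ W
            "dgrad": (m, n, k),  # dX = dY @ W^T, so K and N swap
            "wgrad": (k, m, n),  # dW = X^T @ dY, so the token dimension is reduced
        }
    return shapes


shapes = derive_gemm_shapes(hidden_size=4096, intermediate_size=16384,
                            micro_batch_size=31, sequence_length=512)
for layer, passes in shapes.items():
    for pass_name, (m, k, n) in passes.items():
        print(f"{layer:9s} {pass_name:5s} M={m:6d} K={k:6d} N={n:6d}")
```

Four layers times three passes gives the 12 shapes, and the QKV projection shows why Dgrad deserves its own measurement: its K and N trade places between the forward and backward GEMM.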

Figure 1. Per-layer GEMM time on NVIDIA B300 SXM6 AC in autocast mode, broken down by precision (BF16, FP8 Current, FP8 Delayed, MXFP8, NVFP4) and stage (Fprop+Dgrad and Wgrad). BF16 is the slowest at about 12.8 ms total and NVFP4 the fastest at about 6.5 ms total, roughly a 2x speedup, with the FP8 variants in between.

To isolate raw GEMM kernel performance, add --pre-quantize. This prequantizes all inputs once before the timed loop, so the measured time reflects only the GEMM kernel execution—no dynamic quantization, no block scaling computation, no format conversion during the timed region.
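
The difference between the two modes comes down to whether quantization sits inside or outside the timed loop. The sketch below shows that structure with plain Python stand-ins. `time_gemm`, `fake_quantize`, and `fake_gemm` are hypothetical names, not part of the benchmark tool or TE. A real GPU benchmark would also need warmup iterations and device synchronization, which this sketch omits.

```python
import time
from typing import Callable


def time_gemm(quantize: Callable, gemm: Callable, x, w,
              iters: int = 10, pre_quantize: bool = False) -> float:
    """Return mean seconds per iteration for gemm(quantize(x), quantize(w))."""
    if pre_quantize:
        # Kernel-only: quantize once, before the timed region starts.
        xq, wq = quantize(x), quantize(w)
    start = time.perf_counter()
    for _ in range(iters):
        if not pre_quantize:
            # Autocast-like: the quantization cost is paid on every iteration.
            xq, wq = quantize(x), quantize(w)
        gemm(xq, wq)
    return (time.perf_counter() - start) / iters


# Stand-ins that only count calls; they do not model any real number format.
calls = {"quantize": 0}


def fake_quantize(values):
    calls["quantize"] += 1
    return [round(v) for v in values]


def fake_gemm(a, b):
    return sum(p * q for p, q in zip(a, b))


x, w = [0.4, 1.6, 2.2], [1.1, 0.9, 3.0]

time_gemm(fake_quantize, fake_gemm, x, w, iters=5, pre_quantize=False)
autocast_calls = calls["quantize"]   # 2 inputs x 5 iterations

calls["quantize"] = 0
time_gemm(fake_quantize, fake_gemm, x, w, iters=5, pre_quantize=True)
prequant_calls = calls["quantize"]   # 2 inputs, quantized once

print(autocast_calls, prequant_calls)
```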

Note that FP8 DelayedScaling always runs in autocast mode, even with --pre-quantize because it relies on an amax history that requires dynamic quantization. Its times are therefore not directly comparable to other precisions in prequantized mode.

python benchmark.py \
  --hidden_size 4096 \
  --intermediate_size 16384 \
  --num_attention_heads 32 \
  --num_hidden_layers 24 \
  --micro_batch_size 31 \
  --sequence_length 512 \
  --pre-quantize \
  -o ./images/b300_model_config_speedup_prequant.png

Figure 2. Per-layer GEMM time on NVIDIA B300 SXM6 AC in prequantized mode, isolating raw kernel throughput without dynamic quantization overhead. NVFP4 takes about 3.8 ms total against about 13.1 ms for BF16, which shows what the FP4 tensor cores deliver once quantization overhead is removed.

Comparing the autocast and prequantized speedups tells you exactly how much quantization overhead costs: NVFP4 versus BF16 goes from 1.98x (autocast) to 3.48x (kernel-only). The gap between these two numbers is the overhead from dynamic quantization, Hadamard transforms, and block scaling that occurs in each training step.
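
You can turn those two speedups into a rough overhead share. The sketch below assumes the BF16 baseline time is the same in both modes. That is only approximately true here, since the two figures report about 12.8 ms and 13.1 ms. The function name is hypothetical.

```python
def quantization_overhead_fraction(autocast_speedup: float,
                                   kernel_speedup: float) -> float:
    """Fraction of the autocast low-precision time not spent in the GEMM kernel.

    With a shared BF16 baseline time T, the autocast time is T / autocast_speedup
    and the kernel-only time is T / kernel_speedup. Their ratio is therefore
    autocast_speedup / kernel_speedup, and the remainder is overhead.
    """
    return 1.0 - autocast_speedup / kernel_speedup


# NVFP4 versus BF16 speedups quoted above: 1.98x autocast, 3.48x kernel-only.
frac = quantization_overhead_fraction(1.98, 3.48)
print(f"{frac:.1%} of NVFP4 autocast GEMM time is quantization overhead")
```

Under that assumption, a little over two-fifths of the NVFP4 autocast time goes to work other than the GEMM kernel itself.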

Use autocast results for predicting real training speedups. This is what TE actually does during training. Use prequantized results to understand whether quantization overhead is the bottleneck, or to compare raw tensor core throughput across precisions independent of the quantization implementation.

Interpreting the results for a real model

This section walks through how to interpret these results for a real model. Using the same CodonFM 5B config, we ran the full model config benchmark on NVIDIA B300. The per-shape NVFP4 versus MXFP8 speedups from the Fprop results are as follows:

GEMM        MXFP8 ms / NVFP4 ms  =  speedup
QKV proj:   0.579 / 0.392  =  1.48x
Attn out:   0.269 / 0.256  =  1.05x  (barely faster: overhead nearly matches GEMM gain)
MLP up:     0.924 / 0.635  =  1.46x
MLP down:   1.076 / 0.649  =  1.66x

Take note of the following points: 

  • The attention output GEMM receives minimal benefit from lower precision: only a 1.05x speedup over the MXFP8 baseline. Its weight matrix is the smallest in the layer (4096×4096), barely large enough for the faster kernel to overcome the quantization overhead. By contrast, the much larger MLP Down GEMM delivers 1.66x for NVFP4 over MXFP8 on the same hardware, because it is big enough to amortize that overhead.
  • The big GEMMs show real but subtheoretical gains. The FP4 tensor cores deliver 1.46x to 1.66x over MXFP8 on the large GEMMs. This is well short of the theoretical 2x to 3x from the hardware spec. Once you include the attention output GEMM, the blended Fprop speedup drops to 1.47x. After adding Wgrad times, non-GEMM overhead and NVFP4-specific quantization costs, the end-to-end gap between NVFP4 and MXFP8 in training is consistent with these kernel-level numbers.
  • FP8 DelayedScaling is surprisingly competitive on NVIDIA Blackwell. At 7.80 ms/layer in autocast mode, it outperforms both FP8 CurrentScaling (9.15 ms) and MXFP8 (8.98 ms). In prequantized mode, FP8 CurrentScaling pulls ahead (6.81 ms versus 8.12 ms), but DelayedScaling still runs in autocast mode under --pre-quantize, so that comparison sets a kernel-only time against a time that includes quantization. Taken together, the numbers suggest the DelayedScaling amax-history approach has lower quantization overhead but similar raw kernel throughput. This is a good example of how comparing autocast and prequantized results surfaces different winners depending on whether you measure with or without the quantization tax.
  • The prequantized results reveal the true kernel potential. Running with --pre-quantize removes quantization overhead entirely, and NVFP4 versus BF16 jumps from 1.98x (autocast) to 3.48x (kernel-only). This shows the FP4 tensor cores are delivering real speedups. It’s the quantization overhead in autocast mode that narrows the gap.
  • The Fprop versus Dgrad comparison reveals that the 2x approximation is imprecise for quantized formats. While BF16 Dgrad is within 2% of Fprop, quantized formats show 5–13% slower Dgrad sums. The QKV Proj Dgrad is especially asymmetric—33–51% slower than Fprop for FP8/FP4—because swapping K (4096) and N (12288) dramatically changes the matrix aspect ratio and kernel selection. This is exactly why the tool benchmarks Fprop and Dgrad separately rather than counting Fprop time twice.
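
The per-shape ratios and the blended Fprop figure quoted above can be reproduced from the four pairs of times. In this sketch the dictionary name and layer labels are hypothetical, and the times are taken to be milliseconds. The blend is computed as a ratio of summed times, which is what matches the 1.47x figure.

```python
# Fprop times per GEMM as (MXFP8, NVFP4), copied from the results above.
fprop_times = {
    "qkv_proj": (0.579, 0.392),
    "attn_out": (0.269, 0.256),
    "mlp_up": (0.924, 0.635),
    "mlp_down": (1.076, 0.649),
}

# Per-shape speedup of NVFP4 over MXFP8.
speedups = {name: mxfp8 / nvfp4 for name, (mxfp8, nvfp4) in fprop_times.items()}

# Blended speedup: total MXFP8 time over total NVFP4 time, so large GEMMs
# carry more weight than small ones.
total_mxfp8 = sum(mxfp8 for mxfp8, _ in fprop_times.values())
total_nvfp4 = sum(nvfp4 for _, nvfp4 in fprop_times.values())
blended = total_mxfp8 / total_nvfp4

for name, s in speedups.items():
    print(f"{name:9s} {s:.2f}x")
print(f"blended   {blended:.2f}x")
```

A plain average of the four ratios would give a lower number, because it would weight the small attention output GEMM as heavily as the MLP GEMMs.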

Once you have the estimated GEMM-only speedup, compare it against your observed end-to-end training speedup:

  • GEMM speedup ≈ training speedup: GEMMs dominate the step, everything is working as expected
  • GEMM speedup >> training speedup: Overhead outside of GEMMs is eating the gains. For NVFP4 in particular, this overhead includes Random Hadamard transforms on Wgrad inputs, stochastic rounding on gradients, 2D block scaling for weights, and the extra memory pass for per-tensor amax computation. These are all additional ops that MXFP8 doesn’t need, and they can significantly narrow the gap even if the raw FP4 GEMMs are much faster
  • GEMM speedup ≈ 1.0 even in the microbenchmark: The FP4 kernels aren’t actually faster at these shapes, or they’re silently falling back to FP8
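
The three cases above can be written as a small triage helper. This is only a sketch: the function name, the 10% tolerance, and the example training speedups are assumptions chosen for illustration, not measured values or thresholds from the tool.

```python
def triage(gemm_speedup: float, training_speedup: float, tol: float = 0.10) -> str:
    """Classify a run by comparing GEMM-only and end-to-end speedups."""
    if gemm_speedup <= 1.0 + tol:
        # The microbenchmark itself shows no gain.
        return "no kernel gain: check for fallback to a higher precision"
    if training_speedup >= gemm_speedup * (1.0 - tol):
        # End-to-end speedup tracks the GEMM speedup.
        return "GEMM-bound: working as expected"
    # GEMMs got faster but the training step did not follow.
    return "overhead-bound: non-GEMM work is absorbing the gains"


# 1.47 is the blended Fprop speedup from above; the second arguments are hypothetical.
print(triage(1.47, 1.40))
print(triage(1.47, 1.10))
print(triage(1.02, 1.00))
```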

The last case is especially worth checking. Set NVTE_LOG_LEVEL=1 or inspect with NVIDIA Nsight Systems to confirm that TE is actually dispatching FP4 kernels. TE can silently fall back to FP8 or BF16 for layers or ops that don’t support FP4 yet, which would explain identical performance with no other symptoms. You can also compare GPU memory usage between MXFP8 and NVFP4 runs. If memory is nearly identical, that’s a strong signal that FP4 weights aren’t actually being stored.
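
For the memory check, it helps to know roughly how much smaller quantized weights should be. The sketch below is an idealized estimate that assumes the standard block layouts: NVFP4 as 4-bit elements with one 8-bit scale per 16 elements, and MXFP8 as 8-bit elements with one 8-bit scale per 32 elements. It ignores high-precision master weights, optimizer state, activations, transposed copies, and per-tensor scales, so it covers only the part of the footprint that should shrink. The function name and the non-gated MLP are assumptions.

```python
def quantized_weight_bytes(num_elements: int, bits_per_element: int,
                           block_size: int, scale_bits: int = 8) -> int:
    """Bytes for the quantized elements plus one scale factor per block."""
    data_bytes = num_elements * bits_per_element // 8
    scale_bytes = (num_elements // block_size) * scale_bits // 8
    return data_bytes + scale_bytes


hidden, intermediate, layers = 4096, 16384, 24

# Linear-layer weights per transformer block: QKV, attention output, MLP up, MLP down.
elements_per_layer = hidden * (3 * hidden + hidden + intermediate + intermediate)

nvfp4 = quantized_weight_bytes(elements_per_layer, bits_per_element=4, block_size=16)
mxfp8 = quantized_weight_bytes(elements_per_layer, bits_per_element=8, block_size=32)

print(f"weights per layer: {elements_per_layer:,} elements")
print(f"NVFP4: {nvfp4 * layers / 2**30:.2f} GiB, MXFP8: {mxfp8 * layers / 2**30:.2f} GiB")
print(f"NVFP4 / MXFP8 = {nvfp4 / mxfp8:.3f}")
```

If the measured gap between the two runs is far smaller than this estimate would suggest, that is consistent with FP4 weights not being stored.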

Get started benchmarking your model for low-precision training 

Low-precision training speedups depend heavily on the actual GEMM shapes your model runs. Running in low precision does not automatically translate into end-to-end training gains, especially once quantization overhead, kernel selection, and non-GEMM operations are included. By turning a transformer config into concrete M×K×N workloads, you can benchmark BF16, MXFP8, and NVFP4 on the shapes that matter for your model before committing to a full training run.

Benchmark your GEMMs to see which precision is right for you. To get started, check out the benchmark script. For the full documentation and to understand how these shapes are derived, see the GEMM profiling tutorial in the Transformer Engine documentation.

Use this benchmark to:

  • Set realistic training-speedup expectations from the autocast results
  • Find out from the prequantized results whether you’re bottlenecked on kernels or on quantization
  • Screen candidate model configs before committing to a training run, which makes the tool a useful architecture co-design instrument

About the Authors

About Jonathan Mitchell
Jonathan is currently a Machine Learning Engineer on NVIDIA's BioNeMo team, where he develops scalable training algorithms for Transformers in biological applications. He previously worked as a Software Engineer on NVIDIA's Autonomous Vehicle (AV) Perception team. Jonathan completed his PhD in Computer Science at UCLA, focusing on generative modeling and adversarial robustness.
About Paweł Gadziński
Paweł Gadziński is a deep learning performance engineer at NVIDIA, specializing in the development of the Transformer Engine library. He is passionate about deep learning frameworks and accelerating large-scale model training performance. He earned his degree in Computer Science from the University of Warsaw.
About Zoey Zhang
Zoey Zhang is the product manager for AI training in Digital Biology at NVIDIA. Her background spans software engineering and machine learning research roles, with a degree in Biomedical Engineering from the University of Waterloo specializing in Medical AI and Computing. Zoey is passionate about accelerating scientific discovery and the development of life-saving treatments through AI and accelerated computing.
About Kyle Tretina
Kyle Tretina is a product marketing leader at NVIDIA, focused on advancing AI for digital biology and drug discovery. He drives the strategy and storytelling behind BioNeMo and our work with BioPharma, shaping how next-generation foundation models and GPU-accelerated microservices transform molecular and protein design. With a PhD in molecular microbiology and immunology, Kyle bridges science and strategy, translating breakthroughs in AI, chemistry, and biology into platforms that accelerate discovery for researchers, startups, and pharmaceutical companies worldwide.
