· Engineering  · 3 min read

Benchmarking FP8 Stability: Where Gradients Go to Die

FP8 is the new frontier for training efficiency, but it breaks in the most sensitive layers. We dissect the E4M3/E5M2 split and how to spot divergence.


“Why spend 16 bits on a number when 8 will do?”

That’s the promise of FP8 training on NVIDIA’s Hopper and Blackwell architectures. You get twice the throughput and half the memory footprint. But there is no such thing as a free lunch in physics. When you squeeze a 32-bit gradient into 8 bits, you aren’t just losing precision; you are inviting activation and gradient outliers that overflow the format and can crash your training loop.

The Hybrid Split: E4M3 vs. E5M2

FP8 isn’t just one format. It’s a delicate balance between two different ways of representing a number.

  • E4M3 (Forward Pass): 4 bits for the exponent, 3 for the mantissa. This is for your weights and activations. You need more precision here because these values are relatively stable.
  • E5M2 (Backward Pass): 5 bits for the exponent, 2 for the mantissa. This is for your gradients. Gradients can vary by orders of magnitude in a single step, so you need more “dynamic range” to avoid hitting zero (underflow) or infinity (overflow). (The two ranges are compared in the sketch after this list.)
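You can see the trade-off directly from the dtype metadata. Here is a quick sketch using PyTorch’s float8 dtypes (assumes PyTorch 2.1 or newer; this is plain PyTorch, not Transformer Engine):

import torch

# Compare the representable range of the two FP8 variants.
for name, dtype in [("E4M3", torch.float8_e4m3fn), ("E5M2", torch.float8_e5m2)]:
    info = torch.finfo(dtype)
    print(f"{name}: max={info.max}, smallest normal={info.tiny}, eps={info.eps}")

# E4M3: max=448.0,   smallest normal=0.015625,   eps=0.125  -> finer steps, narrow range
# E5M2: max=57344.0, smallest normal=6.1035e-05, eps=0.25   -> coarser steps, wide range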

If you use the wrong format in the wrong part of the graph, your loss curve will diverge, stall, or go straight to NaN.

Spotting the Divergence

When you’re training in FP8, you can’t just look at the total loss. You have to monitor the AMAX (Maximum Absolute Value) of your tensors.

In the NVIDIA Transformer Engine, the system uses “Delayed Scaling.” It looks at the amax values recorded over previous iterations to decide how much to “scale” the numbers in the current iteration.
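In rough pseudocode, the update works something like this. This is a simplified sketch for intuition only; Transformer Engine’s actual update rule and bookkeeping differ in detail:

import math

E4M3_MAX = 448.0  # largest representable magnitude in E4M3

def update_scale(amax_history, margin: int = 0) -> float:
    """Pick a power-of-two scale so recent amax values land just under E4M3_MAX."""
    amax = max(amax_history)            # the "max" algo: most conservative choice
    if amax == 0.0:
        return 1.0
    # Leave `margin` extra powers of two of headroom below the format ceiling.
    exp = math.floor(math.log2(E4M3_MAX / amax)) - margin
    return 2.0 ** exp

# Example: update_scale([2.5, 3.0, 4.0]) -> 64.0 (4.0 * 64 = 256, safely under 448)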

If your model encounters an activation outlier that is 10x larger than anything it saw in the last 100 steps, the FP8 scaling factor will be too aggressive. The number will “clip” (hit the maximum representable value), and your gradients will suddenly become meaningless.
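Here is that failure in miniature, using PyTorch’s float8 dtype as a stand-in. The numbers are hypothetical, not a real training trace, and the explicit clamp stands in for hardware saturation:

import torch

amax_prev = 4.0                              # largest magnitude seen in recent history
scale = 448.0 / amax_prev                    # scale chosen from that stale history
x = torch.tensor([0.5, -2.0, 40.0])          # 40.0 is a 10x outlier vs. history
x_fp8 = (x * scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
print(x_fp8.float() / scale)                 # tensor([ 0.5, -2.0,  4.0]) -- the outlier is gone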

Diagnostic Checklist:

  1. Check the Attention Norms: This is usually where FP8 stability breaks first. LayerNorm and RMSNorm are extremely sensitive to precision loss; a hook-based probe for spotting these layers is sketched after this list.
  2. Monitor the Scaling Factor: If the scale keeps dropping toward zero (equivalently, its reciprocal scale_inv keeps growing), your model is repeatedly recovering from overflows.
  3. Compare with BF16: If you hit a NaN and you’re not sure why, run 10 steps in BF16. If it doesn’t crash, your problem is FP8 stability, not your learning rate.
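A minimal sketch of such a probe, assuming a plain PyTorch module. The helper name and warning threshold are hypothetical, not part of Transformer Engine:

import torch
import torch.nn as nn

E4M3_MAX = 448.0

def attach_amax_probes(model: nn.Module, warn_ratio: float = 0.1):
    """Log each layer's activation amax so outlier-prone layers stand out."""
    def make_hook(name):
        def hook(module, inputs, output):
            if torch.is_tensor(output):
                amax = output.detach().abs().max().item()
                if amax > warn_ratio * E4M3_MAX:   # within 10x of the format's ceiling
                    print(f"[amax] {name}: {amax:.1f}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))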

The “Delayed Scaling” Trap

Here is a snippet of what a scaling recipe looks like in the Transformer Engine. Notice how we have to track history to stay stable.

import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Define the stability guardrails
fp8_recipe = DelayedScaling(
    margin=0,            # How much "room" to leave at the top of the range
    interval=1,          # How often to update the scaling factor
    fp8_format=Format.HYBRID, # Use E4M3 for forward, E5M2 for backward
    amax_history_len=1024,    # Smooth out the outliers
    amax_compute_algo="max"   # Be conservative with the range
)

# Wrap your training step
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output = model(input)
    loss = criterion(output, target)
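One thing the snippet doesn’t show: the backward pass and optimizer step typically run outside the autocast context. A hedged continuation, assuming an optimizer has already been created:

# Backward runs outside the fp8_autocast block; with Format.HYBRID the
# gradient GEMMs still use E5M2 under the hood.
loss.backward()
optimizer.step()
optimizer.zero_grad()

Also note that fp8_autocast only affects Transformer Engine modules (te.Linear, te.LayerNormMLP, and friends); plain nn.Linear layers pass through untouched.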

Conclusion

FP8 is production-ready, but it isn’t “hands-off.” It requires you to think like an infrastructure engineer, not just a model user.

You have to respect the limits of the bits. If your gradients are dying, don’t just lower the learning rate. Look at your scaling history. Look at your outliers. In the world of 8-bit training, the physics of the numbers is just as important as the logic of the code.

#FP8 #NVIDIA #DeepLearning #Engineering #Performance #Hopper #Blackwell
