AI Engineering · 4 min read
Demystifying Nvidia Blackwell: FP4 and Microscaling Explained
We break down the new FP4 format and microscaling scale factors in the NVIDIA Blackwell architecture, how they differ from FP8, and their impact on AI training and inference.
- Blackwell's 20 Petaflops of AI performance requires FP4 precision and the Second-Gen Transformer Engine to avoid model collapse.
- Micro-Tensor Scaling dynamically scales 16-element matrix blocks, preserving precision for outlier features better than per-tensor scaling.
- Achieving maximum speed requires fine-tuning models with 2:4 Structured Sparsity to utilize the dedicated Sparse Tensor Cores.
- Tensor Memory (TMEM) adds a dedicated on-chip buffer for tensor operations, reducing Shared Memory bottlenecks.
- The NVLink Switch unifies memory across racks, requiring developers to shift from Data Parallelism to Model Parallelism.
Beyond the Hype: The “20 Petaflops” Asterisk
When Jensen Huang announced the Blackwell B200 with “20 Petaflops” of AI performance, the engineering world cheered. But that number comes with a significant caveat: it requires FP4 precision. For context, most production LLMs today run on FP16 or BF16.
If you naively quantize a modern LLM (like Llama 3) down to 4 bits, it collapses. The “outlier features” (specific neurons that activate 100x stronger than their neighbors) get clipped, and the model turns to gibberish.
Blackwell solves this not with better math, but with better hardware: the Second-Gen Transformer Engine.
The Physics of NVFP4: Micro-Tensor Scaling
Standard quantization techniques (like INT8) often use Per-Tensor or Per-Channel scaling. This means an entire row of the weight matrix shares a single scaling factor.
- Problem: If one weight is 100.0 and the rest are 0.1, the scale factor stretches to accommodate the 100.0, and the 0.1s get crushed to zero. Precision is lost.
Blackwell introduces Micro-Tensor Scaling with the NVFP4 format (a variant of the OCP Microscaling formats). Instead of one scale factor for the whole tensor, Blackwell divides the matrix into blocks of 16 elements.
- Block Size: 16 elements.
- Scale: Each block gets a shared FP8 (E4M3) scale factor. (The baseline OCP MXFP4 format instead uses 32-element blocks with an E8M0 shared exponent.)
- Payload: Individual weights are 4-bit (E2M1).
Key Insight: This allows the hardware to “zoom in” on the quiet parts of the matrix and “zoom out” for the loud outliers dynamically. The dynamic range adapts every 16 elements, a far finer granularity than the H100’s per-tensor scaling.
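To make this concrete, here is a toy simulation of per-tensor vs. 16-element block scaling for 4-bit values. The E2M1 value grid and the 100.0-vs-0.1 outlier example come from the discussion above; everything else (the helper names, the use of a simple max/6 scale) is illustrative, not how the hardware literally does it.

```python
import numpy as np

FP4_GRID = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6])  # positive E2M1 magnitudes

def quantize(x, scale):
    # Snap |x| / scale to the nearest representable FP4 magnitude.
    mags = np.abs(x) / scale
    idx = np.argmin(np.abs(mags[:, None] - FP4_GRID[None, :]), axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

w = np.array([100.0] + [0.1] * 31)  # one outlier, 31 "quiet" weights

# Per-tensor: a single scale must stretch to cover the 100.0 outlier.
per_tensor = quantize(w, np.abs(w).max() / 6)

# Micro-tensor: one scale per block of 16 elements.
blocks = w.reshape(-1, 16)
scales = np.abs(blocks).max(axis=1) / 6
micro = np.concatenate([quantize(b, s) for b, s in zip(blocks, scales)])

print(per_tensor[1])  # the 0.1 weights collapse to 0 under one global scale
print(micro[16])      # blocks without the outlier keep 0.1 intact
```

Only the block containing the outlier pays the precision cost; the other block recovers its small weights almost exactly.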
2:4 Structured Sparsity: The Hardware Tax
Blackwell doesn’t just want small numbers; it wants fewer numbers. The architecture includes Sparse Tensor Cores designed for 2:4 Structured Sparsity: in every contiguous block of 4 weights, at least 2 must be zero.
- The Hardware Hook: A dedicated compressor/decompressor unit in the L2 cache handles this on the fly.
- The Software Tax: You cannot just run standard training. You must prune your model into the 2:4 pattern (typically by magnitude) and fine-tune to recover accuracy, or train with a sparsity-inducing regularization term (like an L1 penalty) that pushes the weights toward the pattern.
If you don’t prune, you don’t get the speedup. The “20 Petaflops” number is only unlocked if you play by the hardware’s rules.
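A minimal sketch of the magnitude-based pruning step: in every contiguous group of 4 weights, keep the 2 largest magnitudes and zero the rest. Real workflows wrap this in training tooling and then fine-tune; this only shows the mask itself.

```python
import numpy as np

def prune_2_4(w):
    # Reshape into contiguous groups of 4 weights.
    groups = w.reshape(-1, 4)
    # Indices of the two smallest-magnitude weights in each group.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    mask = np.ones_like(groups)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.array([0.9, -0.05, 0.4, 0.01,  -1.2, 0.3, 0.02, -0.7])
print(prune_2_4(w))  # exactly two nonzeros survive per group of 4
```

The hardware then stores only the surviving values plus a small metadata index, which is what the dedicated compression unit exploits.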
Tensor Memory (TMEM)
Perhaps the most underrated change in Blackwell is the introduction of Tensor Memory (TMEM). In the Hopper generation (H100), Tensor Cores relied heavily on Shared Memory (SMEM) and Registers to stage data. As models grew, SMEM became a bottleneck.
Blackwell adds a dedicated slab of on-chip memory specifically for tensor operations, sitting between the L2 Cache and the SMs.
- Function: It acts as a warp-synchronous buffer for matrix operands.
- Benefit: It frees up Shared Memory for other tasks (like activation storage), easing pressure on SMEM capacity and bandwidth.
The NVLink Switch 7.2T: The Rack is the GPU
Finally, Blackwell effectively deprecates the single GPU as the unit of compute. With the NVLink Switch 7.2T, the 72 GPUs in a GB200 NVL72 rack share a unified memory address space. Programming at this scale requires a shift from Data Parallelism (DDP) to Model Parallelism: you don’t “send data” to another GPU; you just load it from a memory address that happens to physically reside in a different tray.
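The conceptual shift can be sketched in plain numpy: tensor (model) parallelism shards one layer's weight matrix across devices instead of replicating it. The two "devices" here are just array slices; no NVLink is involved, and the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))    # a batch of activations
W = rng.standard_normal((64, 128))  # one layer's weight matrix

# Data parallelism would replicate W on every GPU.
# Tensor parallelism instead shards its columns across "devices":
W0, W1 = np.split(W, 2, axis=1)     # device 0 and device 1 each hold half

# Each device computes its slice of the output; on GB200 the gather is
# just a read through the unified NVLink address space.
y = np.concatenate([x @ W0, x @ W1], axis=1)

assert np.allclose(y, x @ W)        # matches the unsharded layer exactly
```

The math is unchanged; what changes is where each shard of the weight matrix physically lives.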
Conclusion
Blackwell is a beast, but it is not a drop-in replacement. To utilize it:
- Adapt for FP4: Implement Micro-scaling calibration in your inference engine (likely via TensorRT-LLM).
- Prune for Sparsity: Retrain/finetune with 2:4 constraints.
- Rethink Memory: Optimize kernels to leverage TMEM.