Engineering · 3 min read
Blackwell's Sparse Attention Engines: The Reality of FP4
FP4 isn't just 'lower precision' - it requires a fundamental rethink of how activation outliers are handled. We dive into the bit-level implementation of NVFP4, Micro-Tensor Scaling, and the new Tensor Memory hierarchy.
Beyond the Hype: The “20 Petaflops” Asterisk
When Jensen Huang announced the Blackwell B200 with “20 Petaflops” of AI performance, the engineering world cheered. But that number comes with a significant caveat: it requires FP4 precision. For context, most production LLMs today run on FP16 or BF16.
If you naively quantize a modern LLM (like Llama 3) to 4 bits, it collapses. The “outlier features” - specific neurons that activate 100x stronger than their neighbors - get clipped, and the model turns to gibberish.
Blackwell solves this not with better math, but with better hardware: the Second-Gen Transformer Engine.
The Physics of NVFP4: Micro-Tensor Scaling
Standard quantization techniques (like INT8) typically use Per-Tensor or Per-Channel scaling. This means the whole tensor, or at best an entire row of the weight matrix, shares a single scale factor.
- Problem: If one weight is 100.0 and the rest are 0.1, the scale factor has to stretch to accommodate the 100.0, and the 0.1s get crushed to zero. Precision is lost.
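To make the failure concrete, here is a minimal numpy sketch of that bullet (illustrative numbers only, using signed 4-bit integer codes rather than a real FP4 format): a single outlier dictates the per-tensor scale factor, and every small weight rounds to zero.

```python
import numpy as np

# A single outlier (100.0) among small weights (0.1), quantized with ONE
# scale factor for the whole tensor, using signed 4-bit codes in [-7, 7].
w = np.array([100.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1], dtype=np.float32)

scale = np.abs(w).max() / 7.0             # the outlier dictates the scale
q = np.clip(np.round(w / scale), -7, 7)   # 4-bit integer codes
dq = q * scale                            # dequantize

print(q)    # [7. 0. 0. 0. 0. 0. 0. 0.] -> every 0.1 rounded to code 0
print(dq)   # the outlier survives; all of the 0.1s are now exactly 0.0
```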
Blackwell introduces Micro-Tensor Scaling (a refinement of the OCP Microscaling format). Instead of one scale factor for the whole tensor, Blackwell divides the matrix into blocks of 16 elements.
- Block Size: 16 elements.
- Scale: Each block shares an 8-bit FP8 (E4M3) scale factor, rather than the plain E8M0 exponent used by the base OCP MX formats.
- Payload: Each individual weight is stored in 4 bits (E2M1).
Key Insight: This allows the hardware to “zoom in” on the quiet parts of the matrix and “zoom out” for the loud outliers dynamically. The dynamic range adapts every 16 elements instead of once per tensor, something Hopper’s per-tensor FP8 scaling simply could not do.
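Here is the same toy example with block-wise scales. This is not NVFP4's actual bit layout (real NVFP4 stores E2M1 payloads with an FP8 scale per block; the sketch uses 4-bit integer codes and FP32 scales), but a 16-element block size is enough to show why a local outlier stops poisoning the rest of the tensor.

```python
import numpy as np

BLOCK = 16  # NVFP4-style micro-block size

def quantize_blockwise(w, block=BLOCK):
    # One scale per 16-element block. Real NVFP4 stores E2M1 (4-bit float)
    # payloads with an FP8 scale per block; signed 4-bit integer codes with
    # an FP32 scale are used here purely for illustration.
    blocks = w.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)      # avoid divide-by-zero
    q = np.clip(np.round(blocks / scales), -7, 7)    # 4-bit codes
    return (q * scales).reshape(w.shape)             # dequantized

# One 'quiet' block of small weights, one block containing a loud outlier.
w = np.full(32, 0.1, dtype=np.float32)
w[20] = 100.0

dq = quantize_blockwise(w)
print(dq[:16])   # the quiet block keeps its 0.1s: its scale stays small
print(dq[16:])   # only the outlier's own block pays the precision price
```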
2:4 Structured Sparsity: The Hardware Tax
Blackwell doesn’t just want small numbers; it wants fewer numbers. The architecture includes Sparse Tensor Cores designed for 2:4 Structured Sparsity: in every contiguous block of 4 weights, at least 2 must be zero.
- The Hardware Hook: A dedicated compressor/decompressor unit in the L2 cache handles this on the fly.
- The Software Tax: You cannot just run standard training. You have to prune and fine-tune: the usual workflow zeroes the two smallest-magnitude weights in every group of four, then retrains to recover accuracy (a sparsity-inducing regularizer such as an L1 penalty can help push weights toward the pattern).
If you don’t prune, you don’t get the speedup. The “20 Petaflops” number is only unlocked if you play by the hardware’s rules.
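For reference, here is a toy magnitude-based 2:4 pruning pass in numpy. It only sketches the constraint itself; production pipelines (NVIDIA’s ASP library, for example) apply a mask like this across the weight matrices and then handle the recovery fine-tuning.

```python
import numpy as np

def prune_2of4(w):
    # In every contiguous group of 4 weights, zero the 2 smallest
    # magnitudes and keep the 2 largest (magnitude-based 2:4 pruning).
    groups = w.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # 2 smallest per group
    mask = np.ones_like(groups)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (groups * mask).reshape(w.shape)

w = np.array([0.9, -0.05, 0.4, 0.01,    # group 1
              -0.2, 0.7, 0.03, -0.6])   # group 2
print(prune_2of4(w))
# group 1 keeps 0.9 and 0.4; group 2 keeps 0.7 and -0.6
```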
Tensor Memory (TMEM)
Perhaps the most underrated change in Blackwell is the introduction of Tensor Memory (TMEM). In the Hopper generation (H100), Tensor Cores relied heavily on Shared Memory (SMEM) and Registers to stage data. As models grew, SMEM became a bottleneck.
Blackwell adds a dedicated slab of on-chip memory inside each SM, reserved specifically for tensor operations.
- Function: It holds Tensor Core operands and accumulators, so MMA results no longer have to live in the register file.
- Benefit: It frees Shared Memory and registers for other work (like activation staging and software pipelining), stretching the usable on-chip memory budget considerably.
The NVLink Switch 7.2T: The Rack is the GPU
Finally, Blackwell effectively deprecates the single GPU as a unit of compute. With the NVLink Switch 7.2T, the 72 GPUs in a GB200 rack share a unified memory address space. Implementing models on this requires a shift from Data Parallelism (DDP) to Model Parallelism. You don’t “send data” to another GPU; you just load it from a memory address that happens to physically reside in a different tray.
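As a mental model, here is a hardware-free numpy sketch of that shift: each “GPU” owns a column shard of a layer’s weights and computes only its slice of the output. On a real GB200 rack the shards would live in different trays and the final gather would ride over NVLink; here the sharding is simulated in-process, so only the arithmetic carries over.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1024))        # activations: (batch, hidden)
W = rng.standard_normal((1024, 4608))     # one linear layer's full weight

NUM_GPUS = 72                             # one GB200 NVL72 rack
shards = np.split(W, NUM_GPUS, axis=1)    # each "GPU" owns a column shard

# Each device computes only its slice of the output from the full input.
partials = [x @ w_shard for w_shard in shards]

# On the rack this gather rides over NVLink; here it is a concatenation.
y = np.concatenate(partials, axis=1)

assert np.allclose(y, x @ W)              # identical to the unsharded layer
```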
Conclusion
Blackwell is a beast, but it is not a drop-in replacement. To utilize it:
- Adapt for FP4: Implement Micro-scaling calibration in your inference engine (likely via TensorRT-LLM).
- Prune for Sparsity: Retrain/finetune with 2:4 constraints.
- Rethink Memory: Optimize kernels to leverage TMEM.



