Cluster Hub

AI Infrastructure

Silicon, JAX, Networking (NCCL/Ring bottlenecks), TPUs, GPU optimization, and Deep Tech.

Cluster Articles

The CUDA Monopoly Breaks: Running Unmodified CUDA on AMD GPUs in 2026

SCALE and other CUDA-compatibility layers are cracking Nvidia's software moat, letting unmodified CUDA binaries run on AMD hardware. Here is what it means for AI inference costs and enterprise...

Serverless Inference: Conquering the 5-Second Cold Start

Serverless inference promises pay-per-request economics but the five-second cold start destroys the user experience. Here is what actually works: persistent model workers, speculative warmers, hybrid...

Data Gravity: Why Your Enterprise Data Dictates Your AI Infrastructure Choice

Your data location is no longer an afterthought. When every cloud provider promises the best AI infrastructure, the real tiebreaker is where your company's enterprise data already lives. We explore...

The Kubernetes for AI Paradigm

Native K8s orchestration is evolving to handle GPU scheduling, checkpointing, and live migration at the scale that AI demands.

Benchmarking Edge Silicon: NPU vs GPU Inference

NPUs promise efficient edge LLM inference, but how do they actually compare to discrete GPUs under real production workloads?

Inference Cost Architecture: The Hidden Economics of Token Routing

Inference cost architecture: how smart model routing between frontier and distilled models creates real margin at scale. Unit economics, production examples, and the infrastructure decisions that...

Contrarian Takes on AI Infrastructure: What the Market Gets Wrong

The dominant narrative in AI infrastructure is wrong on multiple fronts. GPU supply dynamics, neocloud pricing advantages, hardware fungibility, crawl monetization, and open weights democratization —...

The AI Capital Wall: Why GPUs Are No Longer the Scarcest Resource

AI capital wall analysis: GPUs are no longer the scarcest resource. Data center capacity, liquid cooling, and power density are the real bottlenecks for scaling AI infrastructure in 2026.

The Inference Cost Wall: When Fine-Tuning Beats Frontier API Calls

The inference cost wall in AI: analyzing the inflection point where running distilled models on neocloud infrastructure beats paying per-token for frontier models.

Serverless Inference: Conquering the 5-Second Cold Start

The infrastructure hacks required to make scale-to-zero LLM inference viable for production latency.

Hardware Acceleration for Vector DBs: Beyond CPU Constraints

Vector search has hit a physical wall. Explore why CPU-bound indexing fails at scale and how FPGAs and custom ASICs are redefining the database layer.

LiteRT-LM Deep Dive: Engineering LLM Inference for the Edge

How Google's LiteRT-LM framework handles session cloning and KV-cache management to run models like Gemini Nano natively on-device without exploding your memory.

Scaling Vector Databases for High-Throughput Text Clustering

Analyzing the bottleneck of bulk clustering and using exact-match caching to reduce index compute load.

Breaking the Bandwidth Wall: Why AI Clusters are Shifting to Ultra Ethernet

To scale past 100k GPUs, the industry is replacing proprietary InfiniBand with AI-optimized Ultra Ethernet.

Multi-Cloud GPU Arbitrage: Routing Workloads Between Hyperscalers and Neoclouds

Don't lock into one vendor. Learn how to use an abstraction layer to route training and inference workloads to the cheapest available capacity across hyperscalers and neoclouds.

Semantic Caching at Scale: Vector Embeddings for 5x Latency Reduction

Moving beyond exact-match caching for repetitive zero-shot inference workloads. Learn how to architect semantic caching to slash latency and compute costs.

Rack-Scale AI Design: The End of Component Scaling

We have hit the physical limits of what a single chip can do. The new unit of compute for AI infrastructure isn't the GPU; it's the fully integrated rack.

TTFT (Time To First Token): Measuring Inference Correctly

TTFT reveals the real bottleneck in LLM inference. Learn why Time To First Token matters more than average latency, and how to separate prefill vs decode.

Hierarchical KV Caching: Scaling Context Beyond VRAM Limits

As context windows scale to a million tokens, the KV cache becomes too large for GPU memory. The solution is a multi-tiered cache that offloads data to CPU and NVMe without killing latency.

How xAI Built Grok: Training Data and Compute Infrastructure

How xAI built Grok from training data to compute infrastructure: the JAX and Rust stack, GPU cluster architecture, and why they moved beyond PyTorch.

Demystifying Google TPU SparseCore: Accelerating Recommendation Systems

How Google TPU SparseCore solves embedding lookup bottlenecks in recommender models. Learn the co-designed architecture of Trillium's SparseCores.

AI Training Chip Performance: Real Scaling Data vs Marketing Hype (Blackwell to Hopper)

AI training chip performance data: analyzing real scaling from Hopper to Blackwell. 3.2x training, 50x inference gains, and why memory bandwidth matters more than FLOPs.

The Battle for Memory: PagedAttention vs RingAttention on Kubernetes

Comparing raw memory management strategies for infinite-context enterprise agents.

KV Cache Offloading in K8s: The Stateless Truce

Your beloved stateless Kubernetes architecture is fundamentally at war with the massive, stateful memory requirements of long-context LLM inference. We need a truce.

vLLM Continuous Batching: How PagedAttention Optimizes GPU Throughput

vLLM continuous batching and PagedAttention explained: see how dynamic KV cache allocation eliminates memory fragmentation and boosts GPU throughput by 3x–5x.

Deploying Agentic AI as a Service (AaaS)

A deep dive into deploying agentic AI as a service (AaaS).

Speculative Decoding Infrastructure: Squeezing Latency without Hardware Upgrades

The bottleneck for LLMs is memory bandwidth, not compute. Discover how to use speculative decoding on GCP to achieve 3x speedups by using small "draft" models to accelerate massive "oracle" models.

HBM-Aware Load Balancing with libtpu and GKE

CPU load is a trailing indicator for AI inference. Discover how to use libtpu metrics and the GKE Gateway API to build high-density, memory-aware traffic routing for TPUs.

Beyond Vibe-Checks: Trajectory Evaluation & Synthetic Adversaries

Is your agent actually reasoning, or just lucky? Discover why trajectory analysis and synthetic red-teaming are the only ways to build production-grade autonomous systems.

Stateful Agents on K8s: Redis is Your Bottleneck, Not the Vector DB

Agents are stateless. Their memory is not. Scaling the LLM reasoning loop is trivial compared to solving the transactional concurrency of agent memory on Kubernetes.

JAX Pallas: Writing GPU Kernels for Maximum Performance

JAX Pallas is a kernel-authoring extension for JAX for writing high-performance compute kernels on GPUs and TPUs. Write optimized kernels for matrix multiplication and memory access patterns.

Single-Batch Inference: Speculative Decoding on an A100

See how speculative decoding performs for single-batch requests on an NVIDIA A100. We analyze acceptance rates, latency, and the mechanics of the draft model gamble.

My Profiling Nightmare: The Warp Stall

A war story of chasing a 5ms latency spike to a single loose thread. How to read Nsight Systems and spot Warp Divergence.

JAX XLA: Why Your GPU is Idle 40% of the Time

Recompilation is the silent killer of training throughput. If you see 'Jit' in your profiler, you are losing money. We dive into XLA internals.

The Compute-to-Cashflow Gap

The AI industry is shifting from celebrating large compute budgets to hunting for efficiency. Your competitive advantage is no longer your GPU count, but your cost-per-inference.

AI Quantization and Hardware Co-Design

Explore how quantization and hardware co-design overcome memory bottlenecks, comparing NVIDIA and Google architectures while looking toward the 1-bit future of efficient AI model development.

Network Jitter: The Silent Killer of Training

In distributed training, the slowest packet determines the speed of the cluster. We benchmark GCP's 'Circuit Switched' Jupiter fabric against AWS's 'Multipath' SRD protocol.

The Efficiency Moat - Navigating the New Economics of AI Inference

As the AI industry moves from model training to large-scale deployment, the strategic bottleneck has shifted from parameter count to inference orchestration. This post explores how advanced...

Business Case for JAX: JAX vs Custom C++ AI Training Stack Performance

Business case for JAX in AI training: compare JAX vs custom C++ training stack performance. See how compiler-first JAX reduces data movement overhead and improves throughput by 2.7x.

Scaling Structural Bias - Pre-training Custom Qwen3 on TPU v6e

An end-to-end guide to orchestrating Custom Qwen3 pre-training on Google Cloud's Trillium TPUs. I dive into modifying the Qwen3 architecture for structured JSON outputs, leveraging XPK for...

Why More GPUs Is No Longer a Viable Strategy in 2026

As hardware lead times and power constraints hit a ceiling, the competitive advantage in AI has shifted from chip volume to architectural efficiency. This article explores how JAX, Pallas, and...

Layered improvements with G4 / RTX 6000 Pro

Google Cloud’s G4 architecture delivers 168% higher throughput by maximizing PCIe Gen 5 performance. This deep dive examines the engineering stack driving these gains, from direct P2P communication...

Getting the most out of your GPUs using MIG

Understanding how to partition a single GPU into multiple isolated instances for cost-efficient AI workloads, with a deep dive into NVIDIA's MIG technology and the architectural differences between...

Why do large enterprises need a Chief AI Officer?

As organizations pivot from AI experimentation to enterprise-scale deployment, a recurring structural friction often emerges. Through my engagements with leadership teams in APAC, it has become clear...

Network Design for AI Workloads

Generative AI has shifted data center traffic patterns, making network performance the new bottleneck for model training. This post contrasts how the "Big Three" cloud providers utilize distinct...

Not All Zeros Are the Same - Sparsity Explained

Demystifying hardware acceleration and the competing sparsity philosophies of Google TPUs and Nvidia. This post connects novel architectures, like Mixture-of-Experts, to hardware design strategy and...

Stop Chasing Leaderboards - Focus on what actually matters.

AI benchmarks are fundamentally broken, putting enterprise budgets at risk. This post deconstructs the technical flaws and outlines a strategy for building internal evaluations that actually predict...

Switching Technologies in AI Accelerators

This post contrasts the switching technologies of NVIDIA and Google's TPUs. Understanding their different approaches is key to matching modern AI workloads, which demand heavy data movement, to the...

Generality vs. Specialization - The Real Difference Between GPUs and TPUs

It's not just about specs. This post breaks down the core trade-off between the GPU's versatile power and the TPU's hyper-efficient, specialized design for AI workloads.

Beyond the Hype - An Executive’s Guide to Realizing Value with Agentic AI

A guide for technology executives on how to move beyond proofs-of-concept and realize sustainable, transformative value from agentic AI by focusing on business-first strategies.

The Case for SparseCore

Large-scale recommendation models involve a two-part process. First, a "sparse lookup" phase retrieves data from memory, a task that is challenging for standard GPUs. Second, a "dense computation"...

The theory behind Technical Debt

Technical debt is not new. This weekend I went down the trail to read up on how its impact changes with the increased throughput of code generation thanks to AI. Turns out AI code generation is a double-edged...

