
Continuous Batching in vLLM: Killing Hardware Idle Time
If your GPUs are idling at 40% utilization during inference, you are burning capital on memory bottlenecks, not computation.
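Continuous batching attacks that idle time at the iteration level: instead of padding a static batch and waiting for its slowest sequence to finish, the scheduler frees a slot the moment any sequence completes and admits a waiting request into it. A minimal sketch of that scheduling loop (token counts and batch size here are hypothetical, and real engines like vLLM also manage KV-cache blocks, which this toy omits):

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy iteration-level scheduler: each step decodes one token for every
    running sequence; finished slots are refilled immediately instead of
    waiting for the whole batch to drain (static batching's idle time)."""
    waiting = deque(requests)   # (request_id, tokens_to_generate)
    running = {}                # request_id -> tokens remaining
    steps = 0
    while waiting or running:
        # Admit new requests into any free batch slots at every iteration.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # One decode step: every running sequence produces one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]  # slot freed mid-batch, not at batch end
        steps += 1
    return steps

# Six requests with uneven lengths; static batching would pad each batch
# to its longest member (8 + 5 = 13 steps for batches of 4 and 2).
print(continuous_batching([("a", 2), ("b", 8), ("c", 3),
                           ("d", 8), ("e", 1), ("f", 5)]))  # → 8
```

The uneven-length case is exactly where the win comes from: short requests stop blocking the batch, so decode slots stay busy and utilization climbs.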

Using a "draft" model costs roughly 10% more VRAM but can cut latency by 50%. Here are the mechanics of the gamble.
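The gamble is speculative decoding: a small draft model proposes several tokens cheaply, and the large target model verifies them all in a single forward pass, accepting a prefix and emitting its own token at the first mismatch. A minimal sketch of the accept/reject loop, with both models replaced by hypothetical stand-ins (the 70% acceptance rate is an assumption, not a measured figure):

```python
import random

random.seed(0)

def draft_tokens(prefix, k):
    # Hypothetical stand-in for the small, fast draft model:
    # proposes k candidate tokens in one cheap pass.
    return [f"draft_{i}" for i in range(k)]

def target_verify(prefix, candidates):
    # Hypothetical stand-in for the large target model: one forward pass
    # scores all candidates in parallel and accepts a prefix of them.
    accepted = []
    for tok in candidates:
        if random.random() < 0.7:  # assumed 70% per-token acceptance rate
            accepted.append(tok)
        else:
            break
    # On the first rejection, the target model emits its own token instead.
    correction = "target_fix" if len(accepted) < len(candidates) else None
    return accepted, correction

def speculative_step(prefix, k=4):
    candidates = draft_tokens(prefix, k)
    accepted, correction = target_verify(prefix, candidates)
    new = accepted + ([correction] if correction else [])
    return prefix + new, len(new)

seq, produced = speculative_step(["<bos>"])
print(produced)  # between 1 and k+1 tokens from one target-model pass
```

The payoff structure: every step costs one target-model forward pass regardless of outcome, so each accepted draft token is a decode step you never paid for, while a full rejection leaves you no worse than ordinary decoding (plus the draft model's overhead).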

As the AI industry moves from model training to large-scale deployment, the strategic bottleneck has shifted from parameter count to inference orchestration. This post explores how advanced techniques like RadixAttention, Chunked Prefills, and Deep Expert Parallelism are redefining the ROI of GPU clusters and creating a new standard for high-performance AI infrastructure.

An end-to-end guide to orchestrating custom Qwen3 pre-training on Google Cloud's Trillium TPUs. I dive into modifying the Qwen3 architecture for structured JSON outputs, leveraging XPK for orchestration, and serving the final artifacts with vLLM's high-performance OpenXLA backend.