
Semantic Caching at Scale: Vector Embeddings for 5x Latency Reduction
Moving beyond exact-match caching for repetitive zero-shot inference workloads. Learn how to architect semantic caching to slash latency and compute costs.
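As a taste of the pattern, here is a minimal sketch of a semantic cache: embed each prompt, look up the nearest stored embedding, and serve the cached response when cosine similarity clears a threshold. Everything here is illustrative, not the article's production design; the `embed` stub stands in for a real embedding model, and the 0.85 cutoff is an assumed tuning value.

```python
import numpy as np


def embed(text: str) -> np.ndarray:
    """Placeholder embedding function (hypothetical). In practice, call a
    real embedding model and unit-normalize its output."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)


class SemanticCache:
    """Minimal semantic cache: stores (embedding, response) pairs and
    serves a cached response when a new query's embedding is close enough."""

    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold  # cosine-similarity cutoff (illustrative)
        self.keys: list[np.ndarray] = []
        self.values: list[str] = []

    def get(self, query: str) -> str | None:
        """Return a cached response for a semantically similar query, or None."""
        if not self.keys:
            return None
        q = embed(query)
        sims = np.stack(self.keys) @ q  # dot product == cosine (unit vectors)
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        """Store the query's embedding alongside the computed response."""
        self.keys.append(embed(query))
        self.values.append(response)


cache = SemanticCache()
if (hit := cache.get("What is semantic caching?")) is None:
    response = "...run inference on a cache miss..."  # expensive model call
    cache.put("What is semantic caching?", response)
```

At scale, the brute-force scan over stored embeddings would be replaced by an approximate nearest-neighbor index, but the check-then-fill flow stays the same.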
