Stateful Agents on K8s: Redis is Your Bottleneck, Not the Vector DB
Agents are stateless. Their memory is not. Scaling the LLM reasoning loop is trivial compared to solving the transactional concurrency of agent memory on Kubernetes.

The typical tutorial on Agentic architecture ends right where the real pain begins. You follow the guide, you write your LangGraph workflow, you define your tools, and you spin up a quick Python script. You ask your agent a question in the terminal, it reasons through three steps, hits a web search API, and returns a beautiful answer.
You containerize the application into a Docker image, deploy it to a Google Kubernetes Engine (GKE) cluster, expose it behind a LoadBalancer, and you pat yourself on the back. You have built a production-ready AI agent.
Then the marketing team launches the feature. Five hundred users log in concurrently and start chatting with the agent.
Suddenly, users start reporting bizarre behavior. An agent conversing with User A replies with data extracted from User B’s private account. Connections time out. The GKE pods start hitting OOM (Out of Memory) kills. The entire system cascades into a smoking crater.
What you have is not a production-ready agent. You built a stateless web server and attempted to run highly stateful, long-running, asynchronous conversational loops inside of it.
Scaling the LLM (the brain) is relatively trivial; you just pay Vertex AI or Anthropic for more API bandwidth. Scaling the inference cluster is just a matter of adding more T4 or L4 nodes.
The real architectural bottleneck in the Agentic era is not the model. It is the memory.
The Checkpointing Problem
When you build a multi-step agent using a framework like LangGraph or Genkit, you are building a cyclical graph. The agent receives a prompt, decides to use a tool, waits for the tool to return, evaluates the tool output, and then either decides to use another tool or return to the user.
This loop can take anywhere from 10 seconds to 5 minutes depending on the latency of the external APIs it is calling.
During that 5-minute window, the Kubernetes Pod executing the Python script holds the entire conversational context in its local application memory. It holds the array of messages, the internal scratchpad of intermediate reasoning steps, and the raw JSON outputs of the tool calls.
If a GKE node is preempted to balance cluster resources, or if an aggressive autoscaler shuts the pod down, that 5-minute reasoning loop vanishes into the ether. The user stares at a spinner indefinitely.
To prevent this, agent frameworks rely on Checkpointing. After every single node execution in the graph (e.g., after the LLM thinks, and after a tool executes), the framework serializes the entire state object into a JSON blob and commits it to a database checkpoint.
If the pod dies, another pod can simply pull the last checkpoint from the database, hydrate the Python object, and resume the graph execution exactly where it left off.
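The checkpoint-and-resume mechanic can be sketched in a few lines of plain Python. This is an illustrative stand-in, not a real framework API: a dict plays the role of the checkpoint database, and each "node" is just a function over the state object.

```python
import json

def run_with_checkpoints(nodes, store, thread_id, initial_state):
    # store: dict mapping thread_id -> list of JSON checkpoints
    # (a stand-in for the checkpoint database; all names here are illustrative)
    done = store.setdefault(thread_id, [])
    # Hydrate from the last committed checkpoint, or start fresh
    state = json.loads(done[-1]) if done else dict(initial_state)
    for node in nodes[len(done):]:        # skip steps already checkpointed
        state = node(state)
        done.append(json.dumps(state))    # commit state after every node
    return state
```

If the pod dies mid-graph, a fresh pod calling the same function with the same store resumes at the first uncommitted node instead of replaying the whole loop.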
This solves fault tolerance. But it introduces a devastating I/O bottleneck.
Why In-Memory Caches Choke
In the rush to deploy, engineering teams almost uniformly default to Redis for storing these graph checkpoints. Redis is fast, deeply integrated into Python ecosystems, and easy to deploy via a Helm chart.
But let’s do the math on the checkpoint payload.
An agent that has been conversing back and forth with a user for 30 minutes will accumulate a massive messages array. If the agent pulled in a 5,000-token context payload from a Vector Database, that payload is sitting inside the state object. A single checkpoint can easily exceed 500 kilobytes.
In a cyclic graph doing complex reasoning, an agent might transition through 10 nodes to answer a single user prompt. That means the agent is writing a 500KB JSON payload to Redis 10 times in a 30-second window.
Multiply that by 1,000 concurrent enterprise users. You are now attempting to synchronously slam a single-threaded Redis instance with gigabytes of write throughput per minute.
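Putting numbers on that, using the figures above:

```python
# Back-of-the-envelope checkpoint write load (figures from the text)
CHECKPOINT_BYTES = 500 * 1024   # ~500 KB per serialized graph state
NODES_PER_PROMPT = 10           # graph transitions per user prompt
WINDOW_SECONDS = 30             # time to traverse those nodes
CONCURRENT_USERS = 1000

total_bytes = CHECKPOINT_BYTES * NODES_PER_PROMPT * CONCURRENT_USERS
throughput_mb_s = total_bytes / WINDOW_SECONDS / (1024 ** 2)
print(f"{throughput_mb_s:.0f} MB/s of checkpoint writes")  # ~163 MB/s
```

That is roughly 10 GB of sustained write traffic per minute, before counting reads, replication, or retries.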
Your Redis cluster’s memory is decimated by the sheer volume of bloated conversational state. Synchronous checkpoint writes block the Python asyncio loops in the agent pods. Your throughput drops to zero, and the system collapses under the weight of its own memory.
You quickly realize you don’t actually have a Vector Database bottleneck. You have an ACID transaction bottleneck.
Designing the Stateful Agent Architecture
We have to architect Agent deployments not as standard web servers, but as highly concurrent transactional state machines.
Here is the blueprint for running robust, stateful agents on Google Cloud.
1. Shift from Synchronous HTTP to Asynchronous Pub/Sub
You cannot run long reasoning loops inside an HTTP request-response cycle. If the user hits a REST API endpoint like /api/chat, the HTTP connection will inevitably time out before the agent finishes its 10-step reasoning loop.
We must decouple ingestion from execution.
When the user submits a message, the frontend hits a lightweight, stateless API gateway (like Cloud Run). The gateway immediately publishes the message to a Google Cloud Pub/Sub topic (e.g., agent-inbox) and returns an HTTP 202 Accepted response with a thread_id to the frontend.
The frontend opens a WebSocket connection and subscribes to updates for that specific thread_id.
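The gateway's job is deliberately tiny. Here is a minimal sketch of that ingestion logic, with the publisher injected as a callable so the actual Pub/Sub client stays out of the picture; in production it would wrap something like google.cloud.pubsub_v1's publish call. The function and parameter names are illustrative:

```python
import uuid

def handle_chat(user_id: str, text: str, publish):
    """Stateless ingestion: enqueue the prompt and return immediately.

    `publish` stands in for a Pub/Sub publisher: any callable taking
    (topic, data, **attributes). No agent logic runs here.
    """
    thread_id = str(uuid.uuid4())
    publish("agent-inbox", text.encode("utf-8"),
            thread_id=thread_id, user_id=user_id)
    # The HTTP layer returns 202 Accepted with this body; the client
    # then listens on a WebSocket keyed by thread_id for the answer.
    return {"status": 202, "thread_id": thread_id}
```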
2. KEDA: Event-Driven Autoscaling
Your GKE pods should not be sitting around waiting for HTTP traffic. They should be subscribed to the agent-inbox Pub/Sub topic as queue workers.
We use KEDA (Kubernetes Event-driven Autoscaling). KEDA monitors the queue depth of the Pub/Sub topic. If 500 messages suddenly arrive, KEDA aggressively scales the GKE Deployment from 5 pods up to 50 pods.
Any available pod can pull a message off the queue. When a pod pulls a message, it uses the thread_id to fetch the checkpoint from the database, executes the graph logic until it hits a suspension point (like an LLM call or a Tool call), writes the new checkpoint to the database, and acknowledges (ACKs) the Pub/Sub message.
This pattern ensures that a pod crash only results in a Pub/Sub NACK, meaning the message goes back to the queue and another pod safely retries it. No state is lost. No HTTP connections time out.
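The ack/nack contract is the crux of that safety argument: ACK only after the checkpoint commit succeeds, NACK on any failure so Pub/Sub redelivers to another pod. A minimal sketch (the function and its parameters are hypothetical names, not a Pub/Sub API):

```python
def process(message, run_step):
    """Worker contract sketch.

    `message` mimics a Pub/Sub message (attributes, data, ack(), nack()).
    `run_step` fetches the checkpoint, advances the graph, and commits the
    new checkpoint; if any of that fails, the message must be redelivered.
    """
    try:
        run_step(message.attributes["thread_id"], message.data)
        message.ack()    # checkpoint committed: safe to drop the message
    except Exception:
        message.nack()   # crash or failed commit: let another pod retry
```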
3. Evicting Redis for Spanner
Now we address the database. We must move the checkpoint storage off an in-memory cache and onto a highly scalable, horizontally distributed relational database.
On GCP, the gold standard for this is Cloud Spanner.
Unlike a standard single-node PostgreSQL instance, which will lock up under massive write concurrency, Spanner distributes the transactional load horizontally across nodes.
Our database schema for the Checkpointer is brutalist and simple:
```sql
CREATE TABLE AgentStateCheckpoints (
  ThreadId STRING(36) NOT NULL,
  CheckpointId STRING(36) NOT NULL,
  Timestamp TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true),
  GraphState JSON NOT NULL,
  ParentCheckpointId STRING(36)
) PRIMARY KEY(ThreadId, CheckpointId);
```

We do not update existing rows. The state is immutable. Every time the graph advances a step, we INSERT a completely new row with the latest comprehensive GraphState JSON block and link it to its ParentCheckpointId.
This is Event Sourcing for AI Agents. If a user wants to “undo” an agent’s action and rewind the conversation to three steps ago, we simply fetch the older CheckpointId and resume the graph iteration from that historical row.
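An in-memory stand-in for that append-only log makes the rewind semantics concrete. The class and method names below are illustrative, not the Spanner client API; each "row" mirrors the schema's (ThreadId, CheckpointId, ParentCheckpointId, GraphState) shape:

```python
import json
import uuid

class CheckpointLog:
    """Append-only checkpoint log: rows are only ever inserted."""

    def __init__(self):
        self.rows = []  # (thread_id, checkpoint_id, parent_id, state_json)

    def append(self, thread_id, state, parent_id=None):
        cid = str(uuid.uuid4())
        self.rows.append((thread_id, cid, parent_id, json.dumps(state)))
        return cid

    def rewind(self, thread_id, checkpoint_id):
        # "Undo": resume from any historical row instead of the newest one
        for tid, cid, _, blob in self.rows:
            if tid == thread_id and cid == checkpoint_id:
                return json.loads(blob)
        raise KeyError(checkpoint_id)
```

Because nothing is ever overwritten, rewinding three steps is just reading an older row; the later rows remain available if the user changes their mind again.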
4. Optimistic Concurrency Control (OCC)
When you run agents concurrently across dozens of pods, you introduce race conditions.
Imagine a “Multi-Agent” setup where a Researcher Agent and a Writer Agent are collaborating on the exact same thread. The Researcher pod finds a fact and attempts to save its updated messages array to the Spanner database. At the exact same millisecond, the Writer pod generates a draft and attempts to save its updated messages array for the same thread.
If the database accepts both writes blindly, the Writer pod has just overwritten and permanently deleted the Researcher’s work. This is the “Lost Update” problem.
We must enforce Optimistic Concurrency Control.
When the agent framework attempts to write a new checkpoint to Spanner, it must include an expected_parent_checkpoint_id in the SQL transaction.
The transaction runs a conditional constraint: Ensure that the most recent row in the table for this ThreadId matches our expected parent.
If it does, the write commits successfully. If it does not - meaning another pod sneaked in a write while our pod was busy reasoning - the Spanner transaction fails with an OCC constraint violation error.
The agent framework catches this error, immediately re-fetches the latest state from the database, merges the new context, recalculates its action, and tries the write again.
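The compare-and-swap plus retry loop can be simulated entirely in memory. All names below are illustrative; a real implementation would run the parent check and the insert inside a single Spanner read-write transaction:

```python
import json

class ConcurrencyError(Exception):
    pass

class OCCStore:
    """Compare-and-swap on the latest checkpoint, mimicking the
    conditional Spanner transaction described above (in-memory sketch)."""

    def __init__(self):
        self.heads = {}  # thread_id -> (checkpoint_id, state_json)

    def latest(self, thread_id):
        return self.heads.get(thread_id)  # (id, state_json) or None

    def commit(self, thread_id, new_id, state, expected_parent_id):
        head = self.heads.get(thread_id)
        head_id = head[0] if head else None
        if head_id != expected_parent_id:   # another pod wrote first
            raise ConcurrencyError(head_id)
        self.heads[thread_id] = (new_id, json.dumps(state))

def write_with_retry(store, thread_id, step, new_id_fn, max_retries=5):
    """Re-fetch latest state, recompute, and retry on OCC conflicts."""
    for _ in range(max_retries):
        head = store.latest(thread_id)
        parent_id, state = (head[0], json.loads(head[1])) if head else (None, {})
        try:
            store.commit(thread_id, new_id_fn(), step(state),
                         expected_parent_id=parent_id)
            return
        except ConcurrencyError:
            continue  # stale parent: loop re-fetches and recomputes
    raise RuntimeError("gave up after retries")
```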
This transactional rigor is what separates a fragile hackathon demo from an enterprise-grade banking agent.
5. Putting it Together with Google ADK
Manually writing Spanner OCC transactions inside every graph node is tedious. This is where Google’s Agent Development Kit (ADK) shines.
The ADK abstracts the complexity of state management, allowing you to seamlessly inject a distributed checkpointer into your agent’s core loop. When the agent acts, the ADK handles the state hydration, transaction boundaries, and optimistic concurrency retries automatically.
Here is what the actual queue worker code looks like in production:
```python
from google.adk import Agent
from google.adk.storage.spanner import SpannerCheckpointer
from my_tools import run_sql_query, fetch_market_data

# 1. Initialize the distributed Spanner Checkpointer
checkpointer = SpannerCheckpointer(
    instance_id="gcp-agent-cluster",
    database_id="state-db",
    table_name="AgentStateCheckpoints"
)

# 2. Define the Agent, injecting the Spanner backend for memory
analyst_agent = Agent(
    name="Financial Analyst",
    model="gemini-2.5-pro",
    tools=[run_sql_query, fetch_market_data],
    checkpointer=checkpointer
)

# 3. The KEDA-scaled Pub/Sub Worker Loop
async def process_pubsub_message(message):
    thread_id = message.attributes.get("thread_id")
    user_prompt = message.data.decode("utf-8")
    try:
        # The ADK automatically fetches the historical Spanner state using
        # the thread_id, appends the new prompt, triggers the LLM reasoning
        # loop, executes tools, and runs the Spanner OCC transaction to
        # safely commit the new state.
        response = await analyst_agent.run(
            prompt=user_prompt,
            thread_id=thread_id
        )

        # Broadcast the successful result to the user's connected WebSocket
        await broadcast_websocket_reply(thread_id, response.text)

        # Acknowledge the message to remove it from the Pub/Sub queue
        message.ack()
    except checkpointer.ConcurrencyError:
        # OCC Exception: Another pod beat us to the write on this exact
        # thread. NACK the message so it goes back to the queue and gets
        # cleanly retried with the newly updated state.
        message.nack()
```

This single abstraction beautifully isolates your agent’s reasoning logic from the brutal realities of distributed systems engineering.
The Operational Reality
Building agentic infrastructure forces us to confront the reality that AI is no longer a stateless math equation. It is a long-running, deeply stateful software system.
If you treat memory as an afterthought, throwing dictionaries into a Redis container out of convenience, your system will fracture under load. If you lean on the heavy machinery of distributed systems engineering - Pub/Sub queues, event-driven autoscalers, and globally consistent transactional databases - you unlock the ability to orchestrate thousands of autonomous agents communicating flawlessly at scale.
We spent the last decade tearing down monolithic state servers to build stateless microservices. Generative AI forces us to build stateful systems all over again, but this time, the state we are managing isn’t a user profile. It is the active, working memory of a silicon intellect. Take it seriously.



