· Agentic AI  · 5 min read

Handling Context Window Limits in Multi-Agent Loops

Architectural patterns for summarizing, pruning, and passing context between collaborative subagents without hitting context-length errors.

Key Takeaways
  • Passing the full conversation history to every subagent in a loop will rapidly exhaust your context window and degrade reasoning quality.
  • The "Summarizer Agent" pattern acts as a garbage collector, compressing raw transcripts into structured state before passing it down the chain.
  • Implementing a semantic memory layer allows agents to fetch only the relevant context for the current step, rather than holding everything in active memory.
  • State management frameworks like LangGraph require explicit pruning functions to maintain stability in infinite or long-running loops.

One of the most common mistakes engineering teams make when building their first multi-agent system is treating the context window like an infinite hard drive. They build a loop in LangGraph or AutoGen, and at the end of every node, they simply append the latest output to a massive messages array.

For the first few loops, everything is magical. The agents are communicating, the plan is executing, and the system feels alive. But around loop 15, the API starts returning 400 errors. You have blown past the 128k or 256k token limit. Even worse, long before you hit the hard error, the model starts hallucinating. It loses the plot, fixates on irrelevant details from loop 3, and gets trapped in a repetitive spiral.

This is the reality of the “Lost in the Middle” phenomenon. Just because a model can ingest 1 million tokens does not mean it can effectively reason across them in a single pass. When architecting multi-agent loops, you must design explicit mechanisms for context pruning, summarization, and state compression.

The Context Balloon Problem

To understand the solution, we need to look at how state is passed in a standard agentic loop.

In a framework like LangGraph, you define a graph where nodes are agents and edges are the transitions between them. The state object is passed from node to node. The default configuration usually involves a messages key, which uses an append-only reducer.

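# The classic trap: an append-only state definition
import operator
from typing import Annotated, Sequence, TypedDict
from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], operator.add]  # grows on every node
    current_task: str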

If Agent A writes a 2000-token research report, and Agent B writes a 1500-token critique, that is 3500 tokens added to the state. When control passes back to Agent A for revisions, it has to process the entire 3500-token history, plus its own system prompt, plus its new output. The history grows linearly with every transition, but because each call re-reads all of it, the total number of tokens processed grows quadratically with the number of loops.

This is the Context Balloon. It inflates until it pops your API budget and your model’s attention mechanism.
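To put rough numbers on the balloon, the sketch below simulates the two-agent report/critique loop under an append-only state. The 500-token system prompt, the 15-loop horizon, and the assumption that every turn produces the same output size are illustrative choices, not measurements.

```python
def simulate_append_only(turn_sizes, system_prompt_tokens=500, loops=15):
    """Return (history_tokens, cumulative_input_tokens) after `loops` cycles.

    turn_sizes: tokens each agent appends per turn, in order (assumed constant).
    """
    history = 0            # tokens currently stored in the messages list
    cumulative_input = 0   # total input tokens sent across all model calls
    for _ in range(loops):
        for output_tokens in turn_sizes:
            # Every call re-reads the system prompt plus the whole history.
            cumulative_input += system_prompt_tokens + history
            history += output_tokens
    return history, cumulative_input


history, billed = simulate_append_only([2000, 1500])
print(history)  # 52500 tokens sitting in state after 15 loops
print(billed)   # 780000 input tokens processed to get there
```

Under these assumptions the stored history is only 52,500 tokens, yet the system has paid to read 780,000 input tokens along the way. Larger outputs, tool results, or more agents per loop push both numbers up quickly.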

Pattern 1: The Summarizer Agent

The most robust way to handle this is to treat context the way a managed runtime treats heap memory. You need a garbage collector. In multi-agent systems, this takes the form of a Summarizer Agent.

Instead of passing the raw message history directly back to the planning node, you route the execution graph through a dedicated node whose sole job is compression. This node takes the last N messages, extracts the concrete decisions, actions taken, and current blockers, and overwrites the active history with a dense, structured summary.

Explainer Diagram: A state machine flowchart demonstrating the Summarizer pattern. Raw interaction logs from Worker nodes are routed to a Compression Node, which distills the data into a strict JSON schema update before passing control back to the Orchestrator.

In your LangGraph state, you stop using the append-only operator.add for the entire history. Instead, you maintain separate keys: recent_messages (which is cleared after every major cycle) and running_summary (which holds the compressed state), alongside a resolved_tasks list of completed work.

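# A resilient state definition
from typing import TypedDict
from langchain_core.messages import BaseMessage

class CompressedState(TypedDict):
    recent_messages: list[BaseMessage]  # No reducer: overwritten every cycle
    running_summary: str                # Dense summary of everything older
    resolved_tasks: list[str]           # Tasks already completed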

When the Summarizer runs, it looks at recent_messages, updates running_summary, and then empties recent_messages. As long as the summary itself is held to a fixed length budget, the token payload passed to the heavy reasoning agents remains relatively constant, no matter how many loops the system executes.
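A minimal sketch of such a compression node, written as a plain Python function so it runs without any framework installed. Several things here are assumptions for illustration: plain strings stand in for BaseMessage objects, `summarize` is a hypothetical callable that would normally wrap an LLM call, and `naive_summarize` is a toy replacement that simply truncates each message. In a graph framework, a function of this shape would typically be registered as a node that returns only the keys it wants to overwrite.

```python
from typing import Callable, TypedDict


class CompressedState(TypedDict):
    recent_messages: list[str]   # plain strings stand in for BaseMessage here
    running_summary: str
    resolved_tasks: list[str]


def make_summarizer_node(summarize: Callable[[str, list[str]], str]):
    """Build a node that folds recent_messages into running_summary.

    `summarize` is a hypothetical callable (normally an LLM call) that takes
    the previous summary and the new messages and returns an updated summary.
    """
    def summarizer_node(state: CompressedState) -> dict:
        summary = state["running_summary"]
        if state["recent_messages"]:
            summary = summarize(summary, state["recent_messages"])
        # Return only the keys that change; recent_messages is reset to empty.
        return {"running_summary": summary, "recent_messages": []}
    return summarizer_node


def naive_summarize(previous: str, messages: list[str]) -> str:
    """Toy stand-in for an LLM: keep the first 40 characters of each message."""
    digest = "; ".join(m[:40] for m in messages)
    return f"{previous} | {digest}" if previous else digest


node = make_summarizer_node(naive_summarize)
state: CompressedState = {
    "recent_messages": [
        "Researcher: drafted report on auth options",
        "Critic: section 2 lacks sources",
    ],
    "running_summary": "",
    "resolved_tasks": [],
}
update = node(state)
state = {**state, **update}  # mimics an overwrite-style state merge
print(state["running_summary"])
```

A real summarizer prompt should also enforce a maximum summary length; otherwise the summary becomes a slower-growing version of the same balloon.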

Pattern 2: Semantic Memory Fetching

Summarization works well for sequential logic, but what if an agent on loop 20 needs a highly specific piece of code generated back in loop 2? A running summary will likely drop those fine-grained details to save space.

This is where you implement a semantic memory layer.

Instead of keeping the code in the active context window, the agent that generated the code must explicitly save it to an external vector database or a structured key-value store (like Redis). We covered this in detail in our implementation of the Blackboard Architecture.

When the downstream agent needs that code, it doesn’t look in the conversation history. It uses a tool call: fetch_memory(query="auth service implementation"). The active context window is kept entirely free of static data. It only holds the pointers and the current reasoning steps.
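The sketch below shows the shape of that save-then-fetch flow. It is an in-process stand-in, not a real vector database: `MemoryStore`, `save_memory`, and the storage keys are hypothetical names, and word-overlap (Jaccard) scoring replaces embedding similarity so the example runs with the standard library alone.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


def _tokens(text: str) -> set:
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


@dataclass
class MemoryStore:
    """In-process stand-in for an external vector DB or key-value store."""
    items: dict = field(default_factory=dict)  # key -> (description, content)

    def save_memory(self, key: str, description: str, content: str) -> str:
        self.items[key] = (description, content)
        return key  # the short pointer the agent keeps in its context

    def fetch_memory(self, query: str) -> Optional[str]:
        """Return the content whose description best overlaps the query."""
        q = _tokens(query)
        best_key, best_score = None, 0.0
        for key, (description, _content) in self.items.items():
            d = _tokens(description)
            union = q | d
            score = len(q & d) / len(union) if union else 0.0  # Jaccard similarity
            if score > best_score:
                best_key, best_score = key, score
        return self.items[best_key][1] if best_key is not None else None


store = MemoryStore()
# Loop 2: the generating agent saves its artifact instead of leaving it in history.
store.save_memory("loop2/auth", "auth service implementation in Python", "def login(user): ...")
store.save_memory("loop5/db", "database migration script", "ALTER TABLE users ...")

# Loop 20: a downstream agent pulls back only what it needs.
code = store.fetch_memory(query="auth service implementation")
print(code)
```

In production the scoring function would be replaced by an embedding lookup, but the contract stays the same: the agent's context holds a short key or query, and the bulky artifact lives outside the window.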

This mirrors human cognition. You do not keep the entire text of a book you read last year in your working memory. You remember the concepts, and you know how to query a search engine or look at your bookshelf when you need the exact quote.

Enforcing Strict Boundaries

Building resilient multi-agent systems requires a mental shift. You are no longer just writing prompts; you are managing a highly volatile data pipeline.

Every time you add a new subagent to your graph, you must ask yourself: What is the minimum viable context this agent needs to execute its specific task? A Python execution agent does not need to know the strategic business justification for the application it is building. It only needs the function signature and the current error trace.

Filter the state before you pass it. Use selector functions on your graph edges to strip out irrelevant keys. If you enforce strict boundaries on what each agent can see, your token costs will plummet, your latency will drop, and your multi-agent loops will finally run stably in production.
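One way to sketch such a selector is an allow-list keyed by agent name. The agent names, state keys, and the `VISIBLE_KEYS` table below are all hypothetical; the point is only that each agent receives a filtered copy of the state rather than the whole object.

```python
from typing import Any, Mapping

# Hypothetical allow-list: which state keys each agent is permitted to see.
VISIBLE_KEYS = {
    "python_executor": {"function_signature", "error_trace"},
    "planner": {"running_summary", "resolved_tasks", "current_task"},
}


def select_context(state: Mapping[str, Any], agent_name: str) -> dict:
    """Return only the state keys the named agent is allowed to see."""
    allowed = VISIBLE_KEYS.get(agent_name, set())  # unknown agents see nothing
    return {key: value for key, value in state.items() if key in allowed}


full_state = {
    "running_summary": "Building a billing service for the EU launch.",
    "business_justification": "Expected to reduce churn next quarter.",
    "function_signature": "def compute_invoice(order: dict) -> float",
    "error_trace": "KeyError: 'tax_rate'",
}
executor_view = select_context(full_state, "python_executor")
print(sorted(executor_view))
```

Defaulting unknown agents to an empty view is a deliberate choice in this sketch: a newly added subagent sees nothing until someone decides what its minimum viable context is.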
