One of the most frustrating errors in Agent Engineering is:
BadRequestError: 400 Context Window Exceeded.
It usually happens at the worst possible time—right when your agent is 50 steps into a complex task and has built up a valuable history of reasoning.
As we scaled our agents to handle deeper research tasks, this error became our #1 blocker. This post details how we solved it through a multi-layered defense strategy.
The Paradigm Shift: Why Now?
Prior to adopting LangGraph 1.0, our systems primarily ran on fixed Workflows. In a workflow, you (the engineer) control the state. You query the database, you format the prompt. The context size is predictable.
However, when we pivoted to the ReAct Agent pattern (Agent + Tools), everything changed.
In this world, the Agent is in the driver’s seat. It might decide to read a 50-page PDF, then search for 20 images, then list 1000 files.
The “Context” became dynamic and volatile, and managing it became a survival necessity rather than an optimization.
The “Invisible” Token Problem
Managing this dynamic context is harder than it looks, especially with Claude.
1. No “Tiktoken” for Claude
OpenAI has tiktoken, allowing precise offline token counting. Claude does not.
LangGraph’s default fallback for estimating tokens is rough: len(repr(string)) / 4.
This is often “good enough” for text, but catastrophic for Multimodal Content.
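The fallback amounts to a few lines. Here is a sketch (`approx_tokens` is a hypothetical helper for illustration, not LangGraph's actual implementation):

```python
def approx_tokens(text: str) -> int:
    """Rough offline estimate: assume ~4 characters per token."""
    return len(text) // 4

# A 400-character paragraph is estimated at roughly 100 tokens.
print(approx_tokens("a" * 400))  # 100
```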
2. The Multimodal “Black Hole”
Consider an Agent analyzing a video. The message payload might look like this:
{"type": "image_url", "url": "https://..."}
String length: ~50 characters. Actual Token Cost: 1,000+ tokens.
LangGraph’s default counter sees 50 chars and thinks “Safe!”. The API sees 1000 tokens and crashes. This discrepancy caused our agents to frequently hit “Context Window Exceeded” errors that our pre-flight checks completely missed.
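One way to close this gap is to charge every image part a fixed token budget instead of counting its characters. Below is a minimal sketch; `estimate_tokens` and `IMAGE_TOKEN_BUDGET` are illustrative names, and the real per-image cost varies with resolution:

```python
IMAGE_TOKEN_BUDGET = 1_000  # assumed flat per-image cost; tune for your provider

def estimate_tokens(messages: list[dict]) -> int:
    """Count text at ~4 chars/token, but charge image parts a fixed budget."""
    total = 0
    for msg in messages:
        content = msg["content"]
        if isinstance(content, str):
            total += len(content) // 4
            continue
        for part in content:
            if part.get("type") == "image_url":
                total += IMAGE_TOKEN_BUDGET  # not its ~50-char URL length
            else:
                total += len(str(part.get("text", ""))) // 4
    return total

msgs = [{"role": "user",
         "content": [{"type": "image_url", "url": "https://example.com/f.png"}]}]
print(estimate_tokens(msgs))  # 1000
```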
Layer 1: Isolation & Interception
We realized we couldn’t just “stuff everything into the prompt.” We needed to isolate heavy data.
Large Tool Output Interception
We utilized the FilesystemMiddleware to act as a gatekeeper.
If a tool (e.g., read_file, ls) returns a result larger than a threshold (e.g., 20KB), it is intercepted and not returned as raw text to the Agent.
Instead, the system:
- Saves the content to a large_tool_results/ directory.
- Returns the File Path to the Agent.
# The Agent sees this instead of the raw data
ToolOutput: "Result saved to /tmp/large_results/data_chunk_42.txt. Use 'read_file' to inspect specific lines."
This forces the Agent to be economical—reading only the header or writing code to process the file—rather than polluting its context window.
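The interception logic reduces to a size check and a redirect. This is a rough sketch (`intercept_tool_output` and the file-naming scheme are hypothetical; the real FilesystemMiddleware behavior may differ):

```python
import tempfile
from pathlib import Path

SIZE_THRESHOLD = 20 * 1024  # 20KB, matching the threshold above

def intercept_tool_output(tool_result: str, results_dir: Path) -> str:
    """Replace oversized tool results with a file path the agent can read selectively."""
    if len(tool_result.encode()) <= SIZE_THRESHOLD:
        return tool_result  # small results pass through untouched
    results_dir.mkdir(parents=True, exist_ok=True)
    path = results_dir / f"result_{len(list(results_dir.iterdir()))}.txt"
    path.write_text(tool_result)
    return f"Result saved to {path}. Use 'read_file' to inspect specific lines."

results_dir = Path(tempfile.mkdtemp(prefix="large_tool_results_"))
```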
Sub-Agent Isolation
For “Exploratory” tasks (like browsing 20 websites), we delegate to a Sub-Agent. The Sub-Agent burns its own context window, summarizes the findings, and returns only the summary to the Master Agent.
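Stripped of the agent framework, the delegation pattern is simple: only the summary ever re-enters the master's context. `run_subagent`, `explore`, and `summarize` below are illustrative stand-ins for the actual sub-agent invocation:

```python
from typing import Callable

def run_subagent(task: str,
                 explore: Callable[[str], str],
                 summarize: Callable[[str], str]) -> str:
    """Delegate exploration to a sub-agent; return only a compact summary."""
    raw_findings = explore(task)   # burns the sub-agent's own context window
    return summarize(raw_findings)  # the master agent sees just this
```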
Layer 2: Dynamic Correction & Pre-emptive Summarization
Since we can’t count Claude tokens perfectly offline, we use a Self-Correcting Estimator that triggers cleanup before we hit the limit.
- Baseline Alignment: We track the usage_metadata returned by the API after every turn to re-align our baseline.
- The 170k Threshold: For a 200k context window, we set a proactive threshold at around 170k tokens.
- Pre-emptive Summarization: If our estimate (Baseline + New Delta) exceeds this threshold, the middleware automatically triggers a summarization of the middle-history before sending the request to the LLM.
This proactive approach handles over 90% of context issues, ensuring the LLM always receives a manageable payload.
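The estimator's core bookkeeping can be sketched as a small class (`TokenBudget` is a hypothetical name; the usage_metadata dict mirrors what the API reports back each turn):

```python
CONTEXT_LIMIT = 200_000
THRESHOLD = 170_000  # the proactive threshold described above

class TokenBudget:
    """Anchor to API-reported usage, then add estimated deltas on top."""

    def __init__(self) -> None:
        self.baseline = 0

    def align(self, usage_metadata: dict) -> None:
        # Re-anchor to ground truth after every completed turn.
        self.baseline = usage_metadata.get("total_tokens", self.baseline)

    def should_summarize(self, estimated_delta: int) -> bool:
        # Trigger pre-emptive summarization before the request is sent.
        return self.baseline + estimated_delta > THRESHOLD
```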
Layer 3: Exception-Catching & Emergency Recovery
Despite preventive measures, estimations can still drift—especially with complex multimodal inputs. This is where our Exception-Handling Middleware acts as the final safety net.
If the estimate drifts too far and the LLM returns a 400 error anyway, the middleware catches the exception, performs an Emergency Summarization, and replays the request.
This two-stage protection (Proactive Check + Reactive Catch) makes context management virtually bulletproof.
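A minimal version of the reactive catch looks like this, with `ContextWindowError` standing in for your provider SDK's actual 400 error class:

```python
class ContextWindowError(Exception):
    """Stand-in for the provider's 400 context-overflow error."""

def invoke_with_recovery(call_llm, messages, summarize_history):
    """Proactive checks can drift; catch the overflow and retry once, summarized."""
    try:
        return call_llm(messages)
    except ContextWindowError:
        compact = summarize_history(messages)  # Emergency Summarization
        return call_llm(compact)               # replay the request
```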
Layer 4: UX & Optimization
Anthropic Prompt Caching
We leverage Anthropic's Prompt Caching (added in January) to mitigate the cost of these large contexts. By keeping our System Prompt and early history static, we consistently hit the cache, making recovery faster and cheaper.
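With the Anthropic Messages API, caching is opted into by attaching cache_control markers to static content blocks. A sketch of a cacheable system prompt (the prompt text is illustrative):

```python
# Hypothetical request fragment: mark the static system prompt as cacheable
# so repeated turns reuse the cached prefix instead of re-paying for it.
system_blocks = [
    {
        "type": "text",
        "text": "You are a deep-research agent. Follow the tool protocol below...",
        "cache_control": {"type": "ephemeral"},  # cache boundary: everything up to here
    }
]
```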
Frontend Rollback
Finally, if the Agent gets stuck in a loop or the context is irretrievably broken, we give control to the user. Our UI allows users to Rollback to a previous checkpoint—effectively “rewinding time” to before the overflow occurred—and manually trigger a summary or change the instruction.
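Checkpoint rollback reduces to snapshotting state and truncating history. This toy `CheckpointStore` only illustrates the idea; a production checkpointer (such as LangGraph's) persists snapshots to a backend and replays from a checkpoint id:

```python
class CheckpointStore:
    """Minimal in-memory checkpoint history with rewind (illustration only)."""

    def __init__(self) -> None:
        self._history: list[dict] = []

    def save(self, state: dict) -> None:
        self._history.append(dict(state))  # snapshot a copy of the state

    def rollback(self, steps: int = 1) -> dict:
        # Rewind N steps: drop later snapshots, return the restored state.
        del self._history[len(self._history) - steps:]
        return dict(self._history[-1])
```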
Conclusion
Reliability in Agentic systems isn’t about perfectly predicting token usage; it’s about building a robust system that can Isolate heavy data, Correct its own estimates, and Heal itself when limits are breached.