What Is Context Management in AI Agents?
Quick Definition#
Context management is the set of techniques for controlling what information occupies an AI agent's context window across multiple reasoning steps. As agents run multi-step tasks, their context accumulates conversation history, tool results, intermediate findings, and system instructions. Without management, the context window fills up or becomes so noisy that the agent loses focus. Good context management ensures the agent always has the most relevant information available — no more, no less.
Browse all AI agent terms in the AI Agent Glossary. For the context window limits being managed, see Context Window. For persistent storage beyond the window, see AI Agent Memory.
Why Context Management Matters#
Every LLM has a fixed context window — a maximum number of tokens it can process in one call. For GPT-4o, it is 128K tokens. For Claude, up to 200K. This sounds large, but long-running agents can exhaust it:
- A research agent searching 20 web pages accumulates 50K–100K tokens of raw content
- A multi-day coding project has hundreds of messages and code blocks
- A customer support agent carries rich conversation history and product documentation into every call
Beyond raw token limits, context quality degrades as it grows. Research has shown that LLMs attend less reliably to information in the middle of very long contexts (the "lost in the middle" problem). An agent with a cluttered 100K-token context will often perform worse than one with a focused 20K-token context on the same task.
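The arithmetic is easy to sketch. Here is a minimal pre-flight check using the common ~4-characters-per-token heuristic — an approximation, not a real tokenizer, and the function names are illustrative:

```python
def rough_token_count(text: str) -> int:
    # Common heuristic: roughly 4 characters per token for English text
    return len(text) // 4

def will_exhaust_window(chunks: list[str], window: int = 128_000,
                        reserve: int = 4_000) -> bool:
    """True if the accumulated chunks would crowd out the reserved response budget."""
    total = sum(rough_token_count(c) for c in chunks)
    return total > window - reserve

# 20 scraped pages of ~7K tokens each: past a 128K window even before
# counting the system prompt and conversation history
pages = ["x" * 28_000] * 20
print(will_exhaust_window(pages))  # True
```

A check like this, run before each model call, is often the first piece of context management an agent acquires.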
Core Context Management Strategies#
1. Selective Retention#
Only keep tool results and context that are still relevant to remaining steps:
```python
class SelectiveContextManager:
    def __init__(self, remaining_steps: list[str]):
        self.remaining_steps = remaining_steps

    def filter_context(self, accumulated_results: list[dict]) -> list[dict]:
        """Keep only results relevant to remaining work."""
        relevant = []
        for result in accumulated_results:
            # Keep this result if any keyword from a remaining step matches its tags
            if any(keyword in result.get("tags", [])
                   for step in self.remaining_steps
                   for keyword in step.split()):
                relevant.append(result)
        return relevant
```
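A self-contained sketch of the same keyword-matching idea — the tags and step strings here are made up for illustration:

```python
def filter_results(results: list[dict], remaining_steps: list[str]) -> list[dict]:
    """Keep a result only if a keyword from some remaining step matches one of its tags."""
    keywords = {kw for step in remaining_steps for kw in step.lower().split()}
    return [r for r in results
            if keywords & {t.lower() for t in r.get("tags", [])}]

results = [
    {"content": "Pricing table for competitor X", "tags": ["pricing"]},
    {"content": "Founding history of competitor X", "tags": ["history"]},
]
kept = filter_results(results, ["compare pricing plans"])
print([r["tags"] for r in kept])  # [['pricing']] — the history result is dropped
```

Keyword overlap is crude; production agents often let the model itself decide which results remain relevant, but the retention pattern is the same.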
2. Summarization#
Compress old content into condensed summaries when it is still needed but takes too many tokens:
```python
from anthropic import Anthropic

client = Anthropic()

def summarize_tool_results(results: list[str], max_tokens: int = 500) -> str:
    """Compress multiple tool results into a concise summary."""
    combined = "\n\n".join(results)
    # Only summarize if the content is large
    if len(combined.split()) < 200:
        return combined
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=max_tokens,
        messages=[{
            "role": "user",
            "content": f"""Summarize these research findings into a concise digest.
Preserve all key facts, names, numbers, and conclusions.

Content to summarize:
{combined}"""
        }],
    )
    return response.content[0].text

class SummarizingAgent:
    def __init__(self, summarize_after: int = 5):
        self.results = []
        self.summarized_context = ""
        self.summarize_after = summarize_after

    def add_result(self, result: str):
        self.results.append(result)
        # Summarize old results when the buffer fills
        if len(self.results) >= self.summarize_after:
            self.summarized_context += "\n" + summarize_tool_results(self.results)
            self.results = []  # Clear the buffer after summarizing

    def get_context(self) -> str:
        """Return the current working context."""
        parts = []
        if self.summarized_context:
            parts.append(f"Summary of prior work:\n{self.summarized_context}")
        if self.results:
            parts.append("Recent findings:\n" + "\n".join(self.results))
        return "\n\n".join(parts)
```
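The buffer-flush behavior can be exercised without an API call by injecting a stub summarizer — a toy stand-in for the LLM call above:

```python
class SummarizingBuffer:
    """Same flush pattern as SummarizingAgent, with the summarizer injected
    so the buffering logic runs without an LLM call."""
    def __init__(self, summarize, flush_after: int = 3):
        self.summarize = summarize
        self.flush_after = flush_after
        self.results: list[str] = []
        self.summary = ""

    def add(self, result: str):
        self.results.append(result)
        if len(self.results) >= self.flush_after:
            # Fold the full buffer into the running summary, then clear it
            self.summary += "\n" + self.summarize(self.results)
            self.results = []

stub = lambda rs: f"[summary of {len(rs)} results]"  # Stand-in for the LLM
buf = SummarizingBuffer(stub, flush_after=3)
for i in range(4):
    buf.add(f"finding {i}")
print(buf.summary.strip())  # [summary of 3 results]
print(buf.results)          # ['finding 3'] — the unflushed remainder
```

Injecting the summarizer also makes the flush logic unit-testable, which is hard when the summarization step is welded to a live API client.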
3. Sliding Window#
Maintain a rolling window of the most recent N messages:
```python
class SlidingWindowContext:
    def __init__(self, max_messages: int = 20, always_keep: int = 3):
        self.messages = []
        self.max_messages = max_messages
        # Always keep the first N messages (system prompt, initial task)
        self.always_keep = always_keep

    def add_message(self, message: dict):
        self.messages.append(message)
        self._trim()

    def _trim(self):
        if len(self.messages) <= self.max_messages:
            return
        # Keep the first always_keep messages plus the most recent messages
        pinned = self.messages[:self.always_keep]
        recent = self.messages[self.always_keep:][-(self.max_messages - self.always_keep):]
        self.messages = pinned + recent

    def get_messages(self) -> list[dict]:
        return self.messages
```
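The trim rule is easier to see end-to-end in a compact functional form; integers stand in for message dicts here:

```python
def sliding_window(messages: list, max_messages: int = 5, always_keep: int = 2) -> list:
    """Pinned-prefix sliding window: keep the first always_keep items,
    then the most recent items up to max_messages total."""
    if len(messages) <= max_messages:
        return messages
    return messages[:always_keep] + messages[-(max_messages - always_keep):]

msgs = list(range(10))
print(sliding_window(msgs))  # [0, 1, 7, 8, 9]
```

The pinned prefix is what distinguishes this from naive truncation: without it, a long run would eventually evict the system prompt and the original task statement.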
4. Retrieval-Augmented Context Injection#
Store information in a vector database and retrieve only what is relevant to the current step:
```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

class RetrievalContextManager:
    def __init__(self):
        self.vectorstore = Chroma(embedding_function=OpenAIEmbeddings())

    def store_result(self, content: str, metadata: dict | None = None):
        """Store a tool result for later retrieval."""
        self.vectorstore.add_texts([content], metadatas=[metadata or {}])

    def retrieve_relevant(self, current_query: str, k: int = 3) -> list[str]:
        """Retrieve the most relevant stored results for the current reasoning step."""
        results = self.vectorstore.similarity_search(current_query, k=k)
        return [doc.page_content for doc in results]

    def build_step_context(self, current_task: str) -> str:
        """Build focused context for the current step by retrieving what is relevant."""
        relevant_results = self.retrieve_relevant(current_task)
        return "\n\n".join(relevant_results)
```
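To illustrate the pattern without a vector database, here is a toy retriever that ranks stored results by word overlap with the query — a crude stand-in for embedding similarity, with made-up sample data:

```python
def retrieve_relevant(store: list[str], query: str, k: int = 2) -> list[str]:
    """Toy retrieval: rank stored results by word overlap with the query.
    A real system would use embedding similarity, as with Chroma above."""
    q = set(query.lower().split())
    scored = sorted(store,
                    key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

store = [
    "competitor pricing starts at 49 dollars per seat",
    "the company was founded in 2014 in Berlin",
    "enterprise pricing requires an annual contract",
]
print(retrieve_relevant(store, "pricing per seat", k=2))
# The two pricing results rank above the founding-history result
```

The structural point is the same either way: the agent's context at each step contains only the top-k stored items relevant to that step, not everything it has ever gathered.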
5. Context in LangGraph#
LangGraph provides explicit state-based context control:
```python
from typing import Annotated, List, TypedDict

from langgraph.graph import StateGraph

def append_or_reset(existing: List[str], new: List[str] | None) -> List[str]:
    """Reducer: append new results, or clear the list when a node returns None.
    (A plain operator.add reducer can only append, so it cannot clear.)"""
    if new is None:
        return []
    return existing + new

class ResearchState(TypedDict):
    query: str
    # Accumulated results; a node returns None to clear them
    raw_results: Annotated[List[str], append_or_reset]
    # Replace the summary on each update
    context_summary: str
    # Track what we still need
    remaining_tasks: List[str]

def compress_context_node(state: ResearchState) -> dict:
    """Summarize raw_results when they get large."""
    if sum(len(r) for r in state["raw_results"]) > 10_000:
        summary = summarize_tool_results(state["raw_results"])
        return {
            "context_summary": state["context_summary"] + "\n" + summary,
            "raw_results": None,  # Signal the reducer to clear raw results
        }
    return {}
```
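The size-threshold rule in compress_context_node can be checked without LangGraph or an LLM call by extracting it into a plain function with an injected summarizer (names here are hypothetical):

```python
def compress_if_large(raw_results: list[str], context_summary: str,
                      summarize, threshold: int = 10_000):
    """Same size-threshold rule as compress_context_node, framework-free.
    Returns the (possibly updated) summary and the (possibly cleared) buffer."""
    if sum(len(r) for r in raw_results) > threshold:
        return context_summary + "\n" + summarize(raw_results), []
    return context_summary, raw_results

stub = lambda rs: f"[digest of {len(rs)} results]"  # Stand-in for the LLM call
print(compress_if_large(["x" * 6000, "y" * 6000], "", stub))
print(compress_if_large(["short"], "prior", stub))  # ('prior', ['short'])
```

Keeping the threshold logic pure like this makes it easy to tune the cutoff against recorded traces before wiring it into a graph.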
Context Budget Monitoring#
Proactively monitor token usage to avoid hitting limits:
```python
import tiktoken

def estimate_tokens(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def build_context_within_budget(
    messages: list[dict],
    system_prompt: str,
    budget: int = 100_000,
) -> list[dict]:
    """Trim message history to fit within a token budget."""
    # Nothing to trim — also avoids duplicating a lone message as both first and last
    if len(messages) < 2:
        return messages
    system_tokens = estimate_tokens(system_prompt)
    available = budget - system_tokens - 2000  # Reserve 2K tokens for the response
    # Always include the first and last messages
    first = messages[0]
    last = messages[-1]
    middle = messages[1:-1]
    remaining_budget = available - estimate_tokens(str(first)) - estimate_tokens(str(last))
    # Add as many middle messages as the budget allows (most recent first)
    included_middle = []
    for msg in reversed(middle):
        tokens = estimate_tokens(str(msg))
        if tokens > remaining_budget:
            break
        included_middle.insert(0, msg)
        remaining_budget -= tokens
    return [first] + included_middle + [last]
```
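The keep-first-and-last policy can be demonstrated with a pluggable counter — whitespace word count stands in for tiktoken here so the sketch runs with no dependencies:

```python
def trim_to_budget(messages: list[str], budget: int,
                   count=lambda m: len(m.split())) -> list[str]:
    """Same policy as build_context_within_budget: pin the first and last
    messages, then fill backward from the most recent middle messages."""
    if len(messages) < 2:
        return messages
    remaining = budget - count(messages[0]) - count(messages[-1])
    kept = []
    for msg in reversed(messages[1:-1]):  # Most recent middle messages first
        c = count(msg)
        if c > remaining:
            break
        kept.insert(0, msg)
        remaining -= c
    return [messages[0]] + kept + [messages[-1]]

msgs = ["task brief", "a a a a a", "b b b", "c c", "final question"]
print(trim_to_budget(msgs, budget=11))
# ['task brief', 'b b b', 'c c', 'final question'] — the oldest middle message is dropped
```

Filling backward means that when the budget runs out, it is always the oldest middle messages that fall away, never the task statement or the latest turn.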
Common Misconceptions#
Misconception: larger context windows eliminate the need for context management. Even with 200K-token windows, context management remains important. The "lost in the middle" attention problem means that models attend less reliably to information buried in large contexts. A focused 30K-token context often outperforms an unmanaged 150K-token one on the same task.
Misconception: summarization always loses information. Good summarization retains all key facts, figures, and conclusions while discarding verbosity. For most agent tasks, a well-written 500-token summary of 5000 tokens of raw search results is more useful than the full raw content — both because it fits better in context and because the model focuses on what matters.
Misconception: context management is only needed for very long tasks. Even agents with 5–10 tool calls benefit from selective retention — discarding tool results that are no longer relevant to remaining steps. The benefit is not just avoiding limits but improving the signal-to-noise ratio.
Related Terms#
- Context Window — The model limit being managed
- Agent State — The structured data alongside which context lives
- AI Agent Memory — Long-term storage that complements context management
- Agent Loop — The execution cycle where context accumulates
- Agentic Workflow — Multi-step workflows requiring careful context management
- Understanding AI Agent Architecture — Architecture tutorial covering memory and context management patterns
- CrewAI vs LangChain — Comparing how different frameworks approach context management
Frequently Asked Questions#
What is context management in AI agents?#
Context management is the practice of controlling what information is present in an AI agent's context window at each step. As agents run multi-step tasks, their context accumulates history, tool results, and intermediate findings. Without management, context grows until it hits limits or becomes too noisy for the model to focus effectively.
Why does context management matter for long-running agents?#
Long-running agents face two compounding problems as context grows: token limits (most models cap at 128K–200K tokens) and attention degradation (models attend less reliably to information buried in large contexts). Without active management, agents drift from their goals, repeat steps, and produce inconsistent results.
What are the main context management strategies?#
The main strategies are selective retention (keep only what is still relevant), summarization (compress old results into condensed digests), retrieval-augmented injection (store in a vector database and retrieve when relevant), sliding window (rolling window of recent context), and hierarchical memory (separate working memory from session history).
How do agent frameworks handle context management?#
LangGraph provides explicit state schemas with control over what persists. LangChain offers ConversationSummaryMemory and ConversationBufferWindowMemory. The OpenAI Assistants API manages thread context automatically with server-side truncation. Most production agents implement custom context management tailored to their task structure and token budget requirements.