What Is Context Management in AI Agents?
Quick Definition#
Context management is the set of techniques for controlling what information occupies an AI agent's context window across multiple reasoning steps. As agents run multi-step tasks, their context accumulates conversation history, tool results, intermediate findings, and system instructions. Without management, the context window fills up or becomes so noisy that the agent loses focus. Good context management ensures the agent always has the most relevant information available — no more, no less.
Browse all AI agent terms in the AI Agent Glossary. For the context window limits being managed, see Context Window. For persistent storage beyond the window, see AI Agent Memory.
Why Context Management Matters#
Every LLM has a fixed context window — a maximum number of tokens it can process in one call. For GPT-4o, it is 128K tokens. For Claude, up to 200K. This sounds large, but long-running agents can exhaust it:
- A research agent searching 20 web pages accumulates 50K–100K tokens of raw content
- A multi-day coding project has hundreds of messages and code blocks
- A customer support agent carries rich conversation history and product documentation into every call
Beyond raw token limits, context quality degrades as it grows. Research has shown that LLMs attend less reliably to information in the middle of very long contexts (the "lost in the middle" problem). An agent with a cluttered 100K-token context will often perform worse than one with a focused 20K-token context on the same task.
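The arithmetic is easy to sketch. Here is a minimal pre-flight check using the common ~4-characters-per-token heuristic — an approximation, not a real tokenizer, and the function names are illustrative:

```python
def rough_token_count(text: str) -> int:
    # Common heuristic: roughly 4 characters per token for English text
    return len(text) // 4

def will_exhaust_window(chunks: list[str], window: int = 128_000,
                        reserve: int = 4_000) -> bool:
    """True if the accumulated chunks would crowd out the reserved response budget."""
    total = sum(rough_token_count(c) for c in chunks)
    return total > window - reserve

# 20 scraped pages of ~7K tokens each: past a 128K window even before
# counting the system prompt and conversation history
pages = ["x" * 28_000] * 20
print(will_exhaust_window(pages))  # True
```

A check like this, run before each model call, is often the first piece of context management an agent acquires.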
Core Context Management Strategies#
1. Selective Retention#
Only keep tool results and context that are still relevant to remaining steps:
```python
class SelectiveContextManager:
    def __init__(self, remaining_steps: list[str]):
        self.remaining_steps = remaining_steps

    def filter_context(self, accumulated_results: list[dict]) -> list[dict]:
        """Keep only results relevant to remaining work."""
        relevant = []
        for result in accumulated_results:
            # Keep this result if any keyword from a remaining step matches its tags
            if any(keyword in result.get("tags", [])
                   for step in self.remaining_steps
                   for keyword in step.split()):
                relevant.append(result)
        return relevant
```
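A self-contained sketch of the same keyword-matching idea — the tags and step strings here are made up for illustration:

```python
def filter_results(results: list[dict], remaining_steps: list[str]) -> list[dict]:
    """Keep a result only if a keyword from some remaining step matches one of its tags."""
    keywords = {kw for step in remaining_steps for kw in step.lower().split()}
    return [r for r in results
            if keywords & {t.lower() for t in r.get("tags", [])}]

results = [
    {"content": "Pricing table for competitor X", "tags": ["pricing"]},
    {"content": "Founding history of competitor X", "tags": ["history"]},
]
kept = filter_results(results, ["compare pricing plans"])
print([r["tags"] for r in kept])  # [['pricing']] — the history result is dropped
```

Keyword overlap is crude; production agents often let the model itself decide which results remain relevant, but the retention pattern is the same.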
2. Summarization#
Compress old content into condensed summaries when it is still needed but takes too many tokens:
```python
from anthropic import Anthropic

client = Anthropic()

def summarize_tool_results(results: list[str], max_tokens: int = 500) -> str:
    """Compress multiple tool results into a concise summary."""
    combined = "\n\n".join(results)
    # Only summarize if the content is large
    if len(combined.split()) < 200:
        return combined
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=max_tokens,
        messages=[{
            "role": "user",
            "content": f"""Summarize these research findings into a concise digest.
Preserve all key facts, names, numbers, and conclusions.

Content to summarize:
{combined}"""
        }],
    )
    return response.content[0].text

class SummarizingAgent:
    def __init__(self, summarize_after: int = 5):
        self.results = []
        self.summarized_context = ""
        self.summarize_after = summarize_after

    def add_result(self, result: str):
        self.results.append(result)
        # Summarize old results when the buffer fills
        if len(self.results) >= self.summarize_after:
            self.summarized_context += "\n" + summarize_tool_results(self.results)
            self.results = []  # Clear the buffer after summarizing

    def get_context(self) -> str:
        """Return the current working context."""
        parts = []
        if self.summarized_context:
            parts.append(f"Summary of prior work:\n{self.summarized_context}")
        if self.results:
            parts.append("Recent findings:\n" + "\n".join(self.results))
        return "\n\n".join(parts)
```
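The buffer-flush behavior can be exercised without an API call by injecting a stub summarizer — a toy stand-in for the LLM call above:

```python
class SummarizingBuffer:
    """Same flush pattern as SummarizingAgent, with the summarizer injected
    so the buffering logic runs without an LLM call."""
    def __init__(self, summarize, flush_after: int = 3):
        self.summarize = summarize
        self.flush_after = flush_after
        self.results: list[str] = []
        self.summary = ""

    def add(self, result: str):
        self.results.append(result)
        if len(self.results) >= self.flush_after:
            # Fold the full buffer into the running summary, then clear it
            self.summary += "\n" + self.summarize(self.results)
            self.results = []

stub = lambda rs: f"[summary of {len(rs)} results]"  # Stand-in for the LLM
buf = SummarizingBuffer(stub, flush_after=3)
for i in range(4):
    buf.add(f"finding {i}")
print(buf.summary.strip())  # [summary of 3 results]
print(buf.results)          # ['finding 3'] — the unflushed remainder
```

Injecting the summarizer also makes the flush logic unit-testable, which is hard when the summarization step is welded to a live API client.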
3. Sliding Window#
Maintain a rolling window of the most recent N messages:
```python
class SlidingWindowContext:
    def __init__(self, max_messages: int = 20, always_keep: int = 3):
        self.messages = []
        self.max_messages = max_messages
        # Always keep the first N messages (system prompt, initial task)
        self.always_keep = always_keep

    def add_message(self, message: dict):
        self.messages.append(message)
        self._trim()

    def _trim(self):
        if len(self.messages) <= self.max_messages:
            return
        # Keep the first always_keep messages plus the most recent messages
        pinned = self.messages[:self.always_keep]
        recent = self.messages[self.always_keep:][-(self.max_messages - self.always_keep):]
        self.messages = pinned + recent

    def get_messages(self) -> list[dict]:
        return self.messages
```
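The trim rule is easier to see end-to-end in a compact functional form; integers stand in for message dicts here:

```python
def sliding_window(messages: list, max_messages: int = 5, always_keep: int = 2) -> list:
    """Pinned-prefix sliding window: keep the first always_keep items,
    then the most recent items up to max_messages total."""
    if len(messages) <= max_messages:
        return messages
    return messages[:always_keep] + messages[-(max_messages - always_keep):]

msgs = list(range(10))
print(sliding_window(msgs))  # [0, 1, 7, 8, 9]
```

The pinned prefix is what distinguishes this from naive truncation: without it, a long run would eventually evict the system prompt and the original task statement.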
4. Retrieval-Augmented Context Injection#
Store information in a vector database and retrieve only what is relevant to the current step:
```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

class RetrievalContextManager:
    def __init__(self):
        self.vectorstore = Chroma(embedding_function=OpenAIEmbeddings())

    def store_result(self, content: str, metadata: dict | None = None):
        """Store a tool result for later retrieval."""
        self.vectorstore.add_texts([content], metadatas=[metadata or {}])

    def retrieve_relevant(self, current_query: str, k: int = 3) -> list[str]:
        """Retrieve the most relevant stored results for the current reasoning step."""
        results = self.vectorstore.similarity_search(current_query, k=k)
        return [doc.page_content for doc in results]

    def build_step_context(self, current_task: str) -> str:
        """Build focused context for the current step by retrieving what is relevant."""
        relevant_results = self.retrieve_relevant(current_task)
        return "\n\n".join(relevant_results)
```
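To illustrate the pattern without a vector database, here is a toy retriever that ranks stored results by word overlap with the query — a crude stand-in for embedding similarity, with made-up sample data:

```python
def retrieve_relevant(store: list[str], query: str, k: int = 2) -> list[str]:
    """Toy retrieval: rank stored results by word overlap with the query.
    A real system would use embedding similarity, as with Chroma above."""
    q = set(query.lower().split())
    scored = sorted(store,
                    key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

store = [
    "competitor pricing starts at 49 dollars per seat",
    "the company was founded in 2014 in Berlin",
    "enterprise pricing requires an annual contract",
]
print(retrieve_relevant(store, "pricing per seat", k=2))
# The two pricing results rank above the founding-history result
```

The structural point is the same either way: the agent's context at each step contains only the top-k stored items relevant to that step, not everything it has ever gathered.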
5. Context in LangGraph#
LangGraph provides explicit state-based context control:
```python
from typing import Annotated, List, TypedDict

from langgraph.graph import StateGraph

def append_or_reset(existing: List[str], new: List[str] | None) -> List[str]:
    """Reducer: append new results, or clear the list when a node returns None.
    (A plain operator.add reducer can only append, so it cannot clear.)"""
    if new is None:
        return []
    return existing + new

class ResearchState(TypedDict):
    query: str
    # Accumulated results; a node returns None to clear them
    raw_results: Annotated[List[str], append_or_reset]
    # Replace the summary on each update
    context_summary: str
    # Track what we still need
    remaining_tasks: List[str]

def compress_context_node(state: ResearchState) -> dict:
    """Summarize raw_results when they get large."""
    if sum(len(r) for r in state["raw_results"]) > 10_000:
        summary = summarize_tool_results(state["raw_results"])
        return {
            "context_summary": state["context_summary"] + "\n" + summary,
            "raw_results": None,  # Signal the reducer to clear raw results
        }
    return {}
```
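The size-threshold rule in compress_context_node can be checked without LangGraph or an LLM call by extracting it into a plain function with an injected summarizer (names here are hypothetical):

```python
def compress_if_large(raw_results: list[str], context_summary: str,
                      summarize, threshold: int = 10_000):
    """Same size-threshold rule as compress_context_node, framework-free.
    Returns the (possibly updated) summary and the (possibly cleared) buffer."""
    if sum(len(r) for r in raw_results) > threshold:
        return context_summary + "\n" + summarize(raw_results), []
    return context_summary, raw_results

stub = lambda rs: f"[digest of {len(rs)} results]"  # Stand-in for the LLM call
print(compress_if_large(["x" * 6000, "y" * 6000], "", stub))
print(compress_if_large(["short"], "prior", stub))  # ('prior', ['short'])
```

Keeping the threshold logic pure like this makes it easy to tune the cutoff against recorded traces before wiring it into a graph.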
Context Budget Monitoring#
Proactively monitor token usage to avoid hitting limits:
```python
import tiktoken

def estimate_tokens(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def build_context_within_budget(
    messages: list[dict],
    system_prompt: str,
    budget: int = 100_000,
) -> list[dict]:
    """Trim message history to fit within a token budget."""
    # Nothing to trim — also avoids duplicating a lone message as both first and last
    if len(messages) < 2:
        return messages
    system_tokens = estimate_tokens(system_prompt)
    available = budget - system_tokens - 2000  # Reserve 2K tokens for the response
    # Always include the first and last messages
    first = messages[0]
    last = messages[-1]
    middle = messages[1:-1]
    remaining_budget = available - estimate_tokens(str(first)) - estimate_tokens(str(last))
    # Add as many middle messages as the budget allows (most recent first)
    included_middle = []
    for msg in reversed(middle):
        tokens = estimate_tokens(str(msg))
        if tokens > remaining_budget:
            break
        included_middle.insert(0, msg)
        remaining_budget -= tokens
    return [first] + included_middle + [last]
```
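The keep-first-and-last policy can be demonstrated with a pluggable counter — whitespace word count stands in for tiktoken here so the sketch runs with no dependencies:

```python
def trim_to_budget(messages: list[str], budget: int,
                   count=lambda m: len(m.split())) -> list[str]:
    """Same policy as build_context_within_budget: pin the first and last
    messages, then fill backward from the most recent middle messages."""
    if len(messages) < 2:
        return messages
    remaining = budget - count(messages[0]) - count(messages[-1])
    kept = []
    for msg in reversed(messages[1:-1]):  # Most recent middle messages first
        c = count(msg)
        if c > remaining:
            break
        kept.insert(0, msg)
        remaining -= c
    return [messages[0]] + kept + [messages[-1]]

msgs = ["task brief", "a a a a a", "b b b", "c c", "final question"]
print(trim_to_budget(msgs, budget=11))
# ['task brief', 'b b b', 'c c', 'final question'] — the oldest middle message is dropped
```

Filling backward means that when the budget runs out, it is always the oldest middle messages that fall away, never the task statement or the latest turn.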
Common Misconceptions#
Misconception: larger context windows eliminate the need for context management. Even with 200K-token windows, context management remains important. The "lost in the middle" attention problem means that models attend less reliably to information buried in large contexts. A focused 30K-token context often outperforms an unmanaged 150K-token one on the same task.
Misconception: summarization always loses information. Good summarization retains all key facts, figures, and conclusions while discarding verbosity. For most agent tasks, a well-written 500-token summary of 5000 tokens of raw search results is more useful than the full raw content — both because it fits better in context and because the model focuses on what matters.
Misconception: context management is only needed for very long tasks. Even agents with 5–10 tool calls benefit from selective retention — discarding tool results that are no longer relevant to remaining steps. The benefit is not just avoiding limits but improving the signal-to-noise ratio.
Related Terms#
- Context Window — The model limit being managed
- Agent State — The structured data alongside which context lives
- AI Agent Memory — Long-term storage that complements context management
- Agent Loop — The execution cycle where context accumulates
- Agentic Workflow — Multi-step workflows requiring careful context management
- Understanding AI Agent Architecture — Architecture tutorial covering memory and context management patterns
- CrewAI vs LangChain — Comparing how different frameworks approach context management
Frequently Asked Questions#
What is context management in AI agents?#
Context management is the practice of controlling what information is present in an AI agent's context window at each step. As agents run multi-step tasks, their context accumulates history, tool results, and intermediate findings. Without management, context grows until it hits limits or becomes too noisy for the model to focus effectively.
Why does context management matter for long-running agents?#
Long-running agents face two compounding problems as context grows: token limits (most models cap at 128K–200K tokens) and attention degradation (models attend less reliably to information buried in large contexts). Without active management, agents drift from their goals, repeat steps, and produce inconsistent results.
What are the main context management strategies?#
The main strategies are selective retention (keep only what is still relevant), summarization (compress old results into condensed digests), retrieval-augmented injection (store in a vector database and retrieve when relevant), sliding window (rolling window of recent context), and hierarchical memory (separate working memory from session history).
How do agent frameworks handle context management?#
LangGraph provides explicit state schemas with control over what persists. LangChain offers ConversationSummaryMemory and ConversationBufferWindowMemory. The OpenAI Assistants API manages thread context automatically with server-side truncation. Most production agents implement custom context management tailored to their task structure and token budget requirements.