# What Is Token Efficiency in AI Agents?

## Quick Definition
Token efficiency is the practice of minimizing token consumption in AI agent LLM calls while maintaining output quality. Every token sent to or received from a model has a direct cost and contributes to latency. In production agents that may execute thousands of LLM calls per day, poor token efficiency is the most common cause of unexpectedly high operational costs.
Browse all AI agent terms in the AI Agent Glossary. For understanding token limits as a constraint, see Context Window. For routing queries to cheaper models as an efficiency strategy, see LLM Routing.
## Why Token Efficiency Matters at Scale
The cost of a single LLM call is negligible. The cost of a production agent handling 10,000 requests per day is not:
| Agent Pattern | Avg Tokens/Call | Daily Calls | Monthly Cost (illustrative blended rate) |
|---|---|---|---|
| Unoptimized | 8,000 | 10,000 | ~$2,700 |
| Optimized (50% reduction) | 4,000 | 10,000 | ~$1,350 |
| Highly optimized (75% reduction) | 2,000 | 10,000 | ~$675 |
A 75% token reduction through prompt engineering, context pruning, and model routing saves $24,000/year on a single agent. Token efficiency is a direct business lever.
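The table's arithmetic can be sketched as a quick cost model. The per-MTok rate is a parameter, not an official price: the figures above imply a blended rate of roughly $1.13/MTok (plausible only with heavy cache discounts and routing to cheaper models), so treat the rate as an assumption you plug in.

```python
def monthly_cost(tokens_per_call: int,
                 calls_per_day: int,
                 usd_per_mtok: float,
                 days: int = 30) -> float:
    """Monthly spend for a fixed per-call token budget at a blended $/MTok rate."""
    monthly_tokens = tokens_per_call * calls_per_day * days
    return monthly_tokens / 1_000_000 * usd_per_mtok

# The table rows above, at the blended rate they imply (~$1.125/MTok):
for tokens in (8_000, 4_000, 2_000):
    print(f"{tokens:>5} tokens/call -> ${monthly_cost(tokens, 10_000, 1.125):,.0f}/month")
```

Swapping in your provider's actual input/output rates (and your real tokens-per-call average from tracing) turns this into a budget forecast.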
## Token Consumption Profile
Before optimizing, understand where tokens actually go in a typical agent session:
```python
from dataclasses import dataclass

@dataclass
class TokenBreakdown:
    """Track token consumption by source across an agent session."""
    system_prompt_tokens: int = 0
    conversation_history_tokens: int = 0
    tool_result_tokens: int = 0
    output_tokens: int = 0

    @property
    def total_input_tokens(self) -> int:
        return (self.system_prompt_tokens +
                self.conversation_history_tokens +
                self.tool_result_tokens)

    @property
    def total_tokens(self) -> int:
        return self.total_input_tokens + self.output_tokens

    def report(self) -> str:
        total = self.total_tokens
        if total == 0:
            return "No tokens recorded"
        return "\n".join([
            f"System prompt: {self.system_prompt_tokens:,} "
            f"({self.system_prompt_tokens/total*100:.1f}%)",
            f"Conversation history: {self.conversation_history_tokens:,} "
            f"({self.conversation_history_tokens/total*100:.1f}%)",
            f"Tool results: {self.tool_result_tokens:,} "
            f"({self.tool_result_tokens/total*100:.1f}%)",
            f"Output tokens: {self.output_tokens:,} "
            f"({self.output_tokens/total*100:.1f}%)",
            f"Total: {total:,} tokens",
        ])
```
In a typical research agent, the system prompt and conversation history together account for 60-80% of input tokens, tool results (especially web search) add another 10-30%, and the genuinely new query context is often under 5%.
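To make that profile concrete, here is a toy breakdown with illustrative (not measured) counts for a single call:

```python
# Illustrative token counts for one agent call (hypothetical numbers)
profile = {
    "system_prompt": 3_200,
    "conversation_history": 4_100,
    "tool_results": 1_600,
    "new_query": 300,
}

total = sum(profile.values())  # 9,200 input tokens
shares = {k: round(v / total * 100, 1) for k, v in profile.items()}

# With these numbers: system_prompt + conversation_history ~79% of input,
# tool_results ~17%, and the new query itself only ~3%.
```

Plugging in real counts from your tracing data shows where optimization effort will pay off first.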
## Optimization Techniques

### 1. Prompt Caching for Static Context
Anthropic's prompt caching stores a static prompt prefix (system prompt plus background docs) so repeated calls don't pay full price to re-process it:
```python
from anthropic import Anthropic

client = Anthropic()

# Static content marked for caching (system prompt + background docs)
CACHED_SYSTEM = """You are a research assistant specializing in AI agents.

[BACKGROUND CONTEXT - CACHED]
This context is cached and only processed once:
{background_docs}
[END CACHED CONTEXT]
"""

def run_with_prompt_cache(user_message: str,
                          background_docs: str,
                          model: str = "claude-opus-4-6") -> dict:
    """Use prompt caching to avoid re-processing static context."""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": CACHED_SYSTEM.format(background_docs=background_docs),
                "cache_control": {"type": "ephemeral"},  # Mark for caching
            }
        ],
        messages=[{"role": "user", "content": user_message}],
    )

    # Check whether the cache was used
    cache_read = getattr(response.usage, "cache_read_input_tokens", 0)
    cache_created = getattr(response.usage, "cache_creation_input_tokens", 0)

    return {
        "response": response.content[0].text,
        "tokens": {
            "uncached": response.usage.input_tokens,
            "from_cache": cache_read,
            "cache_created": cache_created,
            "output": response.usage.output_tokens,
        },
    }
```
Prompt caching provides a 90% cost reduction for cached tokens and up to 85% latency reduction on cache hits. It's the highest-ROI token optimization for agents with long, static system prompts.
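A back-of-envelope model of that discount, as a sketch: it assumes cache reads are billed at 10% of the base input rate (matching the 90% figure above), ignores the one-time cache-write premium, and defaults `base_usd_per_mtok` to the $15/MTok Opus input rate used elsewhere in this article.

```python
def input_cost(fresh_tokens: int,
               cached_tokens: int,
               base_usd_per_mtok: float = 15.0,
               cache_read_discount: float = 0.10) -> float:
    """Effective input cost per call when part of the prompt is served from cache."""
    fresh = fresh_tokens / 1_000_000 * base_usd_per_mtok
    cached = cached_tokens / 1_000_000 * base_usd_per_mtok * cache_read_discount
    return fresh + cached

# 10,000-token cached system prompt + 500 fresh tokens per call,
# versus sending all 10,500 tokens uncached every time:
with_cache = input_cost(fresh_tokens=500, cached_tokens=10_000)
without_cache = input_cost(fresh_tokens=10_500, cached_tokens=0)
```

With these assumptions the cached call costs roughly a seventh of the uncached one; the bigger the static prefix relative to the fresh tokens, the closer savings approach the full 90%.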
### 2. Conversation History Pruning
Unbounded conversation history is the most common source of runaway token costs:
```python
from anthropic import Anthropic
import json

client = Anthropic()

def prune_history(messages: list,
                  max_turns: int = 10,
                  summarize_older: bool = True) -> list:
    """
    Keep recent history verbatim, summarize older turns.
    A 'turn' is a user+assistant message pair.
    """
    if len(messages) <= max_turns * 2:
        return messages  # Within limit, no pruning needed

    # Split: recent to keep + older to summarize
    cutoff = len(messages) - (max_turns * 2)
    older_messages = messages[:cutoff]
    recent_messages = messages[cutoff:]

    if not summarize_older:
        return recent_messages

    # Summarize older turns with a cheap model
    summary_response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Cheap model for summarization
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Summarize this conversation history in 2-3 sentences,
preserving key facts, decisions, and completed steps:

{json.dumps(older_messages, indent=2)}"""
        }]
    )
    summary = summary_response.content[0].text

    summary_message = {
        "role": "user",
        "content": f"[Summary of prior conversation: {summary}]"
    }
    return [summary_message] + recent_messages

def token_efficient_agent(user_request: str, max_history: int = 8) -> str:
    """Agent with automatic history pruning."""
    messages = [{"role": "user", "content": user_request}]

    for _ in range(10):
        # Prune before each call
        pruned_messages = prune_history(messages, max_turns=max_history)

        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=2048,
            messages=pruned_messages,
            tools=[]  # Define tools as needed
        )

        if response.stop_reason == "end_turn":
            return next(b.text for b in response.content if hasattr(b, "text"))

        messages.append({"role": "assistant", "content": response.content})
        # ... handle tool calls, append results

    return "Max iterations reached"
```
### 3. Tool Result Truncation
Tool results — especially web search and document retrieval — can be surprisingly large. Truncate aggressively before passing to the model:
```python
def truncate_tool_result(result: str,
                         max_chars: int = 2000,
                         preserve_structure: bool = True) -> str:
    """Truncate tool results while preserving key information."""
    if len(result) <= max_chars:
        return result

    if preserve_structure:
        # Keep the start (often has the most relevant content)
        # and the end (often has a summary or conclusion)
        head = result[:max_chars * 2 // 3]
        tail = result[-(max_chars // 3):]
        return f"{head}\n\n[... {len(result) - max_chars} characters truncated ...]\n\n{tail}"

    return result[:max_chars] + f"\n[... truncated, {len(result) - max_chars} chars remaining]"

def run_web_search(query: str) -> str:
    """Web search with automatic result truncation."""
    raw_results = perform_web_search(query)  # Your search implementation
    return truncate_tool_result(
        raw_results,
        max_chars=3000,  # ~750 tokens
        preserve_structure=True
    )
```
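The `max_chars=3000` / "~750 tokens" mapping rests on a rough heuristic of about 4 characters per token for English text. A tiny helper makes that assumption explicit and tunable:

```python
def char_budget(token_budget: int, chars_per_token: float = 4.0) -> int:
    """Rough character budget for a token budget (English text averages ~4 chars/token)."""
    return int(token_budget * chars_per_token)

# e.g. truncate_tool_result(raw, max_chars=char_budget(750)) targets ~750 tokens
```

The ratio varies by language and content (code and non-English text tokenize less densely), so calibrate against your model's real tokenizer counts before relying on it.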
### 4. Output Format Constraints

Unconstrained output format is a silent source of token waste. Specify exactly what you need:
```python
# Inefficient: no format constraint
VERBOSE_PROMPT = "Analyze this code and tell me what you think."

# Efficient: explicit format
CONSTRAINED_PROMPT = """Analyze this code. Respond in this exact format:

ISSUES: [numbered list, max 3 items, each under 15 words]
SEVERITY: [critical|high|medium|low]
FIX: [single sentence action item]"""

# Efficient: structured extraction
JSON_CONSTRAINED = """Extract key information. Return JSON only:
{"summary": "max 2 sentences", "action_required": true/false, "priority": 1-5}"""
```
Format constraints routinely reduce output token count by 40-70% compared to open-ended prompts.
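A crude way to see the effect is to estimate tokens for a free-form answer versus a constrained one. This is a sketch using the ~4-chars-per-token heuristic with invented sample outputs; real counts require the model's tokenizer.

```python
def rough_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

# Hypothetical model outputs for the same code-review task:
free_form = ("The code looks mostly fine overall, although there are a few things "
             "I would consider changing. First, the error handling could be more "
             "robust, and second, the naming of several variables is unclear...")
constrained = ("ISSUES:\n1. Error handling too broad\n"
               "SEVERITY: medium\nFIX: Catch specific exceptions.")

saved = 1 - rough_tokens(constrained) / rough_tokens(free_form)
```

Even on this toy pair the constrained format cuts the estimated output tokens by more than half; measure on your own traffic before banking on a specific percentage.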
## Common Misconceptions

**Misconception: longer system prompts always produce better behavior.** System prompt length is a cost that compounds across every call. A 10,000-token system prompt adds $0.15 per call at Opus pricing, or $1,500/day for a 10,000-call agent. Concise, specific instructions often outperform verbose ones. Remove anything the agent doesn't actively use.

**Misconception: token efficiency degrades quality.** Aggressive truncation of irrelevant context often improves quality: the "lost in the middle" phenomenon means models attend less reliably to information buried in long contexts. Focused contexts with relevant information outperform context-flooded prompts on most tasks.

**Misconception: only output tokens matter.** Input tokens are typically cheaper per token than output tokens, but they dominate the total count in most agents. A system prompt processed 10,000 times contributes far more total cost than the output tokens on those same calls.
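The arithmetic behind that last point, using the $15/$75 per-MTok Opus input/output rates consistent with the $0.15-per-call figure above (the 500-token average output is an assumption for illustration):

```python
CALLS_PER_DAY = 10_000
SYSTEM_PROMPT_TOKENS = 10_000      # re-sent as input on every call
OUTPUT_TOKENS_PER_CALL = 500       # assumed average output length

INPUT_USD_PER_MTOK = 15.0
OUTPUT_USD_PER_MTOK = 75.0

input_cost_per_day = SYSTEM_PROMPT_TOKENS / 1e6 * INPUT_USD_PER_MTOK * CALLS_PER_DAY
output_cost_per_day = OUTPUT_TOKENS_PER_CALL / 1e6 * OUTPUT_USD_PER_MTOK * CALLS_PER_DAY

# Despite the 5x higher per-token output rate, the repeated system prompt
# costs 4x more per day than all the output tokens combined.
```

Under these assumptions the input side runs $1,500/day against $375/day for output, which is why system-prompt and history optimization usually come first.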
## Related Terms
- Context Window — The token limit that makes efficiency necessary
- LLM Routing — Route simple tasks to cheaper models as an efficiency lever
- Agent Tracing — Measure actual token usage to identify optimization opportunities
- Prompt Chaining — Decompose tasks to avoid unnecessarily large single prompts
- Agent Loop — The repeating cycle where token costs accumulate
- Build Your First AI Agent — Tutorial covering cost-aware agent design
- LangChain vs AutoGen — How frameworks support token management
## Frequently Asked Questions

### What is token efficiency in AI agents?
Token efficiency is reducing token consumption in LLM calls while maintaining output quality. It involves prompt caching, conversation history pruning, tool result truncation, output format constraints, and model routing — any technique that reduces the tokens processed per task without degrading results.
### What consumes the most tokens in an AI agent?
System prompts repeated on every call, accumulated conversation history, and verbose tool results (web search, database queries) together account for 60-80% of total token usage in typical agents. Output tokens are a smaller fraction for most analytical or research agents.
### How does prompt caching improve token efficiency?
Anthropic's prompt caching marks static prompt prefixes (system prompt, background docs) for caching. Subsequent calls sharing the same prefix pay only for new tokens. For agents with long system prompts, caching reduces input token costs by 60-90% and latency by 85% on cache hits.
### How should I truncate conversation history in agents?
Keep the most recent 8-10 message pairs verbatim and summarize older turns with a cheap model (e.g., Haiku). Never drop old messages without summarizing — agents lose track of completed steps. Measure quality before deploying truncation: some tasks require more history than others.