# What Is Token Efficiency in AI Agents?

## Quick Definition
Token efficiency is the practice of minimizing token consumption in AI agent LLM calls while maintaining output quality. Every token sent to or received from a model has a direct cost and contributes to latency. In production agents that may execute thousands of LLM calls per day, poor token efficiency is the most common cause of unexpectedly high operational costs.
Browse all AI agent terms in the AI Agent Glossary. For understanding token limits as a constraint, see Context Window. For routing queries to cheaper models as an efficiency strategy, see LLM Routing.
## Why Token Efficiency Matters at Scale
The cost of a single LLM call is negligible. The cost of a production agent handling 10,000 requests per day is not:
| Agent Pattern | Avg Tokens/Call | Daily Calls | Monthly Cost (illustrative blended rate) |
|---|---|---|---|
| Unoptimized | 8,000 | 10,000 | ~$2,700 |
| Optimized (50% reduction) | 4,000 | 10,000 | ~$1,350 |
| Highly optimized (75% reduction) | 2,000 | 10,000 | ~$675 |
A 75% token reduction through prompt engineering, context pruning, and model routing saves $24,000/year on a single agent. Token efficiency is a direct business lever.
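The table's arithmetic can be sketched as a quick cost model. The per-MTok rate is a parameter, not an official price: the figures above imply a blended rate of roughly $1.13/MTok (plausible only with heavy cache discounts and routing to cheaper models), so treat the rate as an assumption you plug in.

```python
def monthly_cost(tokens_per_call: int,
                 calls_per_day: int,
                 usd_per_mtok: float,
                 days: int = 30) -> float:
    """Monthly spend for a fixed per-call token budget at a blended $/MTok rate."""
    monthly_tokens = tokens_per_call * calls_per_day * days
    return monthly_tokens / 1_000_000 * usd_per_mtok

# The table rows above, at the blended rate they imply (~$1.125/MTok):
for tokens in (8_000, 4_000, 2_000):
    print(f"{tokens:>5} tokens/call -> ${monthly_cost(tokens, 10_000, 1.125):,.0f}/month")
```

Swapping in your provider's actual input/output rates (and your real tokens-per-call average from tracing) turns this into a budget forecast.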
## Token Consumption Profile
Before optimizing, understand where tokens actually go in a typical agent session:
```python
from dataclasses import dataclass

@dataclass
class TokenBreakdown:
    """Track token consumption by source across an agent session."""
    system_prompt_tokens: int = 0
    conversation_history_tokens: int = 0
    tool_result_tokens: int = 0
    output_tokens: int = 0

    @property
    def total_input_tokens(self) -> int:
        return (self.system_prompt_tokens +
                self.conversation_history_tokens +
                self.tool_result_tokens)

    @property
    def total_tokens(self) -> int:
        return self.total_input_tokens + self.output_tokens

    def report(self) -> str:
        total = self.total_tokens
        if total == 0:
            return "No tokens recorded"
        return "\n".join([
            f"System prompt: {self.system_prompt_tokens:,} "
            f"({self.system_prompt_tokens/total*100:.1f}%)",
            f"Conversation history: {self.conversation_history_tokens:,} "
            f"({self.conversation_history_tokens/total*100:.1f}%)",
            f"Tool results: {self.tool_result_tokens:,} "
            f"({self.tool_result_tokens/total*100:.1f}%)",
            f"Output tokens: {self.output_tokens:,} "
            f"({self.output_tokens/total*100:.1f}%)",
            f"Total: {total:,} tokens",
        ])
```
In a typical research agent, the system prompt and conversation history together account for 60-80% of input tokens, tool results (especially web search) add another 10-30%, and the genuinely new query context is often under 5%.
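To make that profile concrete, here is a toy breakdown with illustrative (not measured) counts for a single call:

```python
# Illustrative token counts for one agent call (hypothetical numbers)
profile = {
    "system_prompt": 3_200,
    "conversation_history": 4_100,
    "tool_results": 1_600,
    "new_query": 300,
}

total = sum(profile.values())  # 9,200 input tokens
shares = {k: round(v / total * 100, 1) for k, v in profile.items()}

# With these numbers: system_prompt + conversation_history ~79% of input,
# tool_results ~17%, and the new query itself only ~3%.
```

Plugging in real counts from your tracing data shows where optimization effort will pay off first.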
## Optimization Techniques

### 1. Prompt Caching for Static Context
Anthropic's prompt caching stores a static prompt prefix (system prompt plus background docs) so repeated calls don't pay full price to re-process it:
```python
from anthropic import Anthropic

client = Anthropic()

# Static content marked for caching (system prompt + background docs)
CACHED_SYSTEM = """You are a research assistant specializing in AI agents.

[BACKGROUND CONTEXT - CACHED]
This context is cached and only processed once:
{background_docs}
[END CACHED CONTEXT]
"""

def run_with_prompt_cache(user_message: str,
                          background_docs: str,
                          model: str = "claude-opus-4-6") -> dict:
    """Use prompt caching to avoid re-processing static context."""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": CACHED_SYSTEM.format(background_docs=background_docs),
                "cache_control": {"type": "ephemeral"},  # Mark for caching
            }
        ],
        messages=[{"role": "user", "content": user_message}],
    )

    # Check whether the cache was used
    cache_read = getattr(response.usage, "cache_read_input_tokens", 0)
    cache_created = getattr(response.usage, "cache_creation_input_tokens", 0)

    return {
        "response": response.content[0].text,
        "tokens": {
            "uncached": response.usage.input_tokens,
            "from_cache": cache_read,
            "cache_created": cache_created,
            "output": response.usage.output_tokens,
        },
    }
```
Prompt caching provides a 90% cost reduction for cached tokens and up to 85% latency reduction on cache hits. It's the highest-ROI token optimization for agents with long, static system prompts.
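A back-of-envelope model of that discount, as a sketch: it assumes cache reads are billed at 10% of the base input rate (matching the 90% figure above), ignores the one-time cache-write premium, and defaults `base_usd_per_mtok` to the $15/MTok Opus input rate used elsewhere in this article.

```python
def input_cost(fresh_tokens: int,
               cached_tokens: int,
               base_usd_per_mtok: float = 15.0,
               cache_read_discount: float = 0.10) -> float:
    """Effective input cost per call when part of the prompt is served from cache."""
    fresh = fresh_tokens / 1_000_000 * base_usd_per_mtok
    cached = cached_tokens / 1_000_000 * base_usd_per_mtok * cache_read_discount
    return fresh + cached

# 10,000-token cached system prompt + 500 fresh tokens per call,
# versus sending all 10,500 tokens uncached every time:
with_cache = input_cost(fresh_tokens=500, cached_tokens=10_000)
without_cache = input_cost(fresh_tokens=10_500, cached_tokens=0)
```

With these assumptions the cached call costs roughly a seventh of the uncached one; the bigger the static prefix relative to the fresh tokens, the closer savings approach the full 90%.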
### 2. Conversation History Pruning
Unbounded conversation history is the most common source of runaway token costs:
```python
from anthropic import Anthropic
import json

client = Anthropic()

def prune_history(messages: list,
                  max_turns: int = 10,
                  summarize_older: bool = True) -> list:
    """
    Keep recent history verbatim, summarize older turns.
    A 'turn' is a user+assistant message pair.
    """
    if len(messages) <= max_turns * 2:
        return messages  # Within limit, no pruning needed

    # Split: recent to keep + older to summarize
    cutoff = len(messages) - (max_turns * 2)
    older_messages = messages[:cutoff]
    recent_messages = messages[cutoff:]

    if not summarize_older:
        return recent_messages

    # Summarize older turns with a cheap model
    summary_response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Cheap model for summarization
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Summarize this conversation history in 2-3 sentences,
preserving key facts, decisions, and completed steps:

{json.dumps(older_messages, indent=2)}"""
        }]
    )
    summary = summary_response.content[0].text

    summary_message = {
        "role": "user",
        "content": f"[Summary of prior conversation: {summary}]"
    }
    return [summary_message] + recent_messages

def token_efficient_agent(user_request: str, max_history: int = 8) -> str:
    """Agent with automatic history pruning."""
    messages = [{"role": "user", "content": user_request}]

    for _ in range(10):
        # Prune before each call
        pruned_messages = prune_history(messages, max_turns=max_history)

        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=2048,
            messages=pruned_messages,
            tools=[]  # Define tools as needed
        )

        if response.stop_reason == "end_turn":
            return next(b.text for b in response.content if hasattr(b, "text"))

        messages.append({"role": "assistant", "content": response.content})
        # ... handle tool calls, append results

    return "Max iterations reached"
```
### 3. Tool Result Truncation
Tool results — especially web search and document retrieval — can be surprisingly large. Truncate aggressively before passing to the model:
```python
def truncate_tool_result(result: str,
                         max_chars: int = 2000,
                         preserve_structure: bool = True) -> str:
    """Truncate tool results while preserving key information."""
    if len(result) <= max_chars:
        return result

    if preserve_structure:
        # Keep the start (often has the most relevant content)
        # and the end (often has a summary or conclusion)
        head = result[:max_chars * 2 // 3]
        tail = result[-(max_chars // 3):]
        return f"{head}\n\n[... {len(result) - max_chars} characters truncated ...]\n\n{tail}"

    return result[:max_chars] + f"\n[... truncated, {len(result) - max_chars} chars remaining]"

def run_web_search(query: str) -> str:
    """Web search with automatic result truncation."""
    raw_results = perform_web_search(query)  # Your search implementation
    return truncate_tool_result(
        raw_results,
        max_chars=3000,  # ~750 tokens
        preserve_structure=True
    )
```
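The `max_chars=3000` / "~750 tokens" mapping rests on a rough heuristic of about 4 characters per token for English text. A tiny helper makes that assumption explicit and tunable:

```python
def char_budget(token_budget: int, chars_per_token: float = 4.0) -> int:
    """Rough character budget for a token budget (English text averages ~4 chars/token)."""
    return int(token_budget * chars_per_token)

# e.g. truncate_tool_result(raw, max_chars=char_budget(750)) targets ~750 tokens
```

The ratio varies by language and content (code and non-English text tokenize less densely), so calibrate against your model's real tokenizer counts before relying on it.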
### 4. Output Format Constraints

Unconstrained output format is a silent source of token waste. Specify exactly what you need:
```python
# Inefficient: no format constraint
VERBOSE_PROMPT = "Analyze this code and tell me what you think."

# Efficient: explicit format
CONSTRAINED_PROMPT = """Analyze this code. Respond in this exact format:

ISSUES: [numbered list, max 3 items, each under 15 words]
SEVERITY: [critical|high|medium|low]
FIX: [single sentence action item]"""

# Efficient: structured extraction
JSON_CONSTRAINED = """Extract key information. Return JSON only:
{"summary": "max 2 sentences", "action_required": true/false, "priority": 1-5}"""
```
Format constraints routinely reduce output token count by 40-70% compared to open-ended prompts.
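A crude way to see the effect is to estimate tokens for a free-form answer versus a constrained one. This is a sketch using the ~4-chars-per-token heuristic with invented sample outputs; real counts require the model's tokenizer.

```python
def rough_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

# Hypothetical model outputs for the same code-review task:
free_form = ("The code looks mostly fine overall, although there are a few things "
             "I would consider changing. First, the error handling could be more "
             "robust, and second, the naming of several variables is unclear...")
constrained = ("ISSUES:\n1. Error handling too broad\n"
               "SEVERITY: medium\nFIX: Catch specific exceptions.")

saved = 1 - rough_tokens(constrained) / rough_tokens(free_form)
```

Even on this toy pair the constrained format cuts the estimated output tokens by more than half; measure on your own traffic before banking on a specific percentage.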
## Common Misconceptions

**Misconception: longer system prompts always produce better behavior.** System prompt length is a cost that compounds across every call. A 10,000-token system prompt adds $0.15 per call at Opus pricing, or $1,500/day for a 10,000-call agent. Concise, specific instructions often outperform verbose ones. Remove anything the agent doesn't actively use.

**Misconception: token efficiency degrades quality.** Aggressive truncation of irrelevant context often improves quality: the "lost in the middle" phenomenon means models attend less reliably to information buried in long contexts. Focused contexts with relevant information outperform context-flooded prompts on most tasks.

**Misconception: only output tokens matter.** Input tokens are typically cheaper per token than output tokens, but they dominate the total count in most agents. A system prompt processed 10,000 times contributes far more total cost than the output tokens on those same calls.
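The arithmetic behind that last point, using the $15/$75 per-MTok Opus input/output rates consistent with the $0.15-per-call figure above (the 500-token average output is an assumption for illustration):

```python
CALLS_PER_DAY = 10_000
SYSTEM_PROMPT_TOKENS = 10_000      # re-sent as input on every call
OUTPUT_TOKENS_PER_CALL = 500       # assumed average output length

INPUT_USD_PER_MTOK = 15.0
OUTPUT_USD_PER_MTOK = 75.0

input_cost_per_day = SYSTEM_PROMPT_TOKENS / 1e6 * INPUT_USD_PER_MTOK * CALLS_PER_DAY
output_cost_per_day = OUTPUT_TOKENS_PER_CALL / 1e6 * OUTPUT_USD_PER_MTOK * CALLS_PER_DAY

# Despite the 5x higher per-token output rate, the repeated system prompt
# costs 4x more per day than all the output tokens combined.
```

Under these assumptions the input side runs $1,500/day against $375/day for output, which is why system-prompt and history optimization usually come first.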
## Related Terms
- Context Window — The token limit that makes efficiency necessary
- LLM Routing — Route simple tasks to cheaper models as an efficiency lever
- Agent Tracing — Measure actual token usage to identify optimization opportunities
- Prompt Chaining — Decompose tasks to avoid unnecessarily large single prompts
- Agent Loop — The repeating cycle where token costs accumulate
- Build Your First AI Agent — Tutorial covering cost-aware agent design
- LangChain vs AutoGen — How frameworks support token management
## Frequently Asked Questions

### What is token efficiency in AI agents?
Token efficiency is reducing token consumption in LLM calls while maintaining output quality. It involves prompt caching, conversation history pruning, tool result truncation, output format constraints, and model routing — any technique that reduces the tokens processed per task without degrading results.
### What consumes the most tokens in an AI agent?
System prompts repeated on every call, accumulated conversation history, and verbose tool results (web search, database queries) together account for 60-80% of total token usage in typical agents. Output tokens are a smaller fraction for most analytical or research agents.
### How does prompt caching improve token efficiency?
Anthropic's prompt caching marks static prompt prefixes (system prompt, background docs) for caching. Subsequent calls sharing the same prefix pay only for new tokens. For agents with long system prompts, caching reduces input token costs by 60-90% and latency by 85% on cache hits.
### How should I truncate conversation history in agents?
Keep the most recent 8-10 message pairs verbatim and summarize older turns with a cheap model (e.g., Haiku). Never drop old messages without summarizing — agents lose track of completed steps. Measure quality before deploying truncation: some tasks require more history than others.