What Is LLM Cost per Token? (2026)

This glossary entry explains how AI language model pricing works: input vs. output tokens, prompt caching discounts, batch API pricing, and a full cost comparison across GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro.

By AI Agents Guide Team • March 1, 2026

Term Snapshot

Also known as: AI Token Pricing, LLM Inference Cost, Token Cost Calculation

Related terms: What Is Token Efficiency in AI Agents?, What Is Agent Cost Optimization?, What Is LLM Routing?, What Are AI Agents?

Table of Contents

  1. What Is LLM Cost per Token?
  2. How Token Pricing Works
  3. Input vs. Output Token Pricing
  4. Pricing Units
  5. LLM Pricing Comparison (2026)
  6. Frontier Models
  7. Mid-Tier Models (Best Value)
  8. Cost Ratio: Frontier vs. Mid-Tier
  9. Prompt Caching: The Highest-Leverage Discount
  10. Anthropic Prompt Caching
  11. OpenAI Prompt Caching
  12. Which Provider Offers Better Caching?
  13. Batch API Pricing
  14. OpenAI Batch API
  15. Anthropic Message Batches
  16. Real-World Cost Scenarios
  17. Scenario 1: Customer Service Agent (Real-time)
  18. Scenario 2: Document Analysis Pipeline (Batch)
  19. Monitoring Token Costs in Production
  20. Related Resources
  21. More Resources

What Is LLM Cost per Token?

LLM cost per token is the pricing unit used by AI model providers to charge for usage of large language models. A "token" is roughly equivalent to 4 characters or 0.75 words in English — "the quick brown fox" is approximately 4 tokens. Providers charge separately for input tokens (what you send to the model) and output tokens (what the model generates back), with output tokens consistently priced higher.

Understanding token pricing is foundational to estimating, managing, and optimizing AI agent costs at scale. A single LLM call might cost a fraction of a cent, but production agents making thousands of calls per hour can accumulate significant monthly bills.
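
Token counts are model-specific, so when you need exact numbers rather than the 4-characters-per-token rule of thumb, use the provider's tokenizer. A minimal sketch using OpenAI's open-source tiktoken library (counts from other providers' tokenizers will differ):

```python
# pip install tiktoken
import tiktoken

# Tokenizer for GPT-4o; other model families use different vocabularies
enc = tiktoken.encoding_for_model("gpt-4o")

text = "the quick brown fox"
token_ids = enc.encode(text)
print(len(token_ids))  # 4 for this phrase, matching the rough estimate above
```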

How Token Pricing Works

Input vs. Output Token Pricing

The fundamental pricing split:

  • Input tokens = your prompt, system instructions, conversation history, retrieved context
  • Output tokens = the model's generated response

Output tokens are more expensive because text generation (autoregressive decoding) is computationally more intensive than encoding input text. The pricing ratio varies by model:

  • GPT-4o: Output is 4x more expensive than input
  • Claude 3.5 Sonnet: Output is 5x more expensive than input
  • Gemini 1.5 Pro: Output is 4x more expensive than input

This asymmetry means that for agents producing long, detailed outputs — reports, code, elaborate plans — output costs dominate. For classification agents that return short labels, input costs dominate. Design your agent's output format accordingly.

Pricing Units

Most providers price in "per million tokens" to avoid small decimal numbers. When you see "$2.50/1M input tokens," that means:

  • 1 million tokens = $2.50
  • 1,000 tokens = $0.0025
  • 100 tokens = $0.00025

A typical agent interaction with 2,000 input tokens and 500 output tokens using GPT-4o costs approximately: (2,000/1,000,000 x $2.50) + (500/1,000,000 x $10.00) = $0.005 + $0.005 = $0.01 — one cent per interaction.
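
Once you estimate more than one workload, this arithmetic is worth wrapping in a helper. A minimal sketch (the prices are the 2026 list rates from the comparison tables below, and the function name is illustrative):

```python
# Illustrative 2026 list prices in USD per 1M tokens (from the tables below)
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
    "gemini-1.5-pro": {"input": 1.25, "output": 5.00},
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one LLM call at list prices (no caching or batch discount)."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# The interaction from the text: 2,000 input + 500 output tokens on GPT-4o
print(call_cost("gpt-4o", 2_000, 500))  # 0.01, i.e. one cent
```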

LLM Pricing Comparison (2026)

Frontier Models

Model                    Input (per 1M)   Output (per 1M)   Context Window   Notes
GPT-4o                   $2.50            $10.00            128K             Cached input: $1.25
Claude 3.5 Sonnet        $3.00            $15.00            200K             Cached input: $0.30
Gemini 1.5 Pro           $1.25            $5.00             2M               Rates apply to prompts ≤128K
Gemini 1.5 Pro (long)    $2.50            $10.00            2M               Rates apply to prompts >128K

Mid-Tier Models (Best Value)

Model              Input (per 1M)   Output (per 1M)   Best For
GPT-4o-mini        $0.15            $0.60             Classification, routing, simple extraction
Claude 3.5 Haiku   $0.80            $4.00             Fast complex tasks, coding, structured output
Gemini 1.5 Flash   $0.075           $0.30             High-volume, cost-sensitive workloads

Cost Ratio: Frontier vs. Mid-Tier

Using GPT-4o vs GPT-4o-mini for the same input volume:

  • GPT-4o input: $2.50/1M
  • GPT-4o-mini input: $0.15/1M
  • Cost savings: 94% cheaper for input tokens

The quality gap has narrowed significantly in 2025-2026. For straightforward tasks, mid-tier models perform comparably to frontier models at a fraction of the cost.
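
One way to capture that saving is to route by task difficulty and reserve the frontier model for requests that need it. A minimal sketch (the tier names and model mapping are illustrative assumptions, not a standard taxonomy):

```python
# Hypothetical tier-to-model mapping; calibrate against your own evals
MODEL_BY_TIER = {
    "simple": "gpt-4o-mini",       # classification, routing, extraction
    "medium": "claude-3.5-haiku",  # structured output, light coding
    "complex": "gpt-4o",           # multi-step reasoning, long-form output
}

def pick_model(tier: str) -> str:
    """Return the cheapest model expected to handle a task of this tier."""
    return MODEL_BY_TIER.get(tier, "gpt-4o")  # unknown tiers fall back to frontier

print(pick_model("simple"))  # gpt-4o-mini (~94% cheaper per input token)
```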

Prompt Caching: The Highest-Leverage Discount

For production agents, prompt caching typically delivers the largest cost savings. Both major providers now offer caching, with different mechanics:

Anthropic Prompt Caching

Anthropic's prefix caching stores the beginning of your prompt (system instructions + stable context) and serves cache hits at approximately 10% of the normal input price — a 90% discount.

  • Cache write cost: 25% premium over standard input pricing on the first write
  • Cache hit cost: ~10% of standard input pricing
  • Cache lifetime: 5 minutes (refreshed on each hit)
  • Minimum cacheable prompt: 1,024 tokens (Sonnet/Opus), 2,048 tokens (Haiku)

Example calculation:

  • System prompt: 8,000 tokens
  • Normal input cost: 8,000 x $3.00/1M = $0.024 per call
  • Cache hit cost: 8,000 x $0.30/1M = $0.0024 per call
  • Daily savings at 1,000 calls: $21.60
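
In code, Anthropic's caching is opt-in: you mark the stable prefix with a cache_control block. A minimal sketch using the Anthropic Python SDK (the model alias, system prompt, and user message are placeholders):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "..."  # placeholder for the stable ~8,000-token instructions

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Everything up to this marker is cached for ~5 minutes
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the customer's last ticket."}],
)

# usage reports cache_creation_input_tokens on the first call and
# cache_read_input_tokens (billed at ~10% of input price) on subsequent hits
print(response.usage)
```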

OpenAI Prompt Caching

OpenAI applies automatic prompt caching for prompts over 1,024 tokens, with no developer configuration required. Cache hits are charged at 50% of the standard input price.

  • Cache hit cost: 50% of standard input price
  • Cache eligibility: prompts ≥1,024 tokens, automatically managed
  • Cache lifetime: varies; frequently used prompts stay cached longer

OpenAI's caching is less dramatic than Anthropic's but requires zero implementation effort.
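
Because OpenAI's caching is automatic, the main integration task is verifying that hits are actually happening. A minimal sketch with the openai Python SDK (the system prompt is a placeholder standing in for a stable ≥1,024-token prefix):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "..."  # placeholder for a stable system prompt of ≥1,024 tokens

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Classify this support ticket."},
    ],
)

usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens  # billed at 50% of input price
print(f"{cached}/{usage.prompt_tokens} input tokens served from cache")
```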

Which Provider Offers Better Caching?

For agents with large, stable system prompts:

  • Anthropic wins for maximum savings (90% off cache hits)
  • OpenAI wins for ease of implementation (zero configuration)
  • At scale with large prompts, Anthropic's deeper discount justifies the explicit cache management

Batch API Pricing

For workloads that can tolerate latency (typically 1-24 hours), batch APIs offer significant discounts:

OpenAI Batch API

  • Discount: 50% off standard pricing
  • Delivery: Within 24 hours
  • Use cases: Bulk document processing, overnight analysis, content generation pipelines
  • Limitation: No streaming, asynchronous only
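
In practice, OpenAI's flow has two steps: upload a JSONL file of requests, then create the batch job. A minimal sketch with the openai Python SDK (the file name and request bodies are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# requests.jsonl holds one request per line, e.g.:
# {"custom_id": "doc-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o", "messages": [...], "max_tokens": 512}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Poll until status == "completed", then download batch.output_file_id
print(batch.id, batch.status)
```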

Anthropic Message Batches

  • Discount: 50% off standard pricing
  • Delivery: Within 24 hours
  • Use cases: Same as OpenAI; particularly cost-effective with Claude 3.5 Sonnet
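
Anthropic's flow is similar but takes the requests inline rather than as an uploaded file. A minimal sketch (the custom_id and message content are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "doc-1",  # placeholder ID for matching results later
            "params": {
                "model": "claude-3-5-sonnet-latest",
                "max_tokens": 512,
                "messages": [{"role": "user", "content": "Extract key fields from: ..."}],
            },
        },
        # ...one entry per document
    ]
)

# Poll until processing_status == "ended", then fetch results by custom_id
print(batch.id, batch.processing_status)
```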

Combined with caching: Batch processing + prompt caching can reduce frontier model costs by 70-80% for suitable workloads.

Real-World Cost Scenarios

Scenario 1: Customer Service Agent (Real-time)

  • Volume: 5,000 conversations/day
  • Input: 3,000 tokens avg (system prompt + history + user message)
  • Output: 500 tokens avg (response)
  • Model: Claude 3.5 Sonnet with prompt caching

Uncached cost: (3,000 x $3.00 + 500 x $15.00) / 1,000,000 x 5,000 calls
= 16,500 / 1,000,000 x 5,000
= $82.50/day = ~$2,475/month

With 80% cache hit rate (2,400 cached + 600 uncached input tokens):
Cached portion: 2,400 x $0.30/1M x 5,000 = $3.60/day
Uncached portion: 600 x $3.00/1M x 5,000 = $9.00/day
Output: 500 x $15.00/1M x 5,000 = $37.50/day
Total: ~$50.10/day = ~$1,503/month

Savings: ~$972/month (39% reduction)

Scenario 2: Document Analysis Pipeline (Batch)

  • Volume: 10,000 documents/day
  • Input: 8,000 tokens avg per document
  • Output: 300 tokens avg (structured extract)
  • Model: GPT-4o batch API

Standard GPT-4o: (8,000 x $2.50 + 300 x $10.00) / 1,000,000 x 10,000 calls
= 23,000 / 1,000,000 x 10,000 = $230/day

Batch API (50% off): $115/day = ~$3,450/month

Monitoring Token Costs in Production

Never deploy a production agent without cost monitoring. Integrate observability from day one:

  • LangFuse: Open-source, traces every LLM call with cost attribution. Self-hostable.
  • Helicone: Proxy-based cost tracking with zero code changes. Shows cost per user, per model, per endpoint.
  • Provider dashboards: OpenAI, Anthropic, and Google all provide usage dashboards with per-model breakdowns. Set billing alerts.

For teams optimizing costs, set up dashboards tracking: cost per task, cost per user session, model distribution, and cache hit rates.
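
Before adopting a dedicated tool, a first-party baseline is easy: read the usage object returned with every API response and log a cost per call. A minimal sketch with the openai Python SDK (the price constants, function name, and attribution fields are illustrative):

```python
import logging
from openai import OpenAI

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm-costs")

client = OpenAI()
GPT4O = {"input": 2.50, "output": 10.00}  # USD per 1M tokens, list rates

def tracked_chat(messages: list[dict], user_id: str) -> str:
    """Call GPT-4o and log the call's cost, attributed to a user."""
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    u = response.usage
    cost = (u.prompt_tokens * GPT4O["input"]
            + u.completion_tokens * GPT4O["output"]) / 1_000_000
    log.info("user=%s in=%d out=%d cost=$%.6f",
             user_id, u.prompt_tokens, u.completion_tokens, cost)
    return response.choices[0].message.content
```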

Related Resources

  • Agent Cost Optimization Techniques
  • Token Efficiency
  • LLM Routing
  • How Much Does It Cost to Build an AI Agent?
  • Best AI Agent Observability Tools

More Resources

Browse the complete AI agent glossary for more AI agent terminology.

Tags: pricing, operations, cost

Related Glossary Terms

What Is Agent Cost Optimization?

Agent cost optimization covers techniques to reduce the operational cost of running AI agents — including prompt caching, LLM routing, request batching, smaller model selection, and context window management.

What Are AI Agent Benchmarks?

AI agent benchmarks are standardized evaluation frameworks that measure how well AI agents perform on defined tasks — enabling objective comparison of frameworks, models, and architectures across dimensions like task completion rate, tool use accuracy, multi-step reasoning, and safety.

What Are Agent Deployment Patterns?

Agent deployment patterns are established architectural approaches for shipping AI agents to production — including containerized microservices, serverless functions, persistent daemons, and edge deployments — each offering different trade-offs in latency, cost, scalability, and operational complexity.

What Is Agent Error Recovery?

Agent error recovery refers to the mechanisms AI agents use to detect failures, handle exceptions, retry operations with appropriate backoff, escalate to human review when needed, and resume work after encountering errors — essential for building agents that remain reliable in unpredictable production environments.
