What Is LLM Routing?
Quick Definition#
LLM routing is the practice of directing AI queries or tasks to different language models based on their complexity, cost requirements, latency constraints, or specialized capabilities. Rather than sending every request to the most expensive frontier model, routing ensures simple tasks go to fast, cheap models while complex reasoning reaches powerful models that can handle it. In agent systems, routing can significantly reduce operational costs — often 50–80% — while maintaining quality where it matters.
Browse all AI agent terms in the AI Agent Glossary. For the execution infrastructure where routing operates, see Agent Runtime. For understanding the context window constraints that influence routing decisions, see Context Window.
Why LLM Routing Matters#
The economics of LLM deployment create a fundamental tradeoff: powerful models are 10–100× more expensive per token than smaller models, but not every query requires frontier-model capability:
| Task Type | Complexity | Suitable Model |
|---|---|---|
| Intent classification | Low | Small (Haiku, GPT-3.5) |
| Entity extraction | Low | Small |
| Simple Q&A with context | Medium | Medium (Sonnet, GPT-4o-mini) |
| Multi-step reasoning | High | Large (Opus, GPT-4o) |
| Code generation | High | Large or specialized |
| Complex analysis + synthesis | Very High | Large frontier |
Sending all queries to the most capable model is expensive and unnecessary. Sending all queries to the cheapest model fails on complex tasks. Routing finds the efficient frontier.
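To make the tradeoff concrete, here is a back-of-the-envelope blended-cost calculation. The per-1K-token prices are illustrative placeholders, not published pricing:

```python
# Illustrative per-1K-token costs for three tiers (placeholder numbers)
COST_PER_1K = {"fast": 0.001, "balanced": 0.005, "powerful": 0.015}

def blended_cost(traffic_share: dict) -> float:
    """Expected cost per 1K tokens given the share of traffic per tier."""
    return sum(COST_PER_1K[tier] * share for tier, share in traffic_share.items())

everything_powerful = blended_cost({"powerful": 1.0})
routed = blended_cost({"fast": 0.5, "balanced": 0.3, "powerful": 0.2})
savings = 1 - routed / everything_powerful
print(f"routed: ${routed:.4f}/1K tokens, savings: {savings:.0%}")
```

With half the traffic on the cheap tier, the blended cost drops to a third of the all-frontier baseline, which is where headline figures like "50–80% savings" come from.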
Routing Strategies#
1. Rule-Based Routing#
Explicit rules dispatch tasks based on detected properties:
```python
from anthropic import Anthropic

client = Anthropic()

MODEL_TIERS = {
    "fast": "claude-haiku-4-5-20251001",  # Cheapest, fastest
    "balanced": "claude-sonnet-4-6",      # Middle tier
    "powerful": "claude-opus-4-6"         # Most capable, most expensive
}

def classify_query_complexity(query: str) -> str:
    """Rule-based routing by query characteristics."""
    query_lower = query.lower()
    words = query.split()

    # Simple classification tasks → fast model
    if len(words) < 10 and any(kw in query_lower for kw in
            ["classify", "label", "is this", "yes or no", "true or false"]):
        return "fast"

    # Complex reasoning indicators → powerful model
    complex_indicators = [
        "analyze", "synthesize", "compare and contrast",
        "provide a comprehensive", "step by step", "design",
        "explain why", "what are the implications"
    ]
    if any(indicator in query_lower for indicator in complex_indicators):
        return "powerful"

    # Length-based: long queries often require more reasoning
    if len(words) > 100:
        return "powerful"

    # Default: balanced tier
    return "balanced"

def route_query(query: str, system_prompt: str = "") -> str:
    """Route a query to the appropriate model tier."""
    tier = classify_query_complexity(query)
    model = MODEL_TIERS[tier]
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": query}]
    )
    print(f"Routed to: {tier} ({model})")
    return response.content[0].text
```
2. Semantic Routing#
Classify query intent and dispatch to a specialized model. Intent can be detected with embeddings (nearest-neighbor match against example queries per category) or, as in this sketch, by asking a cheap model to classify:
```python
from anthropic import Anthropic

client = Anthropic()

# Example queries per category (these could seed an embedding index;
# the LLM classifier below does not use them directly)
ROUTING_EXAMPLES = {
    "code": [
        "write a function that sorts a list",
        "debug this Python code",
        "explain what this regex does",
        "how do I implement authentication in FastAPI"
    ],
    "analysis": [
        "analyze the key trends in this data",
        "what are the strategic implications of this decision",
        "compare these two approaches"
    ],
    "extraction": [
        "extract all email addresses from this text",
        "what is the company name mentioned here",
        "list all the dates in this document"
    ]
}

MODEL_FOR_TASK = {
    "code": "claude-sonnet-4-6",
    "analysis": "claude-opus-4-6",
    "extraction": "claude-haiku-4-5-20251001"
}

def semantic_route(query: str) -> str:
    """Use a fast model to classify and route the query."""
    # Use the cheap model to classify
    classification = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": f"""Classify this query as one of: code, analysis, extraction

Query: {query}

Respond with only the category name."""
        }]
    )
    category = classification.content[0].text.strip().lower()
    if category not in MODEL_FOR_TASK:
        category = "analysis"  # Default to the most capable for unknown types

    target_model = MODEL_FOR_TASK[category]

    # Execute on the appropriate model
    response = client.messages.create(
        model=target_model,
        max_tokens=2048,
        messages=[{"role": "user", "content": query}]
    )
    return response.content[0].text
```
3. Cascading Routing (Quality-Gated Escalation)#
Start with a cheap model; escalate if the response confidence is insufficient:
```python
from anthropic import Anthropic
import json

client = Anthropic()

def cascading_route(query: str,
                    confidence_threshold: float = 0.8) -> dict:
    """Try the cheap model first; escalate if confidence is too low."""
    # Tier 1: fast model with confidence self-assessment
    tier1_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": f"""{query}

After your response, assess your confidence on a 0-1 scale.
Format: {{"answer": "...", "confidence": 0.0-1.0, "reason": "..."}}"""
        }]
    )

    try:
        result = json.loads(tier1_response.content[0].text)
        confidence = result.get("confidence", 0.5)
    except json.JSONDecodeError:
        # If JSON parsing fails, treat as low confidence
        result = {"answer": tier1_response.content[0].text}
        confidence = 0.5

    # If confidence is sufficient, return the tier 1 result
    if confidence >= confidence_threshold:
        return {
            "answer": result.get("answer", ""),
            "model_used": "claude-haiku-4-5-20251001",
            "confidence": confidence,
            "escalated": False
        }

    # Escalate to the more capable model
    tier2_response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": query}]
    )
    return {
        "answer": tier2_response.content[0].text,
        "model_used": "claude-opus-4-6",
        "confidence": 1.0,  # Assume high confidence from the powerful model
        "escalated": True,
        "escalation_reason": result.get("reason", "Low confidence from tier 1")
    }
```
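The escalation control flow can be exercised without any API calls by stubbing the two tiers. The stub responses and confidence values below are made up for illustration:

```python
def cascade(query: str, tier1, tier2, threshold: float = 0.8) -> dict:
    """Generic cascade: accept tier 1 if its self-reported confidence clears the bar."""
    answer, confidence = tier1(query)
    if confidence >= threshold:
        return {"answer": answer, "confidence": confidence, "escalated": False}
    answer, confidence = tier2(query)
    return {"answer": answer, "confidence": confidence, "escalated": True}

# Stub models: tier 1 is only confident on short queries
tier1 = lambda q: ("short answer", 0.9 if len(q.split()) < 8 else 0.4)
tier2 = lambda q: ("thorough answer", 0.95)

print(cascade("What is 2 + 2?", tier1, tier2))  # stays on tier 1
print(cascade("Analyze the long-run economic implications of routing policies in detail",
              tier1, tier2))                    # escalates to tier 2
```

Separating the cascade logic from the model calls like this also makes the escalation policy unit-testable, which matters once thresholds are tuned in production.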
4. Cost-Aware Routing with LiteLLM#
LiteLLM provides unified access to 100+ models with built-in routing:
```python
from litellm import completion

def cost_aware_route(query: str, max_cost_per_1k_tokens: float = 0.01):
    """Route to the highest-quality model within the cost budget."""
    # Model catalog with illustrative cost tiers, cheapest first
    models_by_cost = [
        {"model": "claude-haiku-4-5-20251001", "cost": 0.001},  # Cheapest
        {"model": "claude-sonnet-4-6", "cost": 0.005},
        {"model": "claude-opus-4-6", "cost": 0.015}             # Most expensive
    ]

    # Select the most capable model within budget
    selected = next(
        (m for m in reversed(models_by_cost) if m["cost"] <= max_cost_per_1k_tokens),
        models_by_cost[0]  # Fall back to the cheapest
    )

    response = completion(
        model=selected["model"],
        messages=[{"role": "user", "content": query}]
    )
    return {
        "response": response.choices[0].message.content,
        "model": selected["model"],
        "estimated_cost_per_1k": selected["cost"]
    }
```
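The selection step is plain Python and can be checked in isolation, including the fallback when no model fits the budget (same illustrative costs as above):

```python
MODELS_BY_COST = [
    {"model": "claude-haiku-4-5-20251001", "cost": 0.001},
    {"model": "claude-sonnet-4-6", "cost": 0.005},
    {"model": "claude-opus-4-6", "cost": 0.015},
]

def select_within_budget(budget: float) -> str:
    """Most capable (most expensive) model whose cost fits the budget."""
    selected = next(
        (m for m in reversed(MODELS_BY_COST) if m["cost"] <= budget),
        MODELS_BY_COST[0],  # Fall back to the cheapest if nothing fits
    )
    return selected["model"]

print(select_within_budget(0.01))    # mid budget → sonnet tier
print(select_within_budget(0.02))    # generous budget → opus tier
print(select_within_budget(0.0001))  # budget below every model → cheapest fallback
```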
LLM Routing in Agent Workflows#
In multi-step agents, different steps often warrant different models:
```python
from anthropic import Anthropic

client = Anthropic()

class RoutedAgent:
    """Agent that routes different steps to appropriate models."""

    STEP_MODELS = {
        "planning": "claude-opus-4-6",                         # Complex reasoning
        "tool_call_decision": "claude-sonnet-4-6",             # Moderate reasoning
        "result_classification": "claude-haiku-4-5-20251001",  # Simple classification
        "summarization": "claude-sonnet-4-6",                  # Moderate capability
        "final_synthesis": "claude-opus-4-6"                   # Complex integration
    }

    def run_step(self, step_type: str, prompt: str) -> str:
        model = self.STEP_MODELS.get(step_type, "claude-sonnet-4-6")
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text
```
Common Misconceptions#
**Misconception: Routing always degrades quality.** Routing degrades quality only when queries are misrouted to models below their capability requirements. For well-classified queries, routing is transparent to quality: a simple extraction task handled by a fast model often matches a frontier model's output, with less tendency to over-elaborate.
**Misconception: Routing complexity outweighs the savings.** At scale, routing logic typically costs pennies per thousand queries (a cheap classifier call) while saving dollars in reduced powerful-model usage. Even a simple two-tier rule-based router that sends 50% of traffic to the cheaper model can cut LLM costs by 40–60%.
**Misconception: Routing only matters for cost, not performance.** Routing also improves latency: fast models often respond in 200–500 ms versus 2–5 seconds for large models. For latency-sensitive applications (chatbots, real-time analysis), sending simple queries to fast models noticeably improves the user experience.
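As with cost, the latency benefit follows directly from the traffic mix. A rough expected-latency calculation, using illustrative latency figures rather than measured benchmarks:

```python
# Illustrative median latencies in seconds (not measured benchmarks)
LATENCY = {"fast": 0.35, "powerful": 3.0}

def expected_latency(fast_share: float) -> float:
    """Expected response time when fast_share of traffic goes to the fast model."""
    return fast_share * LATENCY["fast"] + (1 - fast_share) * LATENCY["powerful"]

baseline = expected_latency(0.0)  # everything on the large model
routed = expected_latency(0.7)    # 70% of queries are simple
print(f"{baseline:.2f}s -> {routed:.2f}s expected latency")
```

Note this is an average; the p99 latency of the routed system is still set by the slow model, so routing helps typical-case responsiveness rather than tail latency.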
Related Terms#
- Agent Runtime — The execution infrastructure implementing routing decisions
- Agent SDK — SDKs like LiteLLM that provide routing abstractions
- Context Window — Context limits that influence model selection
- Tool Calling — Tool-calling capability differences across model tiers
- Agentic Workflow — Multi-step workflows where per-step routing reduces costs
- Build Your First AI Agent — Tutorial covering model selection in agent design
- LangChain vs AutoGen — How frameworks handle model routing across tasks
Frequently Asked Questions#
What is LLM routing?#
LLM routing directs AI queries to different language models based on complexity, cost, or capability requirements. Simple tasks go to fast, cheap models; complex reasoning goes to powerful frontier models. This approach typically reduces LLM costs by 50-80% while maintaining quality on the tasks that need it.
What are the main approaches to LLM routing?#
The main approaches are: rule-based routing (explicit complexity thresholds and keyword matching), semantic routing (classifier detecting task intent), cascading (start cheap, escalate if confidence is low), and cost-aware routing (select highest-quality model within a cost budget). LiteLLM and RouteLLM are popular open-source tools.
When should I use LLM routing vs always using the best model?#
Use routing when your application handles diverse query types at scale, cost is a concern, or latency matters for simpler operations. Always use the best model when all queries are uniformly complex, or when routing overhead exceeds savings (typically only for very low-volume applications).
How accurate does LLM routing need to be?#
For quality-sensitive applications, target >90% routing accuracy. For pure cost-optimization, occasional over-routing (complex queries to powerful models) costs slightly more but doesn't harm quality. Build escalation fallbacks: when a routed model's response confidence is low, retry with a more capable model.