What Is LLM Routing?
Quick Definition#
LLM routing is the practice of directing AI queries or tasks to different language models based on their complexity, cost requirements, latency constraints, or specialized capabilities. Rather than sending every request to the most expensive frontier model, routing ensures simple tasks go to fast, cheap models while complex reasoning reaches powerful models that can handle it. In agent systems, routing can significantly reduce operational costs — often 50–80% — while maintaining quality where it matters.
Browse all AI agent terms in the AI Agent Glossary. For the execution infrastructure where routing operates, see Agent Runtime. For understanding the context window constraints that influence routing decisions, see Context Window.
Why LLM Routing Matters#
The economics of LLM deployment create a fundamental tradeoff: powerful models are 10–100× more expensive per token than smaller models, but not every query requires frontier-model capability:
| Task Type | Complexity | Suitable Model |
|---|---|---|
| Intent classification | Low | Small (Haiku, GPT-3.5) |
| Entity extraction | Low | Small |
| Simple Q&A with context | Medium | Medium (Sonnet, GPT-4o-mini) |
| Multi-step reasoning | High | Large (Opus, GPT-4o) |
| Code generation | High | Large or specialized |
| Complex analysis + synthesis | Very High | Large frontier |
Sending all queries to the most capable model is expensive and unnecessary. Sending all queries to the cheapest model fails on complex tasks. Routing finds the efficient frontier.
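To make the tradeoff concrete, here is a back-of-the-envelope blended-cost calculation. The per-1K-token prices are illustrative placeholders, not published pricing:

```python
# Illustrative per-1K-token costs for three tiers (placeholder numbers)
COST_PER_1K = {"fast": 0.001, "balanced": 0.005, "powerful": 0.015}

def blended_cost(traffic_share: dict) -> float:
    """Expected cost per 1K tokens given the share of traffic per tier."""
    return sum(COST_PER_1K[tier] * share for tier, share in traffic_share.items())

everything_powerful = blended_cost({"powerful": 1.0})
routed = blended_cost({"fast": 0.5, "balanced": 0.3, "powerful": 0.2})
savings = 1 - routed / everything_powerful
print(f"routed: ${routed:.4f}/1K tokens, savings: {savings:.0%}")
```

With half the traffic on the cheap tier, the blended cost drops to a third of the all-frontier baseline, which is where headline figures like "50–80% savings" come from.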
Routing Strategies#
1. Rule-Based Routing#
Explicit rules dispatch tasks based on detected properties:
```python
from anthropic import Anthropic

client = Anthropic()

MODEL_TIERS = {
    "fast": "claude-haiku-4-5-20251001",  # Cheapest, fastest
    "balanced": "claude-sonnet-4-6",      # Middle tier
    "powerful": "claude-opus-4-6"         # Most capable, most expensive
}

def classify_query_complexity(query: str) -> str:
    """Rule-based routing by query characteristics."""
    query_lower = query.lower()
    words = query.split()

    # Simple classification tasks → fast model
    if len(words) < 10 and any(kw in query_lower for kw in
            ["classify", "label", "is this", "yes or no", "true or false"]):
        return "fast"

    # Complex reasoning indicators → powerful model
    complex_indicators = [
        "analyze", "synthesize", "compare and contrast",
        "provide a comprehensive", "step by step", "design",
        "explain why", "what are the implications"
    ]
    if any(indicator in query_lower for indicator in complex_indicators):
        return "powerful"

    # Length-based: long queries often require more reasoning
    if len(words) > 100:
        return "powerful"

    # Default: balanced tier
    return "balanced"

def route_query(query: str, system_prompt: str = "") -> str:
    """Route a query to the appropriate model tier."""
    tier = classify_query_complexity(query)
    model = MODEL_TIERS[tier]
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": query}]
    )
    print(f"Routed to: {tier} ({model})")
    return response.content[0].text
```
2. Semantic Routing#
Classify query intent and dispatch to a specialized model. Intent can be detected with embeddings (nearest-neighbor match against example queries per category) or, as in this sketch, by asking a cheap model to classify:
```python
from anthropic import Anthropic

client = Anthropic()

# Example queries per category (these could seed an embedding index;
# the LLM classifier below does not use them directly)
ROUTING_EXAMPLES = {
    "code": [
        "write a function that sorts a list",
        "debug this Python code",
        "explain what this regex does",
        "how do I implement authentication in FastAPI"
    ],
    "analysis": [
        "analyze the key trends in this data",
        "what are the strategic implications of this decision",
        "compare these two approaches"
    ],
    "extraction": [
        "extract all email addresses from this text",
        "what is the company name mentioned here",
        "list all the dates in this document"
    ]
}

MODEL_FOR_TASK = {
    "code": "claude-sonnet-4-6",
    "analysis": "claude-opus-4-6",
    "extraction": "claude-haiku-4-5-20251001"
}

def semantic_route(query: str) -> str:
    """Use a fast model to classify and route the query."""
    # Use the cheap model to classify
    classification = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": f"""Classify this query as one of: code, analysis, extraction

Query: {query}

Respond with only the category name."""
        }]
    )
    category = classification.content[0].text.strip().lower()
    if category not in MODEL_FOR_TASK:
        category = "analysis"  # Default to the most capable for unknown types

    target_model = MODEL_FOR_TASK[category]

    # Execute on the appropriate model
    response = client.messages.create(
        model=target_model,
        max_tokens=2048,
        messages=[{"role": "user", "content": query}]
    )
    return response.content[0].text
```
3. Cascading Routing (Quality-Gated Escalation)#
Start with a cheap model; escalate if the response confidence is insufficient:
```python
from anthropic import Anthropic
import json

client = Anthropic()

def cascading_route(query: str,
                    confidence_threshold: float = 0.8) -> dict:
    """Try the cheap model first; escalate if confidence is too low."""
    # Tier 1: fast model with confidence self-assessment
    tier1_response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": f"""{query}

After your response, assess your confidence on a 0-1 scale.
Format: {{"answer": "...", "confidence": 0.0-1.0, "reason": "..."}}"""
        }]
    )

    try:
        result = json.loads(tier1_response.content[0].text)
        confidence = result.get("confidence", 0.5)
    except json.JSONDecodeError:
        # If JSON parsing fails, treat as low confidence
        result = {"answer": tier1_response.content[0].text}
        confidence = 0.5

    # If confidence is sufficient, return the tier 1 result
    if confidence >= confidence_threshold:
        return {
            "answer": result.get("answer", ""),
            "model_used": "claude-haiku-4-5-20251001",
            "confidence": confidence,
            "escalated": False
        }

    # Escalate to the more capable model
    tier2_response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": query}]
    )
    return {
        "answer": tier2_response.content[0].text,
        "model_used": "claude-opus-4-6",
        "confidence": 1.0,  # Assume high confidence from the powerful model
        "escalated": True,
        "escalation_reason": result.get("reason", "Low confidence from tier 1")
    }
```
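The escalation control flow can be exercised without any API calls by stubbing the two tiers. The stub responses and confidence values below are made up for illustration:

```python
def cascade(query: str, tier1, tier2, threshold: float = 0.8) -> dict:
    """Generic cascade: accept tier 1 if its self-reported confidence clears the bar."""
    answer, confidence = tier1(query)
    if confidence >= threshold:
        return {"answer": answer, "confidence": confidence, "escalated": False}
    answer, confidence = tier2(query)
    return {"answer": answer, "confidence": confidence, "escalated": True}

# Stub models: tier 1 is only confident on short queries
tier1 = lambda q: ("short answer", 0.9 if len(q.split()) < 8 else 0.4)
tier2 = lambda q: ("thorough answer", 0.95)

print(cascade("What is 2 + 2?", tier1, tier2))  # stays on tier 1
print(cascade("Analyze the long-run economic implications of routing policies in detail",
              tier1, tier2))                    # escalates to tier 2
```

Separating the cascade logic from the model calls like this also makes the escalation policy unit-testable, which matters once thresholds are tuned in production.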
4. Cost-Aware Routing with LiteLLM#
LiteLLM provides unified access to 100+ models with built-in routing:
```python
from litellm import completion

def cost_aware_route(query: str, max_cost_per_1k_tokens: float = 0.01):
    """Route to the highest-quality model within the cost budget."""
    # Model catalog with illustrative cost tiers, cheapest first
    models_by_cost = [
        {"model": "claude-haiku-4-5-20251001", "cost": 0.001},  # Cheapest
        {"model": "claude-sonnet-4-6", "cost": 0.005},
        {"model": "claude-opus-4-6", "cost": 0.015}             # Most expensive
    ]

    # Select the most capable model within budget
    selected = next(
        (m for m in reversed(models_by_cost) if m["cost"] <= max_cost_per_1k_tokens),
        models_by_cost[0]  # Fall back to the cheapest
    )

    response = completion(
        model=selected["model"],
        messages=[{"role": "user", "content": query}]
    )
    return {
        "response": response.choices[0].message.content,
        "model": selected["model"],
        "estimated_cost_per_1k": selected["cost"]
    }
```
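The selection step is plain Python and can be checked in isolation, including the fallback when no model fits the budget (same illustrative costs as above):

```python
MODELS_BY_COST = [
    {"model": "claude-haiku-4-5-20251001", "cost": 0.001},
    {"model": "claude-sonnet-4-6", "cost": 0.005},
    {"model": "claude-opus-4-6", "cost": 0.015},
]

def select_within_budget(budget: float) -> str:
    """Most capable (most expensive) model whose cost fits the budget."""
    selected = next(
        (m for m in reversed(MODELS_BY_COST) if m["cost"] <= budget),
        MODELS_BY_COST[0],  # Fall back to the cheapest if nothing fits
    )
    return selected["model"]

print(select_within_budget(0.01))    # mid budget → sonnet tier
print(select_within_budget(0.02))    # generous budget → opus tier
print(select_within_budget(0.0001))  # budget below every model → cheapest fallback
```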
LLM Routing in Agent Workflows#
In multi-step agents, different steps often warrant different models:
```python
from anthropic import Anthropic

client = Anthropic()

class RoutedAgent:
    """Agent that routes different steps to appropriate models."""

    STEP_MODELS = {
        "planning": "claude-opus-4-6",                         # Complex reasoning
        "tool_call_decision": "claude-sonnet-4-6",             # Moderate reasoning
        "result_classification": "claude-haiku-4-5-20251001",  # Simple classification
        "summarization": "claude-sonnet-4-6",                  # Moderate capability
        "final_synthesis": "claude-opus-4-6"                   # Complex integration
    }

    def run_step(self, step_type: str, prompt: str) -> str:
        model = self.STEP_MODELS.get(step_type, "claude-sonnet-4-6")
        response = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text
```
Common Misconceptions#
**Misconception: Routing always degrades quality.** Routing degrades quality only when queries are misrouted to models below their capability requirements. For well-classified queries, routing is transparent to quality: a simple extraction task handled by a fast model often matches a frontier model's output, with less tendency to over-elaborate.
**Misconception: Routing complexity outweighs the savings.** At scale, routing logic typically costs pennies per thousand queries (a cheap classifier call) while saving dollars in reduced powerful-model usage. Even a simple two-tier rule-based router that sends 50% of traffic to the cheaper model can cut LLM costs by 40–60%.
**Misconception: Routing only matters for cost, not performance.** Routing also improves latency: fast models often respond in 200–500 ms versus 2–5 seconds for large models. For latency-sensitive applications (chatbots, real-time analysis), sending simple queries to fast models noticeably improves the user experience.
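As with cost, the latency benefit follows directly from the traffic mix. A rough expected-latency calculation, using illustrative latency figures rather than measured benchmarks:

```python
# Illustrative median latencies in seconds (not measured benchmarks)
LATENCY = {"fast": 0.35, "powerful": 3.0}

def expected_latency(fast_share: float) -> float:
    """Expected response time when fast_share of traffic goes to the fast model."""
    return fast_share * LATENCY["fast"] + (1 - fast_share) * LATENCY["powerful"]

baseline = expected_latency(0.0)  # everything on the large model
routed = expected_latency(0.7)    # 70% of queries are simple
print(f"{baseline:.2f}s -> {routed:.2f}s expected latency")
```

Note this is an average; the p99 latency of the routed system is still set by the slow model, so routing helps typical-case responsiveness rather than tail latency.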
Related Terms#
- Agent Runtime — The execution infrastructure implementing routing decisions
- Agent SDK — SDKs like LiteLLM that provide routing abstractions
- Context Window — Context limits that influence model selection
- Tool Calling — Tool-calling capability differences across model tiers
- Agentic Workflow — Multi-step workflows where per-step routing reduces costs
- Build Your First AI Agent — Tutorial covering model selection in agent design
- LangChain vs AutoGen — How frameworks handle model routing across tasks
Frequently Asked Questions#
What is LLM routing?#
LLM routing directs AI queries to different language models based on complexity, cost, or capability requirements. Simple tasks go to fast, cheap models; complex reasoning goes to powerful frontier models. This approach typically reduces LLM costs by 50-80% while maintaining quality on the tasks that need it.
What are the main approaches to LLM routing?#
The main approaches are: rule-based routing (explicit complexity thresholds and keyword matching), semantic routing (classifier detecting task intent), cascading (start cheap, escalate if confidence is low), and cost-aware routing (select highest-quality model within a cost budget). LiteLLM and RouteLLM are popular open-source tools.
When should I use LLM routing vs always using the best model?#
Use routing when your application handles diverse query types at scale, cost is a concern, or latency matters for simpler operations. Always use the best model when all queries are uniformly complex, or when routing overhead exceeds savings (typically only for very low-volume applications).
How accurate does LLM routing need to be?#
For quality-sensitive applications, target >90% routing accuracy. For pure cost-optimization, occasional over-routing (complex queries to powerful models) costs slightly more but doesn't harm quality. Build escalation fallbacks: when a routed model's response confidence is low, retry with a more capable model.