What Is Agent Error Recovery?
Agent error recovery encompasses all the mechanisms that enable AI agents to handle failures gracefully — detecting when something goes wrong, deciding how to respond, and either resolving the problem autonomously or escalating to human intervention. In production environments, errors are not exceptional events. They are routine. APIs time out, models produce unparseable output, authentication tokens expire, and external services become temporarily unavailable. Agents that lack robust error recovery become unreliable and unpredictable, requiring constant human supervision to prevent cascading failures.
Error recovery is what separates a research prototype from a production-ready agent. Building it well requires thinking systematically about what can fail, how failure manifests, and what the appropriate response is in each case.
For related topics, see Agent Observability, Agent Deployment Patterns, and Human-in-the-Loop AI. Browse all AI agent concepts in the glossary or see practical error handling in deployment tutorials.
Types of Errors AI Agents Encounter
Tool and API Failures
External dependencies fail. Common patterns:
- Network timeouts: The tool call starts but doesn't complete within the expected time
- Rate limit errors: The API returns 429 Too Many Requests because the agent is calling too frequently
- Authentication failures: API keys expire, tokens are revoked, or credential rotation fails
- Service unavailability: Third-party APIs return 503 Service Unavailable during outages
- Invalid responses: The API returns a response the agent's parser can't process
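A useful first step in handling these failures is separating retryable errors from permanent ones. As a minimal sketch — the exception classes and status-code mapping here are illustrative, not from any particular framework:

```python
class TransientToolError(Exception):
    """Retryable failure: timeout, rate limit, temporary outage."""

class PermanentToolError(Exception):
    """Non-retryable failure: bad credentials, invalid request."""

def classify_http_failure(status_code: int) -> type[Exception]:
    # 429 and 5xx responses are usually transient; auth and client
    # errors (401, 403, 400) need a fix, not a retry.
    if status_code == 429 or status_code >= 500:
        return TransientToolError
    return PermanentToolError
```

Routing each failure through a classifier like this lets the retry logic later in this article treat transient and permanent errors differently.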
LLM Errors
The language model itself can fail in several ways:
- Context length exceeded: The input is too long for the model's context window
- Rate limits: The LLM provider throttles requests
- Refusals: The model declines to perform a requested action for safety reasons
- Malformed output: The model produces output that doesn't match the expected format (JSON parse errors, missing required fields)
- Hallucinated tool calls: The model calls tools that don't exist or with incorrect arguments
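Hallucinated tool calls in particular can be caught before execution by checking the model's proposed call against a registry of known tools. A small sketch — the registry contents and function name are hypothetical:

```python
# Map each known tool to the argument names it accepts.
TOOL_REGISTRY = {
    "search_web": {"query"},
    "get_weather": {"city"},
}

def validate_tool_call(name: str, args: dict) -> list:
    """Return a list of problems; an empty list means the call looks valid."""
    problems = []
    if name not in TOOL_REGISTRY:
        problems.append(f"unknown tool: {name}")
    else:
        unexpected = set(args) - TOOL_REGISTRY[name]
        if unexpected:
            problems.append(f"unexpected arguments: {sorted(unexpected)}")
    return problems
```

Rejecting an invalid call here — and feeding the problem list back to the model — is far cheaper than executing a bad call and recovering afterward.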
Logical Errors
The agent takes incorrect steps:
- Wrong tool selected: The agent chooses an inappropriate tool for the task
- Invalid arguments: Tool is called with arguments that fail validation
- Incorrect sequencing: Steps are executed in the wrong order, producing invalid state
- Goal drift: The agent pursues a subtask in a way that undermines the main goal
State Errors
Problems with agent state management:
- Corrupted intermediate results: An earlier step produced output that later steps can't use
- Lost context: Critical context about the task goal is lost or misinterpreted
- Circular loops: The agent repeatedly attempts the same failing action
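Circular loops can be detected by counting repeated action signatures. A minimal sketch, assuming the agent reports each action it takes — the `LoopDetector` name and threshold are illustrative:

```python
from collections import Counter

class LoopDetector:
    """Flag when the agent repeats the same (tool, args) action too often."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.counts = Counter()

    def record(self, tool_name: str, args: tuple) -> bool:
        """Record an action; return True once it hits the repeat limit."""
        key = (tool_name, args)
        self.counts[key] += 1
        return self.counts[key] >= self.max_repeats
```

When `record` returns True, the agent should break out of its plan and either try an alternative approach or escalate, rather than attempt the same failing action again.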
Core Error Recovery Patterns
1. Retry with Exponential Backoff
For transient failures (network timeouts, rate limits), retry the same action after an increasing delay:
```python
import time

# RateLimitError and MaxRetriesExceeded are assumed to be defined elsewhere.
def call_tool_with_retry(tool, args, max_retries=3):
    for attempt in range(max_retries):
        try:
            return tool(args)
        except RateLimitError:
            wait_time = 2 ** attempt  # 1, 2, 4 seconds
            time.sleep(wait_time)
        except TimeoutError:
            if attempt == max_retries - 1:
                raise
            time.sleep(1)
    raise MaxRetriesExceeded(f"Tool failed after {max_retries} attempts")
```
Key principles:
- Exponential backoff avoids hammering a rate-limited service
- Cap retries — don't retry indefinitely
- Only retry idempotent operations (safe to repeat without side effects)
- Log each retry for debugging
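The classification matrix later in this article recommends adding jitter to the backoff for rate limits, which spreads out retries from concurrent agents so they don't all hit the service at the same instant. A minimal full-jitter delay helper, as a sketch:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Use it in place of the fixed `2 ** attempt` sleep: `time.sleep(backoff_delay(attempt))`.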
2. Fallback Strategies
When the primary approach fails, switch to an alternative:
Alternative tool: If search API A fails, try search API B.
Degraded mode: If real-time data is unavailable, use cached or estimated data and signal uncertainty in the response.
Simplified approach: If a complex multi-step workflow fails, fall back to a simpler single-step approximation.
Human escalation: If all automated approaches fail, route to a human reviewer.
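These alternatives can be chained: try each strategy in order, and escalate with the accumulated errors only if all of them fail. A minimal sketch (function names are illustrative):

```python
def run_with_fallbacks(task, strategies, escalate):
    """Try each strategy in order; escalate with collected errors if all fail."""
    errors = []
    for strategy in strategies:
        try:
            return strategy(task)
        except Exception as e:
            errors.append((strategy.__name__, str(e)))
    # Every automated approach failed: hand off with the full error history.
    return escalate(task, errors)
```

Passing the collected errors into the escalation handler matters — the human reviewer sees what was already tried, not just the final failure.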
3. Output Validation and Correction
Validate LLM outputs before using them:
```python
import json

# llm, ValidationError, and OutputValidationFailed are assumed to be
# provided by the surrounding application.
def get_structured_output(prompt, expected_schema):
    for attempt in range(3):
        output = llm.generate(prompt)
        try:
            parsed = json.loads(output)
            validated = expected_schema.validate(parsed)
            return validated
        except (json.JSONDecodeError, ValidationError) as e:
            # Feed the error back so the model can self-correct.
            prompt += (
                f"\n\nYour previous response was invalid: {e}. "
                "Please try again and ensure your response is valid JSON "
                "matching the required schema."
            )
    raise OutputValidationFailed("Could not get valid output after 3 attempts")
```
Including the error message in the retry prompt often helps the model self-correct.
4. Checkpointing and Resumption
For long-running tasks, save intermediate state so failed tasks can resume from the last successful step rather than starting over:
```python
# checkpoint_store is an assumed persistence interface (Redis, database, file).
class AgentTask:
    def __init__(self, task_id, steps):
        self.task_id = task_id
        self.steps = steps
        self.completed_steps = self.load_checkpoint()

    def execute(self):
        for step in self.steps:
            if step.name in self.completed_steps:
                continue  # Skip already-completed steps
            result = step.execute()
            self.save_checkpoint(step.name, result)

    def load_checkpoint(self):
        # Return the set of step names already persisted for this task
        return checkpoint_store.get_completed(self.task_id) or set()

    def save_checkpoint(self, step_name, result):
        # Persist to Redis, a database, or a file
        checkpoint_store.set(f"{self.task_id}:{step_name}", result)
```
Checkpointing is essential for agents running tasks that take minutes or hours — a failure at step 8 of 10 shouldn't require rerunning steps 1–7.
5. Human Escalation
Define clear escalation conditions and routing:
- Confidence threshold: Escalate when the agent's confidence in its planned action falls below a threshold
- Repeated failures: Escalate after N failed attempts at a task
- Scope violation: Escalate when the task requires actions outside the agent's authorized scope
- Ambiguity resolution: Escalate when the task description is too ambiguous to proceed safely
Escalation should include context — what the agent was trying to do, what went wrong, and what information would help a human resolve it.
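That context can be captured in a structured escalation payload so every handoff carries the same fields. A hypothetical sketch using a dataclass — the field names are illustrative, not a standard:

```python
from dataclasses import asdict, dataclass, field

@dataclass
class EscalationTicket:
    task_id: str
    goal: str            # what the agent was trying to accomplish
    failed_action: str   # the action that failed
    error: str           # what went wrong
    attempts: int        # how many times the agent tried
    suggested_questions: list = field(default_factory=list)

def build_escalation(task_id, goal, failed_action, error, attempts, questions=None):
    """Package failure context as a plain dict for the human review queue."""
    return asdict(EscalationTicket(
        task_id, goal, failed_action, error, attempts, questions or []
    ))
```

A reviewer receiving this payload can act immediately instead of reconstructing the failure from raw logs.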
Error Classification and Response Matrix
| Error Type | Severity | Recommended Response |
|---|---|---|
| Network timeout | Low | Retry with backoff (max 3x) |
| API rate limit | Medium | Retry with exponential backoff + jitter |
| Auth failure | High | Alert operator, pause task |
| Invalid LLM output | Low | Retry with corrective prompt |
| Context length exceeded | Medium | Truncate input or summarize context |
| Tool not found | High | Escalate, agent configuration issue |
| Goal-ambiguous instruction | Medium | Request clarification or escalate |
| Repeated step failure | High | Escalate to human review |
Error Recovery in Multi-Agent Systems
In multi-agent architectures, errors compound. An orchestrator agent must handle:
- Subagent failures (a delegated task failed)
- Partial completion (some subtasks succeeded, others failed)
- Inconsistent state across agents (different agents have different views of the world)
Recovery patterns for multi-agent systems:
- Compensating transactions: When a downstream step fails, execute rollback actions to undo earlier steps
- Saga pattern: Define compensating actions upfront for each step so any failure can be cleanly unwound
- Circuit breaker: If a subagent fails repeatedly, stop delegating to it and route work elsewhere
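A minimal circuit breaker for subagent delegation might look like the sketch below; the threshold, cooldown, and half-open probe behavior are illustrative choices, not a fixed recipe:

```python
import time

class CircuitBreaker:
    """Stop delegating to a subagent after repeated failures; probe again after a cooldown."""

    def __init__(self, failure_threshold: int = 3, cooldown: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow a probe request once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

The orchestrator keeps one breaker per subagent, checks `allow()` before delegating, and routes work elsewhere while a breaker is open.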
Observability and Error Learning
Error recovery is most effective when paired with strong observability. Logging every error with full context — what tool was called, what the arguments were, what error was returned, what recovery action was taken — enables:
- Root cause analysis: Identifying which errors are most frequent and most impactful
- Recovery optimization: Measuring whether retry attempts succeed or consistently fail
- Proactive prevention: Catching error patterns before they become critical incidents
- Continuous improvement: Using error data to improve tool implementations, prompts, and agent logic
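A sketch of structured error logging along these lines — the field names are illustrative, and production systems would typically route these records to a tracing backend rather than a plain logger:

```python
import json
import logging
import time

logger = logging.getLogger("agent.errors")

def log_tool_error(tool_name, args, error, recovery_action):
    """Emit one structured record per failure so errors can be aggregated later."""
    record = {
        "ts": time.time(),
        "tool": tool_name,
        "args": args,
        "error": repr(error),
        "recovery": recovery_action,
    }
    logger.error(json.dumps(record))
    return record
```

Because every record shares the same fields, counting the most frequent `(tool, error)` pairs or measuring retry success rates becomes a simple aggregation query.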
See Agent Tracing for implementation guidance.
Related Terms
- Agent Observability — Monitoring agent behavior in production
- Agent Tracing — Recording execution for debugging
- Human-in-the-Loop — Human oversight and intervention patterns
- Agent Deployment Patterns — Infrastructure approaches for production agents
- Best AI Agent Deployment Platforms — Production hosting options for resilient agents
- AI Agent Tutorials — Step-by-step guides including error handling patterns
Frequently Asked Questions
What is the most important aspect of AI agent error recovery? Failing explicitly and visibly. Agents that silently continue after errors produce incorrect, unpredictable results. An agent that clearly reports what failed and why is far easier to debug and improve than one that masks failures.
How do I prevent agents from getting stuck in error loops? Cap retries at a defined maximum (2–3 for most cases), track retry count in state, and after exhausting retries, either fall back to an alternative approach or escalate explicitly. Never allow unbounded retry loops.
Should agents always retry when a tool fails? Only retry idempotent operations — actions that can safely be repeated without side effects. Creating records, sending messages, or making payments should not be retried without checking whether the original attempt succeeded. Database inserts and API calls that produce records require deduplication logic before retrying.
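The deduplication idea can be sketched with an idempotency key: hash the task and payload, and replay the stored result instead of re-sending. Names here are illustrative, and a real system would persist the key store rather than keep it in memory:

```python
import hashlib
import json

_processed = {}  # in-memory stand-in for a persistent idempotency store

def make_key(task_id: str, payload: dict) -> str:
    """Derive a stable key from the task and a canonical form of the payload."""
    canonical = f"{task_id}:{json.dumps(payload, sort_keys=True)}"
    return hashlib.sha256(canonical.encode()).hexdigest()

def send_once(idempotency_key: str, payload: dict, send):
    """Deduplicate a side-effecting call: replay the stored result if already sent."""
    if idempotency_key in _processed:
        return _processed[idempotency_key]
    result = send(payload)
    _processed[idempotency_key] = result
    return result
```

With this in place, a retry after an ambiguous failure re-invokes `send_once` with the same key and cannot send the message or create the record twice.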
How do I test error recovery logic? Deliberately inject failures in your test environment: mock APIs to return errors, impose artificial rate limits, generate malformed LLM outputs. Test that retries behave correctly, fallbacks trigger, and escalation paths work as expected. Chaos testing for agents follows the same principles as chaos engineering for distributed systems.