What Is Agent Error Recovery?
Agent error recovery encompasses all the mechanisms that enable AI agents to handle failures gracefully — detecting when something goes wrong, deciding how to respond, and either resolving the problem autonomously or escalating to human intervention. In production environments, errors are not exceptional events. They are routine. APIs time out, models produce unparseable output, authentication tokens expire, and external services become temporarily unavailable. Agents that lack robust error recovery become unreliable and unpredictable, requiring constant human supervision to prevent cascading failures.
Error recovery is what separates a research prototype from a production-ready agent. Building it well requires thinking systematically about what can fail, how failure manifests, and what the appropriate response is in each case.
For related topics, see Agent Observability, Agent Deployment Patterns, and Human-in-the-Loop AI. Browse all AI agent concepts in the glossary or see practical error handling in deployment tutorials.
Types of Errors AI Agents Encounter
Tool and API Failures
External dependencies fail. Common patterns:
- Network timeouts: The tool call starts but doesn't complete within the expected time
- Rate limit errors: The API returns 429 Too Many Requests because the agent is calling too frequently
- Authentication failures: API keys expire, tokens are revoked, or credential rotation fails
- Service unavailability: Third-party APIs return 503 Service Unavailable during outages
- Invalid responses: The API returns a response the agent's parser can't process
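A useful first step in handling these failures is separating retryable errors from permanent ones. As a minimal sketch — the exception classes and status-code mapping here are illustrative, not from any particular framework:

```python
class TransientToolError(Exception):
    """Retryable failure: timeout, rate limit, temporary outage."""

class PermanentToolError(Exception):
    """Non-retryable failure: bad credentials, invalid request."""

def classify_http_failure(status_code: int) -> type[Exception]:
    # 429 and 5xx responses are usually transient; auth and client
    # errors (401, 403, 400) need a fix, not a retry.
    if status_code == 429 or status_code >= 500:
        return TransientToolError
    return PermanentToolError
```

Routing each failure through a classifier like this lets the retry logic later in this article treat transient and permanent errors differently.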
LLM Errors
The language model itself can fail in several ways:
- Context length exceeded: The input is too long for the model's context window
- Rate limits: The LLM provider throttles requests
- Refusals: The model declines to perform a requested action for safety reasons
- Malformed output: The model produces output that doesn't match the expected format (JSON parse errors, missing required fields)
- Hallucinated tool calls: The model calls tools that don't exist or with incorrect arguments
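Hallucinated tool calls in particular can be caught before execution by checking the model's proposed call against a registry of known tools. A small sketch — the registry contents and function name are hypothetical:

```python
# Map each known tool to the argument names it accepts.
TOOL_REGISTRY = {
    "search_web": {"query"},
    "get_weather": {"city"},
}

def validate_tool_call(name: str, args: dict) -> list:
    """Return a list of problems; an empty list means the call looks valid."""
    problems = []
    if name not in TOOL_REGISTRY:
        problems.append(f"unknown tool: {name}")
    else:
        unexpected = set(args) - TOOL_REGISTRY[name]
        if unexpected:
            problems.append(f"unexpected arguments: {sorted(unexpected)}")
    return problems
```

Rejecting an invalid call here — and feeding the problem list back to the model — is far cheaper than executing a bad call and recovering afterward.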
Logical Errors
The agent takes incorrect steps:
- Wrong tool selected: The agent chooses an inappropriate tool for the task
- Invalid arguments: Tool is called with arguments that fail validation
- Incorrect sequencing: Steps are executed in the wrong order, producing invalid state
- Goal drift: The agent pursues a subtask in a way that undermines the main goal
State Errors
Problems with agent state management:
- Corrupted intermediate results: An earlier step produced output that later steps can't use
- Lost context: Critical context about the task goal is lost or misinterpreted
- Circular loops: The agent repeatedly attempts the same failing action
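Circular loops can be detected by counting repeated action signatures. A minimal sketch, assuming the agent reports each action it takes — the `LoopDetector` name and threshold are illustrative:

```python
from collections import Counter

class LoopDetector:
    """Flag when the agent repeats the same (tool, args) action too often."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.counts = Counter()

    def record(self, tool_name: str, args: tuple) -> bool:
        """Record an action; return True once it hits the repeat limit."""
        key = (tool_name, args)
        self.counts[key] += 1
        return self.counts[key] >= self.max_repeats
```

When `record` returns True, the agent should break out of its plan and either try an alternative approach or escalate, rather than attempt the same failing action again.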
Core Error Recovery Patterns
1. Retry with Exponential Backoff
For transient failures (network timeouts, rate limits), retry the same action after an increasing delay:
```python
import time

# RateLimitError and MaxRetriesExceeded are assumed to be defined elsewhere.
def call_tool_with_retry(tool, args, max_retries=3):
    for attempt in range(max_retries):
        try:
            return tool(args)
        except RateLimitError:
            wait_time = 2 ** attempt  # 1, 2, 4 seconds
            time.sleep(wait_time)
        except TimeoutError:
            if attempt == max_retries - 1:
                raise
            time.sleep(1)
    raise MaxRetriesExceeded(f"Tool failed after {max_retries} attempts")
```
Key principles:
- Exponential backoff avoids hammering a rate-limited service
- Cap retries — don't retry indefinitely
- Only retry idempotent operations (safe to repeat without side effects)
- Log each retry for debugging
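The classification matrix later in this article recommends adding jitter to the backoff for rate limits, which spreads out retries from concurrent agents so they don't all hit the service at the same instant. A minimal full-jitter delay helper, as a sketch:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Use it in place of the fixed `2 ** attempt` sleep: `time.sleep(backoff_delay(attempt))`.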
2. Fallback Strategies
When the primary approach fails, switch to an alternative:
Alternative tool: If search API A fails, try search API B.
Degraded mode: If real-time data is unavailable, use cached or estimated data and signal uncertainty in the response.
Simplified approach: If a complex multi-step workflow fails, fall back to a simpler single-step approximation.
Human escalation: If all automated approaches fail, route to a human reviewer.
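These alternatives can be chained: try each strategy in order, and escalate with the accumulated errors only if all of them fail. A minimal sketch (function names are illustrative):

```python
def run_with_fallbacks(task, strategies, escalate):
    """Try each strategy in order; escalate with collected errors if all fail."""
    errors = []
    for strategy in strategies:
        try:
            return strategy(task)
        except Exception as e:
            errors.append((strategy.__name__, str(e)))
    # Every automated approach failed: hand off with the full error history.
    return escalate(task, errors)
```

Passing the collected errors into the escalation handler matters — the human reviewer sees what was already tried, not just the final failure.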
3. Output Validation and Correction
Validate LLM outputs before using them:
```python
import json

# llm, ValidationError, and OutputValidationFailed are assumed to be
# provided by the surrounding application.
def get_structured_output(prompt, expected_schema):
    for attempt in range(3):
        output = llm.generate(prompt)
        try:
            parsed = json.loads(output)
            validated = expected_schema.validate(parsed)
            return validated
        except (json.JSONDecodeError, ValidationError) as e:
            # Feed the error back so the model can self-correct.
            prompt += (
                f"\n\nYour previous response was invalid: {e}. "
                "Please try again and ensure your response is valid JSON "
                "matching the required schema."
            )
    raise OutputValidationFailed("Could not get valid output after 3 attempts")
```
Including the error message in the retry prompt often helps the model self-correct.
4. Checkpointing and Resumption
For long-running tasks, save intermediate state so failed tasks can resume from the last successful step rather than starting over:
```python
# checkpoint_store is an assumed persistence interface (Redis, database, file).
class AgentTask:
    def __init__(self, task_id, steps):
        self.task_id = task_id
        self.steps = steps
        self.completed_steps = self.load_checkpoint()

    def execute(self):
        for step in self.steps:
            if step.name in self.completed_steps:
                continue  # Skip already-completed steps
            result = step.execute()
            self.save_checkpoint(step.name, result)

    def load_checkpoint(self):
        # Return the set of step names already persisted for this task
        return checkpoint_store.get_completed(self.task_id) or set()

    def save_checkpoint(self, step_name, result):
        # Persist to Redis, a database, or a file
        checkpoint_store.set(f"{self.task_id}:{step_name}", result)
```
Checkpointing is essential for agents running tasks that take minutes or hours — a failure at step 8 of 10 shouldn't require rerunning steps 1–7.
5. Human Escalation
Define clear escalation conditions and routing:
- Confidence threshold: Escalate when the agent's confidence in its planned action falls below a threshold
- Repeated failures: Escalate after N failed attempts at a task
- Scope violation: Escalate when the task requires actions outside the agent's authorized scope
- Ambiguity resolution: Escalate when the task description is too ambiguous to proceed safely
Escalation should include context — what the agent was trying to do, what went wrong, and what information would help a human resolve it.
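That context can be captured in a structured escalation payload so every handoff carries the same fields. A hypothetical sketch using a dataclass — the field names are illustrative, not a standard:

```python
from dataclasses import asdict, dataclass, field

@dataclass
class EscalationTicket:
    task_id: str
    goal: str            # what the agent was trying to accomplish
    failed_action: str   # the action that failed
    error: str           # what went wrong
    attempts: int        # how many times the agent tried
    suggested_questions: list = field(default_factory=list)

def build_escalation(task_id, goal, failed_action, error, attempts, questions=None):
    """Package failure context as a plain dict for the human review queue."""
    return asdict(EscalationTicket(
        task_id, goal, failed_action, error, attempts, questions or []
    ))
```

A reviewer receiving this payload can act immediately instead of reconstructing the failure from raw logs.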
Error Classification and Response Matrix
| Error Type | Severity | Recommended Response |
|---|---|---|
| Network timeout | Low | Retry with backoff (max 3x) |
| API rate limit | Medium | Retry with exponential backoff + jitter |
| Auth failure | High | Alert operator, pause task |
| Invalid LLM output | Low | Retry with corrective prompt |
| Context length exceeded | Medium | Truncate input or summarize context |
| Tool not found | High | Escalate, agent configuration issue |
| Goal-ambiguous instruction | Medium | Request clarification or escalate |
| Repeated step failure | High | Escalate to human review |
Error Recovery in Multi-Agent Systems
In multi-agent architectures, errors compound. An orchestrator agent must handle:
- Subagent failures (a delegated task failed)
- Partial completion (some subtasks succeeded, others failed)
- Inconsistent state across agents (different agents have different views of the world)
Recovery patterns for multi-agent systems:
- Compensating transactions: When a downstream step fails, execute rollback actions to undo earlier steps
- Saga pattern: Define compensating actions upfront for each step so any failure can be cleanly unwound
- Circuit breaker: If a subagent fails repeatedly, stop delegating to it and route work elsewhere
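A minimal circuit breaker for subagent delegation might look like the sketch below; the threshold, cooldown, and half-open probe behavior are illustrative choices, not a fixed recipe:

```python
import time

class CircuitBreaker:
    """Stop delegating to a subagent after repeated failures; probe again after a cooldown."""

    def __init__(self, failure_threshold: int = 3, cooldown: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow a probe request once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

The orchestrator keeps one breaker per subagent, checks `allow()` before delegating, and routes work elsewhere while a breaker is open.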
Observability and Error Learning
Error recovery is most effective when paired with strong observability. Logging every error with full context — what tool was called, what the arguments were, what error was returned, what recovery action was taken — enables:
- Root cause analysis: Identifying which errors are most frequent and most impactful
- Recovery optimization: Measuring whether retry attempts succeed or consistently fail
- Proactive prevention: Catching error patterns before they become critical incidents
- Continuous improvement: Using error data to improve tool implementations, prompts, and agent logic
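A sketch of structured error logging along these lines — the field names are illustrative, and production systems would typically route these records to a tracing backend rather than a plain logger:

```python
import json
import logging
import time

logger = logging.getLogger("agent.errors")

def log_tool_error(tool_name, args, error, recovery_action):
    """Emit one structured record per failure so errors can be aggregated later."""
    record = {
        "ts": time.time(),
        "tool": tool_name,
        "args": args,
        "error": repr(error),
        "recovery": recovery_action,
    }
    logger.error(json.dumps(record))
    return record
```

Because every record shares the same fields, counting the most frequent `(tool, error)` pairs or measuring retry success rates becomes a simple aggregation query.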
See Agent Tracing for implementation guidance.
Related Terms
- Agent Observability — Monitoring agent behavior in production
- Agent Tracing — Recording execution for debugging
- Human-in-the-Loop — Human oversight and intervention patterns
- Agent Deployment Patterns — Infrastructure approaches for production agents
- Best AI Agent Deployment Platforms — Production hosting options for resilient agents
- AI Agent Tutorials — Step-by-step guides including error handling patterns
Frequently Asked Questions
What is the most important aspect of AI agent error recovery? Failing explicitly and visibly. Agents that silently continue after errors produce incorrect, unpredictable results. An agent that clearly reports what failed and why is far easier to debug and improve than one that masks failures.
How do I prevent agents from getting stuck in error loops? Cap retries at a defined maximum (2–3 for most cases), track retry count in state, and after exhausting retries, either fall back to an alternative approach or escalate explicitly. Never allow unbounded retry loops.
Should agents always retry when a tool fails? Only retry idempotent operations — actions that can safely be repeated without side effects. Creating records, sending messages, or making payments should not be retried without checking whether the original attempt succeeded. Database inserts and API calls that produce records require deduplication logic before retrying.
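The deduplication idea can be sketched with an idempotency key: hash the task and payload, and replay the stored result instead of re-sending. Names here are illustrative, and a real system would persist the key store rather than keep it in memory:

```python
import hashlib
import json

_processed = {}  # in-memory stand-in for a persistent idempotency store

def make_key(task_id: str, payload: dict) -> str:
    """Derive a stable key from the task and a canonical form of the payload."""
    canonical = f"{task_id}:{json.dumps(payload, sort_keys=True)}"
    return hashlib.sha256(canonical.encode()).hexdigest()

def send_once(idempotency_key: str, payload: dict, send):
    """Deduplicate a side-effecting call: replay the stored result if already sent."""
    if idempotency_key in _processed:
        return _processed[idempotency_key]
    result = send(payload)
    _processed[idempotency_key] = result
    return result
```

With this in place, a retry after an ambiguous failure re-invokes `send_once` with the same key and cannot send the message or create the record twice.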
How do I test error recovery logic? Deliberately inject failures in your test environment: mock APIs to return errors, impose artificial rate limits, generate malformed LLM outputs. Test that retries behave correctly, fallbacks trigger, and escalation paths work as expected. Chaos testing for agents follows the same principles as chaos engineering for distributed systems.