Introduction#
Testing AI agents is fundamentally different from testing conventional software. Traditional tests are deterministic: given the same input, you always get the same output. AI agents are probabilistic by nature — the same prompt can produce different responses across runs, and the correct behavior is often a matter of degree rather than a binary pass or fail.
This guide introduces a structured testing approach adapted for AI agents. It covers the full testing pyramid — unit tests for individual tools, integration tests for agent workflows, and behavioral evals for end-to-end quality — along with practical guidance on mocking LLM calls, running tests in CI/CD, and detecting regressions over time.
Before diving into testing, make sure your agent is properly structured. Explore our tutorials index or read how to build an AI agent from scratch if you are starting from zero.
Why Agent Testing Requires a Different Mental Model#
With traditional software, a test failure means the code is wrong. With agents, a test failure can mean several different things:
- The tool implementation has a bug (deterministic, fixable)
- The agent's reasoning is flawed (probabilistic, prompt-dependent)
- The LLM produced an unusual output this run (variance, not a bug)
- Your test expectation is too strict (calibration issue)
- The behavior genuinely regressed after a model or prompt change (real regression)
Effective agent testing distinguishes between these categories rather than treating every failure identically. Each calls for a different response: fix the code, update the prompt, adjust the threshold, or tighten the eval criteria.
For context on what metrics to track beyond binary pass/fail, see the agent evaluation guide.
Prerequisites#
- A working AI agent with at least one tool defined
- Basic familiarity with your language's testing framework (pytest, Jest, etc.)
- Access to your LLM provider's API for live tests, or a mock library for unit tests
- Understanding of the agent loop and how your agent processes steps
Step 1: Unit Test Individual Tools First#
The most reliable tests you can write for an agent are unit tests for its tools. Tools are ordinary functions — they have defined inputs, defined outputs, and deterministic behavior. They can be tested just like any other code.
What to Test in Tool Unit Tests#
For each tool, write tests that cover:
- Happy path: Given valid inputs, does the tool return the expected output?
- Input validation: Does the tool reject invalid inputs with clear error messages?
- Edge cases: Empty strings, null values, maximum lengths, special characters.
- Error handling: Does the tool handle downstream failures (API timeouts, 404s) gracefully?
- Output schema: Does the tool always return the expected schema, even on partial failures?
```python
import pytest

# Example: unit tests for a search tool

def test_search_tool_returns_results():
    results = search_tool(query="AI agents")
    assert len(results) > 0
    assert all("title" in r and "url" in r for r in results)

def test_search_tool_rejects_empty_query():
    with pytest.raises(ValueError, match="query cannot be empty"):
        search_tool(query="")

def test_search_tool_handles_api_timeout(mock_http_client):
    mock_http_client.side_effect = TimeoutError()
    result = search_tool(query="test")
    assert result == []  # graceful fallback, not an exception
```
Tools should be written to be easily unit testable — pure functions with injectable dependencies wherever possible. If your tool makes HTTP calls, the HTTP client should be injectable so tests can mock it without hitting real APIs.
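To make the injectable-dependency idea concrete, here is a minimal sketch of how the `search_tool` from the tests above might be structured. The function signature, response fields, and endpoint URL are illustrative assumptions, not a specific library's API:

```python
import json
from urllib import parse, request

def search_tool(query, http_get=None):
    """Hypothetical search tool with an injectable HTTP dependency.

    Tests pass a stub `http_get`; production code uses the default client.
    """
    if not query:
        raise ValueError("query cannot be empty")
    http_get = http_get or _default_http_get
    try:
        payload = http_get(query)
    except TimeoutError:
        return []  # graceful fallback instead of propagating the exception
    # Normalize to the schema the tests assert on: title + url per result
    return [{"title": r["title"], "url": r["url"]} for r in payload]

def _default_http_get(query):
    url = "https://example.com/search?q=" + parse.quote(query)
    with request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read())
```

Because the HTTP call is a parameter rather than a hardcoded dependency, every test path — happy path, empty query, timeout — runs without touching the network.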
Step 2: Integration Test Agent Workflows With Mocked LLM Calls#
Once your tools are unit tested, the next layer is integration testing: verifying that your agent correctly orchestrates tool calls for a given task. The key technique here is mocking the LLM.
Why Mock the LLM?#
Running tests against a live LLM introduces three problems:
- Cost: Every LLM call costs money, and repeated CI runs across even a small suite add up quickly.
- Speed: LLM calls add seconds to each test, making CI slow.
- Variance: The LLM may return different tool call sequences across runs, making tests flaky.
For integration tests, you want to test your agent's orchestration logic, not the LLM's intelligence. Mock the LLM to return specific, deterministic tool call sequences and verify that your agent executes them correctly.

Mocking Patterns#
Most frameworks provide mock utilities. In LangChain, you can replace the LLM with a FakeLLM that returns scripted responses. In AutoGen (see our LangChain vs AutoGen comparison for context on each), you can intercept the completion call.
```python
# Pseudocode: integration test with mocked LLM

def test_research_agent_calls_search_then_summarize():
    mock_llm = FakeLLM(responses=[
        # First call: LLM decides to search
        ToolCallResponse(tool="search", args={"query": "AI agent testing"}),
        # Second call: LLM summarizes results
        TextResponse("AI agent testing involves unit tests, integration tests..."),
    ])
    agent = ResearchAgent(llm=mock_llm, tools=[search_tool, summarize_tool])
    result = agent.run("Research best practices for AI agent testing")

    assert mock_llm.call_count == 2
    assert "testing" in result.lower()
```
Integration tests with mocked LLMs are fast, cheap, and deterministic. They catch bugs in your orchestration logic, tool calling syntax, and output parsing — without depending on LLM variability.
Step 3: Write Deterministic Behavioral Tests#
Some agent behaviors can be tested deterministically even with a live LLM, by focusing on structural properties of the output rather than exact content.
Structural Output Tests#
If your agent produces structured output, test the structure, not the content:
- Does the response always include the required fields?
- Are numeric fields within valid ranges?
- Does the response schema match the expected type definitions?
- Does the agent always produce valid JSON when asked to?
These tests pass or fail reliably regardless of LLM variance. They catch regressions in output format that indicate something broke in your output parsing or prompt.
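A structural check can be packaged as a single validator that your tests call on raw agent output. The field names below (`summary`, `confidence`, `sources`) are hypothetical placeholders for whatever schema your agent actually produces:

```python
import json

REQUIRED_FIELDS = {"summary", "confidence", "sources"}

def validate_agent_output(raw):
    """Structural checks on raw agent output; returns a list of violations.

    An empty list means the output passed every structural test.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]

    violations = []
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        violations.append(f"missing fields: {sorted(missing)}")

    conf = data.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        violations.append("confidence must be a number in [0, 1]")

    if not isinstance(data.get("sources"), list):
        violations.append("sources must be a list")
    return violations
```

A test then becomes a one-liner: `assert validate_agent_output(result) == []`, and the violation list doubles as a readable failure message.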
Behavioral Invariants#
Some properties should always hold regardless of input:
- The agent should never make more than N tool calls on a single task (loop detection).
- The agent should always return within a timeout.
- The agent should never call a destructive tool without first calling a read tool.
- The agent's response should always be in the requested language.
Write tests that assert these invariants hold across a random sample of your input space. Run them in CI to catch regressions.
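One way to implement invariant checks is to run them over a recorded agent trace. The trace format below (a list of step dicts with `type` and `tool` keys) and the tool names are assumptions for illustration; adapt them to whatever your agent loop actually logs:

```python
MAX_TOOL_CALLS = 10          # loop-detection ceiling (tune per agent)
READ_TOOLS = {"search", "read_file"}          # hypothetical tool names
DESTRUCTIVE_TOOLS = {"delete_file", "send_email"}

def check_invariants(trace):
    """Assert behavioral invariants over a recorded agent trace.

    `trace` is a list of step dicts like {"type": "tool_call", "tool": "search"}.
    Raises AssertionError on the first violated invariant.
    """
    tool_calls = [s for s in trace if s["type"] == "tool_call"]
    assert len(tool_calls) <= MAX_TOOL_CALLS, "possible tool-call loop"

    # Destructive tools must be preceded by at least one read tool
    seen_read = False
    for step in tool_calls:
        if step["tool"] in READ_TOOLS:
            seen_read = True
        if step["tool"] in DESTRUCTIVE_TOOLS and not seen_read:
            raise AssertionError(f"{step['tool']} called before any read tool")
```

In CI, run `check_invariants` over traces produced from a random sample of inputs; any raised assertion is a regression regardless of what the LLM said.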
Step 4: Build Your Behavioral Eval Suite#
Beyond deterministic tests, you need behavioral evals: a labeled dataset of tasks and expected outcomes that you run against a live LLM to measure real-world quality.
Anatomy of a Behavioral Eval#
Each eval example has:
- Input: The task or user message
- Expected behavior: What the agent should do (described in rubric form, not exact text)
- Scorer: How to measure whether the actual output matches expected behavior
```yaml
# Example eval entry
- input: "Summarize the main risks in this document: [document]"
  expected_behavior:
    - identifies at least 3 distinct risks
    - each risk includes a description and potential impact
    - no hallucinated risks not present in the document
  scorer: llm_judge
  scorer_model: gpt-4o
```
Calibrating Pass/Fail Thresholds#
Because evals involve probabilistic scoring, you need thresholds. A common approach: run each eval example 3 times and take the majority outcome. Set your overall pass threshold at a completion rate that your business requires — for most production agents, 85% is a reasonable starting floor.
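The majority-vote and threshold logic is small enough to sketch directly. This assumes each eval run is reduced to a boolean pass/fail by your scorer; the function names are illustrative:

```python
from collections import Counter

def majority_outcome(outcomes):
    """Majority vote over repeated runs of one eval example.

    outcomes: list of bools, e.g. [True, False, True] from 3 runs.
    """
    return Counter(outcomes).most_common(1)[0][0]

def suite_passes(per_example_runs, threshold=0.85):
    """per_example_runs: one list of booleans per eval example.

    An example counts as passing if the majority of its runs passed;
    the suite passes if the pass rate meets the threshold.
    """
    passed = sum(majority_outcome(runs) for runs in per_example_runs)
    return passed / len(per_example_runs) >= threshold
```

With 3 runs per example, a single unlucky LLM response cannot flip an example's outcome, which is exactly the flakiness the majority vote is there to absorb.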
See the agent evaluation guide for a full treatment of metric selection and threshold setting.
Step 5: Test for Failure Modes Specifically#
The most valuable tests are often the ones that probe failure modes rather than the happy path. For AI agents, the most important failure modes to test are:
Hallucination resistance: Give the agent a task that requires information not in its context. Does it appropriately say it does not know, or does it fabricate an answer?
Tool call error recovery: Simulate a tool returning an error. Does the agent handle it gracefully, retry appropriately, or loop indefinitely?
Context window handling: Give the agent a task with very long inputs that approach your model's context limit. Does it truncate gracefully or break?
Adversarial inputs: Test inputs designed to confuse or manipulate the agent (basic prompt injection attempts). Does the agent stay on task?
Multi-step coherence: For tasks requiring many steps, does the agent maintain coherent state across the full sequence?
Write explicit tests for each of these. They will catch problems that happy path testing misses entirely.
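For tool call error recovery in particular, the behavior under test is "retry a bounded number of times, then surface the failure — never loop forever." A minimal sketch of that retry wrapper, with illustrative names and a simplified error shape, looks like this:

```python
import time

def call_tool_with_retry(tool, args, max_attempts=3, backoff=0.0):
    """Bounded retry around a tool call.

    Retries transient errors up to max_attempts, then returns an error
    payload instead of raising, so the agent loop can report the failure
    rather than spin indefinitely. The {"error": ...} shape is illustrative.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return tool(**args)
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            time.sleep(backoff * attempt)  # simple linear backoff
    return {"error": str(last_error)}
```

A failure-mode test then scripts a tool that fails once before succeeding (should recover) and a tool that always fails (should return the error payload after exactly `max_attempts` tries).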
Step 6: Integrate Testing Into CI/CD#
Tests only catch regressions if they run automatically on every change. Build a tiered CI pipeline:
On every pull request (fast, cheap):
- All tool unit tests
- Integration tests with mocked LLM
- Structural output tests
On merge to main (slower, more thorough):
- Full behavioral eval suite against live LLM
- Failure mode tests
- Performance and latency checks
Weekly scheduled run:
- Regression test against the previous week's baseline
- Compare metric trends over time
Set explicit failure thresholds in your CI configuration. If task completion rate drops by more than 2 percentage points from the last passing run on main, the build should fail.
For multi-agent systems that multiply the testing complexity, see the LangGraph multi-agent tutorial.
Common Mistakes in AI Agent Testing#
Only testing the happy path. Your users will find every edge case. Write failure mode tests before they do.
Using a live LLM for every test. This makes your test suite slow, expensive, and flaky. Reserve live LLM calls for behavioral evals and run everything else with mocked responses.
Setting expectations on exact output text. Exact match tests for natural language outputs will break on every model or prompt update. Test structural properties and semantic correctness instead.
No regression tracking. Running evals once before deployment and never again means you will miss gradual quality drift. Trend tracking is as important as point-in-time measurement.
Testing the LLM instead of your agent. You do not need to test whether GPT-4 can answer a question correctly — that is the provider's job. Test that your agent's tools, orchestration logic, and output handling work as designed.
Not testing tool failure paths. Most developers test what happens when tools succeed. The interesting bugs are in what happens when they fail.
Best Practices for AI Agent Testing#
- Write tool unit tests first — they are the most reliable tests you will have.
- Mock the LLM for integration tests to make them fast, cheap, and deterministic.
- Build a labeled eval dataset from real production failures, not just synthetic examples.
- Test structural properties and invariants, not exact text outputs.
- Run a tiered CI pipeline: fast deterministic tests on every PR, full evals on merge.
- Track eval metrics over time to detect gradual drift, not just sudden regressions.
- Include explicit failure mode tests for hallucination, error recovery, and adversarial inputs.
Conclusion#
Testing AI agents is harder than testing conventional software, but the same engineering discipline applies: test at multiple levels, automate everything you can, and build systems that tell you when something breaks before your users do.
Start with tool unit tests — they are deterministic and give you the most confidence per hour of investment. Layer in integration tests with mocked LLMs for orchestration coverage. Then add behavioral evals for live quality measurement. Wire everything into CI/CD so regressions surface automatically.
Explore the tutorials index for related guides on deployment, evaluation, and security.
Frequently Asked Questions#
Should I test with a real LLM or a mocked LLM?
Use both, but for different test types. Unit tests and integration tests should use mocked LLM responses — this keeps them fast, cheap, and deterministic. Behavioral evals should use a real LLM because you are measuring actual quality, not just orchestration logic. Never use a live LLM for tests that can be made deterministic with mocking.
How do I handle flaky tests caused by LLM variance?
For deterministic tests using mocked LLMs, flakiness indicates a bug in your code, not LLM variance — fix the code. For behavioral evals using live LLMs, run each example multiple times and score the majority outcome. If an eval is consistently flaky (fails 40-60% of the time), the behavior is genuinely ambiguous and you need to clarify the expected outcome.
What is the minimum test coverage I need before deploying an agent to production?
At minimum: 100% tool unit test coverage for happy paths and common error cases, integration tests for every major workflow the agent handles, and a behavioral eval set of at least 50 labeled examples covering your core use cases. This is the floor, not the ceiling.
How do I test multi-agent systems where agents call each other?
Test each agent independently first, treating calls to other agents as tool calls you can mock. Then write integration tests for specific inter-agent workflows with both agents live. Finally, run end-to-end behavioral evals on the full system. The testing complexity compounds with each agent you add, so invest in good mocking infrastructure early.