# AI Agent Evaluation Metrics: How to Measure Agent Performance
You cannot improve what you do not measure. AI agents that perform well in development frequently degrade in production — model updates change behavior subtly, edge cases surface, and prompts that worked in testing break on real user input. Without a systematic evaluation framework, you are flying blind.
This tutorial builds a complete AI agent evaluation system from scratch. We cover the five core metrics, build Python implementations for each, then assemble a full evaluation pipeline with LangSmith and wire it into CI/CD.
Prerequisites: Python 3.11+, a working AI agent to evaluate, basic understanding of pytest.
Related: Agent Tracing Glossary | Best AI Agent Testing Tools | AI Agent Observability with Langfuse
## The Five Core Agent Evaluation Metrics

### Metric 1: Task Completion Rate

**Definition:** The percentage of tasks the agent successfully completes end-to-end without human intervention.

**Why it matters:** This is the most fundamental measure of agent capability. An agent with 70% task completion is failing 3 in 10 users, which is a serious reliability problem in production.
```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional
from datetime import datetime
import time


@dataclass
class AgentRun:
    """Record of a single agent task execution."""
    run_id: str
    input: str
    output: Optional[str]
    expected_output: Optional[str]
    tool_calls: List[Dict]
    completed: bool
    error: Optional[str]
    latency_ms: float
    total_tokens: int
    cost_usd: float
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())


def measure_task_completion(
    agent_fn: Callable,
    test_cases: List[Dict],
    completion_validator: Callable,
) -> Dict[str, Any]:
    """
    Measure task completion rate across a set of test cases.

    Args:
        agent_fn: Function that runs the agent and returns its output.
        test_cases: List of {'input': str, 'expected': str} dicts.
        completion_validator: Function(output, expected) -> bool.
    """
    if not test_cases:
        raise ValueError("test_cases must not be empty")

    results = []
    for case in test_cases:
        start_time = time.time()
        try:
            output = agent_fn(case["input"])
            completed = completion_validator(output, case.get("expected"))
            error = None
        except Exception as e:
            output = None
            completed = False
            error = str(e)
        latency = (time.time() - start_time) * 1000
        results.append({
            "input": case["input"],
            "output": output,
            "completed": completed,
            "error": error,
            "latency_ms": latency,
        })

    total = len(results)
    completed_count = sum(1 for r in results if r["completed"])
    error_count = sum(1 for r in results if r["error"])
    return {
        "task_completion_rate": completed_count / total,
        "error_rate": error_count / total,
        "total_tasks": total,
        "completed_tasks": completed_count,
        "failed_tasks": total - completed_count,
        "avg_latency_ms": sum(r["latency_ms"] for r in results) / total,
        "results": results,
    }


# Example completion validator
def validate_answer_completeness(output: Optional[str], expected: Optional[str]) -> bool:
    """Check whether the output contains the key terms of the expected answer."""
    if not output or not expected:
        return False
    # Use the first five words of the expected answer as key terms,
    # skipping short stop-word-like tokens
    expected_terms = expected.lower().split()[:5]
    return all(term in output.lower() for term in expected_terms if len(term) > 3)
```
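To make the numbers concrete, here is a minimal, self-contained sketch of the completion-rate computation. The stub agent, questions, and canned answers are all invented for illustration so the result is deterministic:

```python
# Hypothetical stub agent with canned answers (illustrative only)
def stub_agent(task: str) -> str:
    answers = {
        "capital of France?": "The capital of France is Paris.",
        "largest planet?": "Jupiter is the largest planet.",
        "speed of light?": "",  # simulates a run that produced no answer
    }
    return answers[task]


def contains_expected(output: str, expected: str) -> bool:
    """Toy completion validator: the expected answer must appear in the output."""
    return bool(output) and expected.lower() in output.lower()


cases = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "largest planet?", "expected": "Jupiter"},
    {"input": "speed of light?", "expected": "299,792 km/s"},
]

completed = sum(
    1 for c in cases
    if contains_expected(stub_agent(c["input"]), c["expected"])
)
rate = completed / len(cases)
print(f"task_completion_rate = {rate:.2f}")  # 2 of 3 cases pass -> 0.67
```

The same loop structure scales to `measure_task_completion` above; only the agent and validator change.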
### Metric 2: Tool Call Accuracy

**Definition:** Whether the agent selects the correct tool and provides well-formed, correct arguments.

**Why it matters:** An agent that calls the wrong tool or passes malformed arguments wastes API calls, produces wrong outputs, and can trigger side effects in connected systems.
```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ToolCallExpectation:
    """Expected tool call specification."""
    tool_name: str
    required_args: List[str]             # Arguments that must be present
    arg_validators: Dict[str, Callable]  # Arg name -> validation function
    must_be_called: bool = True          # Whether this tool must be called
    call_order: Optional[int] = None     # Expected position in call sequence


def evaluate_tool_call_accuracy(
    actual_tool_calls: List[Dict],
    expected_calls: List[ToolCallExpectation],
) -> Dict[str, Any]:
    """
    Evaluate the accuracy of an agent's tool-calling behavior.

    Args:
        actual_tool_calls: List of {'name': str, 'args': dict} from the agent run.
        expected_calls: List of ToolCallExpectation specifications.
    """
    metrics = {
        "tool_selection_accuracy": 0.0,  # Correct tool selected
        "argument_completeness": 0.0,    # Required args present
        "argument_validity": 0.0,        # Args pass validation
        "call_order_accuracy": 0.0,      # Calls in the right order
        "spurious_call_rate": 0.0,       # Unexpected tool calls
        "details": [],
    }

    actual_names = [c.get("name", "") for c in actual_tool_calls]
    expected_names = [e.tool_name for e in expected_calls if e.must_be_called]

    # Tool selection accuracy
    correct_selections = sum(1 for name in expected_names if name in actual_names)
    if expected_names:
        metrics["tool_selection_accuracy"] = correct_selections / len(expected_names)

    # Argument analysis for each expected call
    arg_scores = []
    validity_scores = []
    for expected in expected_calls:
        # Find the matching actual call (first occurrence)
        actual_call = next(
            (c for c in actual_tool_calls if c.get("name") == expected.tool_name),
            None,
        )
        call_detail = {
            "expected_tool": expected.tool_name,
            "called": actual_call is not None,
            "arg_completeness": 0.0,
            "arg_validity": 0.0,
            "issues": [],
        }

        if actual_call:
            actual_args = actual_call.get("args", {})

            # Check that required args are present
            present_count = sum(1 for arg in expected.required_args if arg in actual_args)
            completeness = (
                present_count / len(expected.required_args)
                if expected.required_args else 1.0
            )
            call_detail["arg_completeness"] = completeness
            arg_scores.append(completeness)

            # Check arg validity
            if expected.arg_validators:
                valid_count = 0
                for arg_name, validator in expected.arg_validators.items():
                    if arg_name in actual_args:
                        try:
                            if validator(actual_args[arg_name]):
                                valid_count += 1
                            else:
                                call_detail["issues"].append(
                                    f"Invalid value for arg '{arg_name}'"
                                )
                        except Exception as e:
                            call_detail["issues"].append(
                                f"Arg '{arg_name}' validation error: {e}"
                            )
                validity = valid_count / len(expected.arg_validators)
                call_detail["arg_validity"] = validity
                validity_scores.append(validity)
        else:
            if expected.must_be_called:
                call_detail["issues"].append(
                    f"Required tool '{expected.tool_name}' was not called"
                )
                arg_scores.append(0.0)

        metrics["details"].append(call_detail)

    # Aggregate scores
    if arg_scores:
        metrics["argument_completeness"] = sum(arg_scores) / len(arg_scores)
    if validity_scores:
        metrics["argument_validity"] = sum(validity_scores) / len(validity_scores)

    # Call-order accuracy: expectations that declare a call_order must all be
    # present and appear in that relative order among the actual calls
    ordered = sorted(
        (e for e in expected_calls if e.call_order is not None),
        key=lambda e: e.call_order,
    )
    if ordered:
        positions = [
            actual_names.index(e.tool_name)
            for e in ordered if e.tool_name in actual_names
        ]
        all_present = len(positions) == len(ordered)
        in_order = all(a <= b for a, b in zip(positions, positions[1:]))
        metrics["call_order_accuracy"] = 1.0 if (all_present and in_order) else 0.0
    else:
        # No ordering constraints were specified, so order is trivially correct
        metrics["call_order_accuracy"] = 1.0

    # Spurious calls: tools invoked that no expectation mentions
    expected_tool_names = {e.tool_name for e in expected_calls}
    spurious = [c for c in actual_tool_calls if c.get("name") not in expected_tool_names]
    metrics["spurious_call_rate"] = len(spurious) / max(len(actual_tool_calls), 1)

    return metrics


# Example usage
test_case = {
    "input": "Search for the latest Python 3.12 release notes",
    "expected_tool_calls": [
        ToolCallExpectation(
            tool_name="web_search",
            required_args=["query"],
            arg_validators={
                "query": lambda q: "python" in q.lower()
                and ("3.12" in q or "release" in q.lower())
            },
        )
    ],
}
```
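Running an agent trace through these checks reduces to a few comparisons. Here is a self-contained sketch with an invented trace and a plain-dict stand-in for the expectation (tool names and arguments are made up for illustration):

```python
# Invented trace: what the agent actually called
actual_tool_calls = [
    {"name": "web_search", "args": {"query": "Python 3.12 release notes"}},
]

# Expectation, mirroring ToolCallExpectation in plain-dict form
expected = {
    "tool_name": "web_search",
    "required_args": ["query"],
    "validator": lambda q: "python" in q.lower(),
}

call = next(
    (c for c in actual_tool_calls if c["name"] == expected["tool_name"]), None
)
args = call["args"] if call else {}

selected = call is not None
completeness = (
    sum(1 for a in expected["required_args"] if a in args)
    / len(expected["required_args"])
)
validity = 1.0 if expected["validator"](args.get("query", "")) else 0.0

print(selected, completeness, validity)  # True 1.0 1.0
```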
### Metric 3: Hallucination Rate

**Definition:** The frequency of factually incorrect or fabricated claims in agent outputs.
```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage


def evaluate_hallucination_rate(
    agent_outputs: List[str],
    ground_truth_contexts: List[str],
    evaluator_llm=None,
) -> Dict[str, Any]:
    """
    Measure hallucination rate using an LLM-as-judge.

    Args:
        agent_outputs: List of agent response strings.
        ground_truth_contexts: Source material the agent should draw from,
            paired one-to-one with agent_outputs.
        evaluator_llm: LLM to use as the judge (defaults to gpt-4o).
    """
    if evaluator_llm is None:
        evaluator_llm = ChatOpenAI(model="gpt-4o", temperature=0)

    hallucination_scores = []
    for output, context in zip(agent_outputs, ground_truth_contexts):
        eval_prompt = f"""Evaluate whether the following AI response contains hallucinations.

A hallucination is a claim in the response that:
- Is not supported by the provided context
- Contradicts the provided context
- Introduces specific facts not present in the context (names, numbers, dates, etc.)

Context (ground truth):
{context}

AI Response to evaluate:
{output}

Respond with:
HALLUCINATION_SCORE: [0.0 to 1.0 where 0.0 = no hallucinations, 1.0 = severely hallucinated]
HALLUCINATED_CLAIMS: [List any specific hallucinated claims, or "None"]
REASONING: [Brief explanation of your assessment]"""

        response = evaluator_llm.invoke([HumanMessage(content=eval_prompt)])
        content = response.content

        # Parse the judge's score; default to 0.5 when the reply is malformed
        # (assuming "no hallucination" on a parse failure would bias the metric)
        score = 0.5
        if "HALLUCINATION_SCORE:" in content:
            try:
                score_str = content.split("HALLUCINATION_SCORE:")[1].split("\n")[0].strip()
                score = float(score_str)
            except (ValueError, IndexError):
                score = 0.5
        hallucination_scores.append(score)

    if not hallucination_scores:
        raise ValueError("No outputs to evaluate")

    n = len(hallucination_scores)
    return {
        "hallucination_rate": sum(hallucination_scores) / n,
        "max_hallucination": max(hallucination_scores),
        "hallucination_free_rate": sum(1 for s in hallucination_scores if s < 0.1) / n,
        "high_hallucination_rate": sum(1 for s in hallucination_scores if s > 0.5) / n,
        "individual_scores": hallucination_scores,
    }
```
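The score-parsing step is easy to unit test in isolation. The judge reply below is a fabricated example in exactly the format the prompt requests:

```python
# Fabricated judge reply in the requested format (illustrative only)
reply = """HALLUCINATION_SCORE: 0.2
HALLUCINATED_CLAIMS: None
REASONING: One minor unsupported detail."""

score = 0.5  # fallback if the reply is malformed
if "HALLUCINATION_SCORE:" in reply:
    try:
        score = float(reply.split("HALLUCINATION_SCORE:")[1].split("\n")[0].strip())
    except (ValueError, IndexError):
        pass

print(score)  # 0.2
```

Pinning this parser down with a test matters more than it looks: LLM judges drift in output format, and a silent parse failure quietly skews the aggregate rate toward the fallback value.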
### Metric 4: Latency

**Definition:** Wall-clock time to complete a task, from input to final output.
```python
import time
import statistics
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class LatencyRecord:
    run_id: str
    total_ms: float
    time_to_first_token_ms: Optional[float]
    tool_call_count: int
    token_count: int


@contextmanager
def measure_latency():
    """Context manager for measuring agent execution time.

    A generator-based context manager cannot `return` a value to its caller,
    so we yield a mutable dict and fill in the elapsed time on exit:

        with measure_latency() as timing:
            agent_fn(task)
        print(timing["elapsed_ms"])
    """
    timing: Dict[str, float] = {}
    start = time.perf_counter()
    try:
        yield timing
    finally:
        timing["elapsed_ms"] = (time.perf_counter() - start) * 1000


def benchmark_agent_latency(
    agent_fn: Callable,
    test_inputs: List[str],
    warmup_runs: int = 2,
) -> Dict[str, float]:
    """
    Benchmark agent latency across multiple inputs.
    Includes warmup runs to account for cold-start effects.
    """
    latencies = []

    # Warmup runs (excluded from results)
    for _ in range(warmup_runs):
        if test_inputs:
            try:
                agent_fn(test_inputs[0])
            except Exception:
                pass

    # Measured runs
    for test_input in test_inputs:
        start = time.perf_counter()
        try:
            agent_fn(test_input)
        except Exception:
            pass
        latencies.append((time.perf_counter() - start) * 1000)

    if not latencies:
        return {}

    s = sorted(latencies)
    # Clamp percentile indices so small samples never read past the end
    p95 = s[min(int(len(s) * 0.95), len(s) - 1)]
    p99 = s[min(int(len(s) * 0.99), len(s) - 1)]
    return {
        "mean_latency_ms": statistics.mean(latencies),
        "median_latency_ms": statistics.median(latencies),
        "p95_latency_ms": p95,
        "p99_latency_ms": p99,
        "min_latency_ms": min(latencies),
        "max_latency_ms": max(latencies),
        "std_latency_ms": statistics.stdev(latencies) if len(latencies) > 1 else 0,
        "total_runs": len(latencies),
    }
```
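Why report P95 alongside the mean? A single slow run dominates the tail but barely moves the average, and the median hides it entirely. A self-contained sketch with made-up latencies:

```python
import statistics

# Made-up latencies in ms; one outlier at 480ms
latencies = [120.0, 95.0, 110.0, 480.0, 105.0, 98.0, 102.0, 130.0, 99.0, 101.0]

s = sorted(latencies)
# Clamp the index so small samples never read past the end of the list
p95 = s[min(int(len(s) * 0.95), len(s) - 1)]

print(f"mean   = {statistics.mean(latencies):.1f}ms")    # 144.0ms
print(f"median = {statistics.median(latencies):.1f}ms")  # 103.5ms
print(f"p95    = {p95:.1f}ms")                           # 480.0ms
```

Here the mean suggests a ~144ms agent while one user in ten waited nearly half a second, which is exactly the gap P95 exists to expose.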
### Metric 5: Cost Per Task

**Definition:** Average USD spent on API tokens to complete one task.
```python
import statistics

# In recent LangChain versions this helper lives in langchain_community;
# older releases exported it from langchain.callbacks.
from langchain_community.callbacks import get_openai_callback


def measure_cost_per_task(
    agent_fn: Callable,
    test_inputs: List[str],
) -> Dict[str, Any]:
    """Measure API cost across test inputs (captures OpenAI-metered calls only)."""
    costs = []
    token_counts = []
    for test_input in test_inputs:
        with get_openai_callback() as cb:
            try:
                agent_fn(test_input)
            except Exception:
                pass
        costs.append(cb.total_cost)
        token_counts.append(cb.total_tokens)

    mean_cost = statistics.mean(costs) if costs else 0.0
    return {
        "mean_cost_usd": mean_cost,
        "total_cost_usd": sum(costs),
        "mean_tokens_per_task": statistics.mean(token_counts) if token_counts else 0,
        "cost_per_1000_tasks_usd": mean_cost * 1000,
        # Rough projection assuming 1,000 tasks/day over a 30-day month
        "monthly_cost_estimate_usd": mean_cost * 1000 * 30,
        "individual_costs": costs,
    }
```
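Per-task costs are easy to dismiss until projected to production volume. A quick sketch of the arithmetic, with an assumed per-task cost (the $0.012 figure and the 100K tasks/day volume are illustrative, not measured):

```python
mean_cost_usd = 0.012   # assumed average cost per task (illustrative)
tasks_per_day = 100_000

cost_per_1000 = mean_cost_usd * 1000   # $12 per 1K tasks
daily = mean_cost_usd * tasks_per_day  # $1,200 per day
monthly = daily * 30                   # $36,000 per month

print(f"${cost_per_1000:.2f}/1K tasks, ${daily:,.0f}/day, ${monthly:,.0f}/month")
```

Running this projection against your own measured `mean_cost_usd` before launch is the cheapest cost optimization you will ever do.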
## Full Evaluation Pipeline with LangSmith
```python
from langsmith import Client, evaluate
from langsmith.evaluation import EvaluationResult

client = Client()

# 1. Create an evaluation dataset
dataset_name = "agent_evaluation_suite_v1"

examples = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {"answer": "Paris"},
    },
    {
        "inputs": {"query": "Calculate 15% of $340"},
        "outputs": {"answer": "$51"},
    },
    {
        "inputs": {"query": "Summarize the key benefits of using LangGraph"},
        "outputs": {"answer": "stateful workflows, checkpointing, graph-based orchestration"},
    },
]

# Create the dataset (and its examples) only if it doesn't already exist
try:
    dataset = client.read_dataset(dataset_name=dataset_name)
except Exception:
    dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(
        inputs=[e["inputs"] for e in examples],
        outputs=[e["outputs"] for e in examples],
        dataset_id=dataset.id,
    )

# 2. Define the agent function to evaluate
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

eval_llm = ChatOpenAI(model="gpt-4o", temperature=0.1)


def agent_under_test(inputs: Dict) -> Dict:
    """The agent function to evaluate."""
    query = inputs["query"]
    response = eval_llm.invoke([HumanMessage(content=query)])
    return {"answer": response.content}


# 3. Define evaluators
def correctness_evaluator(run, example) -> EvaluationResult:
    """Evaluate factual correctness of the agent's answer."""
    actual = run.outputs.get("answer", "")
    expected = example.outputs.get("answer", "")
    # Check whether key terms from the expected answer appear in the actual one
    expected_terms = [t for t in expected.lower().split() if len(t) > 3]
    matches = sum(1 for term in expected_terms if term in actual.lower())
    score = matches / len(expected_terms) if expected_terms else 0
    return EvaluationResult(key="correctness", score=score)


def conciseness_evaluator(run, example) -> EvaluationResult:
    """Evaluate response conciseness (shorter = better for factual Q&A)."""
    actual = run.outputs.get("answer", "")
    # Score inversely proportional to length, flooring at 0 beyond 200 words
    word_count = len(actual.split())
    score = max(0, 1 - (word_count / 200))
    return EvaluationResult(key="conciseness", score=score)


# 4. Run the evaluation
results = evaluate(
    agent_under_test,
    data=dataset_name,
    evaluators=[correctness_evaluator, conciseness_evaluator],
    experiment_prefix="baseline-gpt4o",
    metadata={"model": "gpt-4o", "version": "2026-03-01"},
)

df = results.to_pandas()  # materialize once; reuse for both metrics
print("Evaluation Results:")
print(f"  Correctness: {df['feedback.correctness'].mean():.2%}")
print(f"  Conciseness: {df['feedback.conciseness'].mean():.2%}")
```
## Building a Comprehensive Evaluation Report
```python
from dataclasses import dataclass


@dataclass
class AgentEvaluationReport:
    """Comprehensive evaluation report for an AI agent."""
    agent_name: str
    evaluation_date: str
    test_case_count: int
    # Core metrics
    task_completion_rate: float
    tool_call_accuracy: float
    hallucination_rate: float
    mean_latency_ms: float
    mean_cost_usd: float
    # Derived metrics
    p95_latency_ms: float
    cost_per_1000_tasks_usd: float

    def overall_score(self) -> float:
        """Compute a weighted overall health score (0-100)."""
        weights = {
            "task_completion": 0.30,
            "tool_accuracy": 0.25,
            "hallucination_free": 0.25,
            "latency_score": 0.10,
            "cost_score": 0.10,
        }
        scores = {
            "task_completion": self.task_completion_rate,
            "tool_accuracy": self.tool_call_accuracy,
            "hallucination_free": 1 - self.hallucination_rate,
            "latency_score": max(0, 1 - (self.mean_latency_ms / 10000)),  # 10s = 0 score
            "cost_score": max(0, 1 - (self.mean_cost_usd / 0.10)),        # $0.10 = 0 score
        }
        return sum(scores[k] * weights[k] for k in weights) * 100

    def print_report(self) -> None:
        """Print a formatted evaluation report."""
        print(f"\n{'=' * 60}")
        print(f"AGENT EVALUATION REPORT: {self.agent_name}")
        print(f"Date: {self.evaluation_date} | Tests: {self.test_case_count}")
        print(f"{'=' * 60}")
        print(f"\nOVERALL SCORE: {self.overall_score():.1f}/100")
        print("\nCORE METRICS:")
        print(f"  Task Completion Rate:  {self.task_completion_rate:.1%}")
        print(f"  Tool Call Accuracy:    {self.tool_call_accuracy:.1%}")
        print(f"  Hallucination Rate:    {self.hallucination_rate:.1%}")
        print(f"  Mean Latency:          {self.mean_latency_ms:.0f}ms")
        print(f"  Mean Cost Per Task:    ${self.mean_cost_usd:.4f}")
        print("\nEFFICIENCY METRICS:")
        print(f"  P95 Latency:           {self.p95_latency_ms:.0f}ms")
        print(f"  Cost per 1K Tasks:     ${self.cost_per_1000_tasks_usd:.2f}")
        print(f"{'=' * 60}")

        # Health indicators
        print("\nHEALTH INDICATORS:")
        indicators = [
            ("Task Completion >= 90%", self.task_completion_rate >= 0.90),
            ("Tool Accuracy >= 85%", self.tool_call_accuracy >= 0.85),
            ("Hallucination < 10%", self.hallucination_rate < 0.10),
            ("Mean Latency < 5s", self.mean_latency_ms < 5000),
            ("Cost < $0.05/task", self.mean_cost_usd < 0.05),
        ]
        for label, passed in indicators:
            status = "PASS" if passed else "FAIL"
            print(f"  [{status}] {label}")
```
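As a sanity check on the weighting, here is the overall-score arithmetic worked by hand for a hypothetical report (0.92 completion, 0.88 tool accuracy, 0.06 hallucination rate, 2,400ms mean latency, $0.03 mean cost; all figures invented for illustration):

```python
weights = {
    "task_completion": 0.30,
    "tool_accuracy": 0.25,
    "hallucination_free": 0.25,
    "latency_score": 0.10,
    "cost_score": 0.10,
}
scores = {
    "task_completion": 0.92,
    "tool_accuracy": 0.88,
    "hallucination_free": 1 - 0.06,             # 0.94
    "latency_score": max(0, 1 - 2400 / 10000),  # 0.76
    "cost_score": max(0, 1 - 0.03 / 0.10),      # 0.70
}
overall = sum(scores[k] * weights[k] for k in weights) * 100
print(f"overall = {overall:.1f}/100")  # 87.7/100
```

A solid agent that is merely a bit pricey still lands in the high 80s, which is the intent of the 10% cost weight; tune the weights to your own product priorities.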
```python
def run_full_evaluation(agent_fn: Callable, test_suite: List[Dict]) -> AgentEvaluationReport:
    """Run all evaluations and return a comprehensive report."""
    print("Running evaluation suite...")

    # Extract components from the test suite
    inputs = [t["input"] for t in test_suite]
    contexts = [t.get("context", "") for t in test_suite]

    # Run each metric in turn; note each step re-invokes the agent
    print("  Measuring task completion...")
    completion_results = measure_task_completion(
        agent_fn,
        test_suite,
        # Simple validator: any non-trivial output counts as complete
        lambda out, exp: bool(out) and len(out) > 10,
    )

    print("  Measuring latency...")
    latency_results = benchmark_agent_latency(agent_fn, inputs[:10])  # sample for speed

    print("  Measuring cost...")
    cost_results = measure_cost_per_task(agent_fn, inputs[:10])  # sample to cap spend

    print("  Evaluating hallucinations (this may take a moment)...")
    outputs = []
    for inp in inputs[:10]:
        try:
            outputs.append(str(agent_fn(inp)))
        except Exception:
            outputs.append("")
    hallucination_results = evaluate_hallucination_rate(
        outputs,
        contexts[:len(outputs)],  # keep contexts aligned one-to-one with outputs
    )

    return AgentEvaluationReport(
        agent_name="My Agent v1",
        evaluation_date=datetime.now().strftime("%Y-%m-%d"),
        test_case_count=len(test_suite),
        task_completion_rate=completion_results["task_completion_rate"],
        tool_call_accuracy=0.85,  # Placeholder: compute with evaluate_tool_call_accuracy
        hallucination_rate=hallucination_results["hallucination_rate"],
        mean_latency_ms=latency_results.get("mean_latency_ms", 0),
        mean_cost_usd=cost_results.get("mean_cost_usd", 0),
        p95_latency_ms=latency_results.get("p95_latency_ms", 0),
        cost_per_1000_tasks_usd=cost_results.get("cost_per_1000_tasks_usd", 0),
    )


# Example usage
test_suite = [
    {"input": "What is the largest planet in the solar system?", "context": "Jupiter is the largest planet."},
    {"input": "Summarize the key benefits of TypeScript", "context": "TypeScript provides static typing, better IDE support..."},
    {"input": "How do I reverse a list in Python?", "context": "Use list[::-1] or list.reverse() in Python"},
]


def simple_agent(inp: str) -> str:
    """Thin wrapper around the LLM client defined in the LangSmith section."""
    return eval_llm.invoke([HumanMessage(content=inp)]).content


report = run_full_evaluation(simple_agent, test_suite)
report.print_report()
```
## Setting Up Continuous Evaluation in CI/CD
```python
# eval_pipeline.py - run in CI/CD
import sys


def run_ci_evaluation():
    """Run the evaluation and fail CI if metrics fall below thresholds."""
    # agent_fn and load_test_suite() are assumed to be provided by your project
    report = run_full_evaluation(agent_fn, load_test_suite())

    thresholds = {
        "task_completion_rate": 0.85,  # minimum allowed
        "hallucination_rate": 0.15,    # maximum allowed
        "mean_latency_ms": 8000,       # maximum allowed
    }

    failures = []
    if report.task_completion_rate < thresholds["task_completion_rate"]:
        failures.append(
            f"Task completion {report.task_completion_rate:.1%} below "
            f"{thresholds['task_completion_rate']:.1%}"
        )
    if report.hallucination_rate > thresholds["hallucination_rate"]:
        failures.append(
            f"Hallucination rate {report.hallucination_rate:.1%} above "
            f"{thresholds['hallucination_rate']:.1%}"
        )
    if report.mean_latency_ms > thresholds["mean_latency_ms"]:
        failures.append(
            f"Latency {report.mean_latency_ms:.0f}ms above "
            f"{thresholds['mean_latency_ms']}ms"
        )

    report.print_report()

    if failures:
        print("\nCI FAILURES:")
        for failure in failures:
            print(f"  FAILED: {failure}")
        sys.exit(1)  # fail the CI pipeline

    print("\nAll evaluation thresholds passed.")
    sys.exit(0)


if __name__ == "__main__":
    run_ci_evaluation()
```
## Key Takeaways
Building a robust agent evaluation system requires:
- Task completion rate — start here, it is the most important metric for production readiness
- Tool call accuracy — validate not just that tools are called, but that arguments are correct
- Hallucination rate — use LLM-as-judge for subjective assessment of factual grounding
- Latency — measure P95 and P99, not just mean — tail latency kills user experience
- Cost per task — project at scale; $0.05/task sounds small until it is $5,000/day at 100K tasks
- Continuous evaluation — integrate your eval suite into CI/CD so regressions are caught before deployment
The evaluation framework built in this tutorial is not complete — it is a foundation. As you understand your agent's specific failure modes, add custom evaluators that test for the specific ways your agent breaks. An evaluation suite that specifically captures your agent's failure modes is far more valuable than a generic benchmark score.
For complementary topics, see our guides on AI Agent Testing Tools, Agent Observability with Langfuse, and Agentic RAG Evaluation.