# AI Agent Evaluation Metrics: How to Measure Agent Performance
You cannot improve what you do not measure. AI agents that perform well in development frequently degrade in production — model updates change behavior subtly, edge cases surface, and prompts that worked in testing break on real user input. Without a systematic evaluation framework, you are flying blind.
This tutorial builds a complete AI agent evaluation system from scratch. We cover the five core metrics, build Python implementations for each, then assemble a full evaluation pipeline with LangSmith and wire it into CI/CD.
Prerequisites: Python 3.11+, a working AI agent to evaluate, basic understanding of pytest.
Related: Agent Tracing Glossary | Best AI Agent Testing Tools | AI Agent Observability with Langfuse
## The Five Core Agent Evaluation Metrics

### Metric 1: Task Completion Rate

**Definition:** The percentage of tasks the agent successfully completes end-to-end without human intervention.

**Why it matters:** This is the most fundamental measure of agent capability. An agent with 70% task completion is failing 3 in 10 users, which is a serious reliability problem in production.
```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional
from datetime import datetime
import time


@dataclass
class AgentRun:
    """Record of a single agent task execution."""
    run_id: str
    input: str
    output: Optional[str]
    expected_output: Optional[str]
    tool_calls: List[Dict]
    completed: bool
    error: Optional[str]
    latency_ms: float
    total_tokens: int
    cost_usd: float
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())


def measure_task_completion(
    agent_fn: Callable,
    test_cases: List[Dict],
    completion_validator: Callable,
) -> Dict[str, Any]:
    """
    Measure task completion rate across a set of test cases.

    Args:
        agent_fn: Function that runs the agent and returns its output.
        test_cases: List of {'input': str, 'expected': str} dicts.
        completion_validator: Function(output, expected) -> bool.
    """
    if not test_cases:
        raise ValueError("test_cases must not be empty")

    results = []
    for case in test_cases:
        start_time = time.time()
        try:
            output = agent_fn(case["input"])
            completed = completion_validator(output, case.get("expected"))
            error = None
        except Exception as e:
            output = None
            completed = False
            error = str(e)
        latency = (time.time() - start_time) * 1000
        results.append({
            "input": case["input"],
            "output": output,
            "completed": completed,
            "error": error,
            "latency_ms": latency,
        })

    total = len(results)
    completed_count = sum(1 for r in results if r["completed"])
    error_count = sum(1 for r in results if r["error"])
    return {
        "task_completion_rate": completed_count / total,
        "error_rate": error_count / total,
        "total_tasks": total,
        "completed_tasks": completed_count,
        "failed_tasks": total - completed_count,
        "avg_latency_ms": sum(r["latency_ms"] for r in results) / total,
        "results": results,
    }


# Example completion validator
def validate_answer_completeness(output: Optional[str], expected: Optional[str]) -> bool:
    """Check whether the output contains the key terms of the expected answer."""
    if not output or not expected:
        return False
    # Use the first five words of the expected answer as key terms,
    # skipping short stop-word-like tokens
    expected_terms = expected.lower().split()[:5]
    return all(term in output.lower() for term in expected_terms if len(term) > 3)
```
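To make the numbers concrete, here is a minimal, self-contained sketch of the completion-rate computation. The stub agent, questions, and canned answers are all invented for illustration so the result is deterministic:

```python
# Hypothetical stub agent with canned answers (illustrative only)
def stub_agent(task: str) -> str:
    answers = {
        "capital of France?": "The capital of France is Paris.",
        "largest planet?": "Jupiter is the largest planet.",
        "speed of light?": "",  # simulates a run that produced no answer
    }
    return answers[task]


def contains_expected(output: str, expected: str) -> bool:
    """Toy completion validator: the expected answer must appear in the output."""
    return bool(output) and expected.lower() in output.lower()


cases = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "largest planet?", "expected": "Jupiter"},
    {"input": "speed of light?", "expected": "299,792 km/s"},
]

completed = sum(
    1 for c in cases
    if contains_expected(stub_agent(c["input"]), c["expected"])
)
rate = completed / len(cases)
print(f"task_completion_rate = {rate:.2f}")  # 2 of 3 cases pass -> 0.67
```

The same loop structure scales to `measure_task_completion` above; only the agent and validator change.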
### Metric 2: Tool Call Accuracy

**Definition:** Whether the agent selects the correct tool and provides well-formed, correct arguments.

**Why it matters:** An agent that calls the wrong tool or passes malformed arguments wastes API calls, produces wrong outputs, and can trigger side effects in connected systems.
```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ToolCallExpectation:
    """Expected tool call specification."""
    tool_name: str
    required_args: List[str]             # Arguments that must be present
    arg_validators: Dict[str, Callable]  # Arg name -> validation function
    must_be_called: bool = True          # Whether this tool must be called
    call_order: Optional[int] = None     # Expected position in call sequence


def evaluate_tool_call_accuracy(
    actual_tool_calls: List[Dict],
    expected_calls: List[ToolCallExpectation],
) -> Dict[str, Any]:
    """
    Evaluate the accuracy of an agent's tool-calling behavior.

    Args:
        actual_tool_calls: List of {'name': str, 'args': dict} from the agent run.
        expected_calls: List of ToolCallExpectation specifications.
    """
    metrics = {
        "tool_selection_accuracy": 0.0,  # Correct tool selected
        "argument_completeness": 0.0,    # Required args present
        "argument_validity": 0.0,        # Args pass validation
        "call_order_accuracy": 0.0,      # Calls in the right order
        "spurious_call_rate": 0.0,       # Unexpected tool calls
        "details": [],
    }

    actual_names = [c.get("name", "") for c in actual_tool_calls]
    expected_names = [e.tool_name for e in expected_calls if e.must_be_called]

    # Tool selection accuracy
    correct_selections = sum(1 for name in expected_names if name in actual_names)
    if expected_names:
        metrics["tool_selection_accuracy"] = correct_selections / len(expected_names)

    # Argument analysis for each expected call
    arg_scores = []
    validity_scores = []
    for expected in expected_calls:
        # Find the matching actual call (first occurrence)
        actual_call = next(
            (c for c in actual_tool_calls if c.get("name") == expected.tool_name),
            None,
        )
        call_detail = {
            "expected_tool": expected.tool_name,
            "called": actual_call is not None,
            "arg_completeness": 0.0,
            "arg_validity": 0.0,
            "issues": [],
        }

        if actual_call:
            actual_args = actual_call.get("args", {})

            # Check that required args are present
            present_count = sum(1 for arg in expected.required_args if arg in actual_args)
            completeness = (
                present_count / len(expected.required_args)
                if expected.required_args else 1.0
            )
            call_detail["arg_completeness"] = completeness
            arg_scores.append(completeness)

            # Check arg validity
            if expected.arg_validators:
                valid_count = 0
                for arg_name, validator in expected.arg_validators.items():
                    if arg_name in actual_args:
                        try:
                            if validator(actual_args[arg_name]):
                                valid_count += 1
                            else:
                                call_detail["issues"].append(
                                    f"Invalid value for arg '{arg_name}'"
                                )
                        except Exception as e:
                            call_detail["issues"].append(
                                f"Arg '{arg_name}' validation error: {e}"
                            )
                validity = valid_count / len(expected.arg_validators)
                call_detail["arg_validity"] = validity
                validity_scores.append(validity)
        else:
            if expected.must_be_called:
                call_detail["issues"].append(
                    f"Required tool '{expected.tool_name}' was not called"
                )
                arg_scores.append(0.0)

        metrics["details"].append(call_detail)

    # Aggregate scores
    if arg_scores:
        metrics["argument_completeness"] = sum(arg_scores) / len(arg_scores)
    if validity_scores:
        metrics["argument_validity"] = sum(validity_scores) / len(validity_scores)

    # Call-order accuracy: expectations that declare a call_order must all be
    # present and appear in that relative order among the actual calls
    ordered = sorted(
        (e for e in expected_calls if e.call_order is not None),
        key=lambda e: e.call_order,
    )
    if ordered:
        positions = [
            actual_names.index(e.tool_name)
            for e in ordered if e.tool_name in actual_names
        ]
        all_present = len(positions) == len(ordered)
        in_order = all(a <= b for a, b in zip(positions, positions[1:]))
        metrics["call_order_accuracy"] = 1.0 if (all_present and in_order) else 0.0
    else:
        # No ordering constraints were specified, so order is trivially correct
        metrics["call_order_accuracy"] = 1.0

    # Spurious calls: tools invoked that no expectation mentions
    expected_tool_names = {e.tool_name for e in expected_calls}
    spurious = [c for c in actual_tool_calls if c.get("name") not in expected_tool_names]
    metrics["spurious_call_rate"] = len(spurious) / max(len(actual_tool_calls), 1)

    return metrics


# Example usage
test_case = {
    "input": "Search for the latest Python 3.12 release notes",
    "expected_tool_calls": [
        ToolCallExpectation(
            tool_name="web_search",
            required_args=["query"],
            arg_validators={
                "query": lambda q: "python" in q.lower()
                and ("3.12" in q or "release" in q.lower())
            },
        )
    ],
}
```
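Running an agent trace through these checks reduces to a few comparisons. Here is a self-contained sketch with an invented trace and a plain-dict stand-in for the expectation (tool names and arguments are made up for illustration):

```python
# Invented trace: what the agent actually called
actual_tool_calls = [
    {"name": "web_search", "args": {"query": "Python 3.12 release notes"}},
]

# Expectation, mirroring ToolCallExpectation in plain-dict form
expected = {
    "tool_name": "web_search",
    "required_args": ["query"],
    "validator": lambda q: "python" in q.lower(),
}

call = next(
    (c for c in actual_tool_calls if c["name"] == expected["tool_name"]), None
)
args = call["args"] if call else {}

selected = call is not None
completeness = (
    sum(1 for a in expected["required_args"] if a in args)
    / len(expected["required_args"])
)
validity = 1.0 if expected["validator"](args.get("query", "")) else 0.0

print(selected, completeness, validity)  # True 1.0 1.0
```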
### Metric 3: Hallucination Rate

**Definition:** The frequency of factually incorrect or fabricated claims in agent outputs.
```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage


def evaluate_hallucination_rate(
    agent_outputs: List[str],
    ground_truth_contexts: List[str],
    evaluator_llm=None,
) -> Dict[str, Any]:
    """
    Measure hallucination rate using an LLM-as-judge.

    Args:
        agent_outputs: List of agent response strings.
        ground_truth_contexts: Source material the agent should draw from,
            paired one-to-one with agent_outputs.
        evaluator_llm: LLM to use as the judge (defaults to gpt-4o).
    """
    if evaluator_llm is None:
        evaluator_llm = ChatOpenAI(model="gpt-4o", temperature=0)

    hallucination_scores = []
    for output, context in zip(agent_outputs, ground_truth_contexts):
        eval_prompt = f"""Evaluate whether the following AI response contains hallucinations.

A hallucination is a claim in the response that:
- Is not supported by the provided context
- Contradicts the provided context
- Introduces specific facts not present in the context (names, numbers, dates, etc.)

Context (ground truth):
{context}

AI Response to evaluate:
{output}

Respond with:
HALLUCINATION_SCORE: [0.0 to 1.0 where 0.0 = no hallucinations, 1.0 = severely hallucinated]
HALLUCINATED_CLAIMS: [List any specific hallucinated claims, or "None"]
REASONING: [Brief explanation of your assessment]"""

        response = evaluator_llm.invoke([HumanMessage(content=eval_prompt)])
        content = response.content

        # Parse the judge's score; default to 0.5 when the reply is malformed
        # (assuming "no hallucination" on a parse failure would bias the metric)
        score = 0.5
        if "HALLUCINATION_SCORE:" in content:
            try:
                score_str = content.split("HALLUCINATION_SCORE:")[1].split("\n")[0].strip()
                score = float(score_str)
            except (ValueError, IndexError):
                score = 0.5
        hallucination_scores.append(score)

    if not hallucination_scores:
        raise ValueError("No outputs to evaluate")

    n = len(hallucination_scores)
    return {
        "hallucination_rate": sum(hallucination_scores) / n,
        "max_hallucination": max(hallucination_scores),
        "hallucination_free_rate": sum(1 for s in hallucination_scores if s < 0.1) / n,
        "high_hallucination_rate": sum(1 for s in hallucination_scores if s > 0.5) / n,
        "individual_scores": hallucination_scores,
    }
```
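The score-parsing step is easy to unit test in isolation. The judge reply below is a fabricated example in exactly the format the prompt requests:

```python
# Fabricated judge reply in the requested format (illustrative only)
reply = """HALLUCINATION_SCORE: 0.2
HALLUCINATED_CLAIMS: None
REASONING: One minor unsupported detail."""

score = 0.5  # fallback if the reply is malformed
if "HALLUCINATION_SCORE:" in reply:
    try:
        score = float(reply.split("HALLUCINATION_SCORE:")[1].split("\n")[0].strip())
    except (ValueError, IndexError):
        pass

print(score)  # 0.2
```

Pinning this parser down with a test matters more than it looks: LLM judges drift in output format, and a silent parse failure quietly skews the aggregate rate toward the fallback value.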
### Metric 4: Latency

**Definition:** Wall-clock time to complete a task, from input to final output.
```python
import time
import statistics
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class LatencyRecord:
    run_id: str
    total_ms: float
    time_to_first_token_ms: Optional[float]
    tool_call_count: int
    token_count: int


@contextmanager
def measure_latency():
    """Context manager for measuring agent execution time.

    A generator-based context manager cannot `return` a value to its caller,
    so we yield a mutable dict and fill in the elapsed time on exit:

        with measure_latency() as timing:
            agent_fn(task)
        print(timing["elapsed_ms"])
    """
    timing: Dict[str, float] = {}
    start = time.perf_counter()
    try:
        yield timing
    finally:
        timing["elapsed_ms"] = (time.perf_counter() - start) * 1000


def benchmark_agent_latency(
    agent_fn: Callable,
    test_inputs: List[str],
    warmup_runs: int = 2,
) -> Dict[str, float]:
    """
    Benchmark agent latency across multiple inputs.
    Includes warmup runs to account for cold-start effects.
    """
    latencies = []

    # Warmup runs (excluded from results)
    for _ in range(warmup_runs):
        if test_inputs:
            try:
                agent_fn(test_inputs[0])
            except Exception:
                pass

    # Measured runs
    for test_input in test_inputs:
        start = time.perf_counter()
        try:
            agent_fn(test_input)
        except Exception:
            pass
        latencies.append((time.perf_counter() - start) * 1000)

    if not latencies:
        return {}

    s = sorted(latencies)
    # Clamp percentile indices so small samples never read past the end
    p95 = s[min(int(len(s) * 0.95), len(s) - 1)]
    p99 = s[min(int(len(s) * 0.99), len(s) - 1)]
    return {
        "mean_latency_ms": statistics.mean(latencies),
        "median_latency_ms": statistics.median(latencies),
        "p95_latency_ms": p95,
        "p99_latency_ms": p99,
        "min_latency_ms": min(latencies),
        "max_latency_ms": max(latencies),
        "std_latency_ms": statistics.stdev(latencies) if len(latencies) > 1 else 0,
        "total_runs": len(latencies),
    }
```
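Why report P95 alongside the mean? A single slow run dominates the tail but barely moves the average, and the median hides it entirely. A self-contained sketch with made-up latencies:

```python
import statistics

# Made-up latencies in ms; one outlier at 480ms
latencies = [120.0, 95.0, 110.0, 480.0, 105.0, 98.0, 102.0, 130.0, 99.0, 101.0]

s = sorted(latencies)
# Clamp the index so small samples never read past the end of the list
p95 = s[min(int(len(s) * 0.95), len(s) - 1)]

print(f"mean   = {statistics.mean(latencies):.1f}ms")    # 144.0ms
print(f"median = {statistics.median(latencies):.1f}ms")  # 103.5ms
print(f"p95    = {p95:.1f}ms")                           # 480.0ms
```

Here the mean suggests a ~144ms agent while one user in ten waited nearly half a second, which is exactly the gap P95 exists to expose.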
### Metric 5: Cost Per Task

**Definition:** Average USD spent on API tokens to complete one task.
```python
import statistics

# In recent LangChain versions this helper lives in langchain_community;
# older releases exported it from langchain.callbacks.
from langchain_community.callbacks import get_openai_callback


def measure_cost_per_task(
    agent_fn: Callable,
    test_inputs: List[str],
) -> Dict[str, Any]:
    """Measure API cost across test inputs (captures OpenAI-metered calls only)."""
    costs = []
    token_counts = []
    for test_input in test_inputs:
        with get_openai_callback() as cb:
            try:
                agent_fn(test_input)
            except Exception:
                pass
        costs.append(cb.total_cost)
        token_counts.append(cb.total_tokens)

    mean_cost = statistics.mean(costs) if costs else 0.0
    return {
        "mean_cost_usd": mean_cost,
        "total_cost_usd": sum(costs),
        "mean_tokens_per_task": statistics.mean(token_counts) if token_counts else 0,
        "cost_per_1000_tasks_usd": mean_cost * 1000,
        # Rough projection assuming 1,000 tasks/day over a 30-day month
        "monthly_cost_estimate_usd": mean_cost * 1000 * 30,
        "individual_costs": costs,
    }
```
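Per-task costs are easy to dismiss until projected to production volume. A quick sketch of the arithmetic, with an assumed per-task cost (the $0.012 figure and the 100K tasks/day volume are illustrative, not measured):

```python
mean_cost_usd = 0.012   # assumed average cost per task (illustrative)
tasks_per_day = 100_000

cost_per_1000 = mean_cost_usd * 1000   # $12 per 1K tasks
daily = mean_cost_usd * tasks_per_day  # $1,200 per day
monthly = daily * 30                   # $36,000 per month

print(f"${cost_per_1000:.2f}/1K tasks, ${daily:,.0f}/day, ${monthly:,.0f}/month")
```

Running this projection against your own measured `mean_cost_usd` before launch is the cheapest cost optimization you will ever do.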
## Full Evaluation Pipeline with LangSmith
```python
from langsmith import Client, evaluate
from langsmith.evaluation import EvaluationResult

client = Client()

# 1. Create an evaluation dataset
dataset_name = "agent_evaluation_suite_v1"

examples = [
    {
        "inputs": {"query": "What is the capital of France?"},
        "outputs": {"answer": "Paris"},
    },
    {
        "inputs": {"query": "Calculate 15% of $340"},
        "outputs": {"answer": "$51"},
    },
    {
        "inputs": {"query": "Summarize the key benefits of using LangGraph"},
        "outputs": {"answer": "stateful workflows, checkpointing, graph-based orchestration"},
    },
]

# Create the dataset (and its examples) only if it doesn't already exist
try:
    dataset = client.read_dataset(dataset_name=dataset_name)
except Exception:
    dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(
        inputs=[e["inputs"] for e in examples],
        outputs=[e["outputs"] for e in examples],
        dataset_id=dataset.id,
    )

# 2. Define the agent function to evaluate
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

eval_llm = ChatOpenAI(model="gpt-4o", temperature=0.1)


def agent_under_test(inputs: Dict) -> Dict:
    """The agent function to evaluate."""
    query = inputs["query"]
    response = eval_llm.invoke([HumanMessage(content=query)])
    return {"answer": response.content}


# 3. Define evaluators
def correctness_evaluator(run, example) -> EvaluationResult:
    """Evaluate factual correctness of the agent's answer."""
    actual = run.outputs.get("answer", "")
    expected = example.outputs.get("answer", "")
    # Check whether key terms from the expected answer appear in the actual one
    expected_terms = [t for t in expected.lower().split() if len(t) > 3]
    matches = sum(1 for term in expected_terms if term in actual.lower())
    score = matches / len(expected_terms) if expected_terms else 0
    return EvaluationResult(key="correctness", score=score)


def conciseness_evaluator(run, example) -> EvaluationResult:
    """Evaluate response conciseness (shorter = better for factual Q&A)."""
    actual = run.outputs.get("answer", "")
    # Score inversely proportional to length, flooring at 0 beyond 200 words
    word_count = len(actual.split())
    score = max(0, 1 - (word_count / 200))
    return EvaluationResult(key="conciseness", score=score)


# 4. Run the evaluation
results = evaluate(
    agent_under_test,
    data=dataset_name,
    evaluators=[correctness_evaluator, conciseness_evaluator],
    experiment_prefix="baseline-gpt4o",
    metadata={"model": "gpt-4o", "version": "2026-03-01"},
)

df = results.to_pandas()  # materialize once; reuse for both metrics
print("Evaluation Results:")
print(f"  Correctness: {df['feedback.correctness'].mean():.2%}")
print(f"  Conciseness: {df['feedback.conciseness'].mean():.2%}")
```
## Building a Comprehensive Evaluation Report
```python
from dataclasses import dataclass


@dataclass
class AgentEvaluationReport:
    """Comprehensive evaluation report for an AI agent."""
    agent_name: str
    evaluation_date: str
    test_case_count: int
    # Core metrics
    task_completion_rate: float
    tool_call_accuracy: float
    hallucination_rate: float
    mean_latency_ms: float
    mean_cost_usd: float
    # Derived metrics
    p95_latency_ms: float
    cost_per_1000_tasks_usd: float

    def overall_score(self) -> float:
        """Compute a weighted overall health score (0-100)."""
        weights = {
            "task_completion": 0.30,
            "tool_accuracy": 0.25,
            "hallucination_free": 0.25,
            "latency_score": 0.10,
            "cost_score": 0.10,
        }
        scores = {
            "task_completion": self.task_completion_rate,
            "tool_accuracy": self.tool_call_accuracy,
            "hallucination_free": 1 - self.hallucination_rate,
            "latency_score": max(0, 1 - (self.mean_latency_ms / 10000)),  # 10s = 0 score
            "cost_score": max(0, 1 - (self.mean_cost_usd / 0.10)),        # $0.10 = 0 score
        }
        return sum(scores[k] * weights[k] for k in weights) * 100

    def print_report(self) -> None:
        """Print a formatted evaluation report."""
        print(f"\n{'=' * 60}")
        print(f"AGENT EVALUATION REPORT: {self.agent_name}")
        print(f"Date: {self.evaluation_date} | Tests: {self.test_case_count}")
        print(f"{'=' * 60}")
        print(f"\nOVERALL SCORE: {self.overall_score():.1f}/100")
        print("\nCORE METRICS:")
        print(f"  Task Completion Rate:  {self.task_completion_rate:.1%}")
        print(f"  Tool Call Accuracy:    {self.tool_call_accuracy:.1%}")
        print(f"  Hallucination Rate:    {self.hallucination_rate:.1%}")
        print(f"  Mean Latency:          {self.mean_latency_ms:.0f}ms")
        print(f"  Mean Cost Per Task:    ${self.mean_cost_usd:.4f}")
        print("\nEFFICIENCY METRICS:")
        print(f"  P95 Latency:           {self.p95_latency_ms:.0f}ms")
        print(f"  Cost per 1K Tasks:     ${self.cost_per_1000_tasks_usd:.2f}")
        print(f"{'=' * 60}")

        # Health indicators
        print("\nHEALTH INDICATORS:")
        indicators = [
            ("Task Completion >= 90%", self.task_completion_rate >= 0.90),
            ("Tool Accuracy >= 85%", self.tool_call_accuracy >= 0.85),
            ("Hallucination < 10%", self.hallucination_rate < 0.10),
            ("Mean Latency < 5s", self.mean_latency_ms < 5000),
            ("Cost < $0.05/task", self.mean_cost_usd < 0.05),
        ]
        for label, passed in indicators:
            status = "PASS" if passed else "FAIL"
            print(f"  [{status}] {label}")
```
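As a sanity check on the weighting, here is the overall-score arithmetic worked by hand for a hypothetical report (0.92 completion, 0.88 tool accuracy, 0.06 hallucination rate, 2,400ms mean latency, $0.03 mean cost; all figures invented for illustration):

```python
weights = {
    "task_completion": 0.30,
    "tool_accuracy": 0.25,
    "hallucination_free": 0.25,
    "latency_score": 0.10,
    "cost_score": 0.10,
}
scores = {
    "task_completion": 0.92,
    "tool_accuracy": 0.88,
    "hallucination_free": 1 - 0.06,             # 0.94
    "latency_score": max(0, 1 - 2400 / 10000),  # 0.76
    "cost_score": max(0, 1 - 0.03 / 0.10),      # 0.70
}
overall = sum(scores[k] * weights[k] for k in weights) * 100
print(f"overall = {overall:.1f}/100")  # 87.7/100
```

A solid agent that is merely a bit pricey still lands in the high 80s, which is the intent of the 10% cost weight; tune the weights to your own product priorities.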
```python
def run_full_evaluation(agent_fn: Callable, test_suite: List[Dict]) -> AgentEvaluationReport:
    """Run all evaluations and return a comprehensive report."""
    print("Running evaluation suite...")

    # Extract components from the test suite
    inputs = [t["input"] for t in test_suite]
    contexts = [t.get("context", "") for t in test_suite]

    # Run each metric in turn; note each step re-invokes the agent
    print("  Measuring task completion...")
    completion_results = measure_task_completion(
        agent_fn,
        test_suite,
        # Simple validator: any non-trivial output counts as complete
        lambda out, exp: bool(out) and len(out) > 10,
    )

    print("  Measuring latency...")
    latency_results = benchmark_agent_latency(agent_fn, inputs[:10])  # sample for speed

    print("  Measuring cost...")
    cost_results = measure_cost_per_task(agent_fn, inputs[:10])  # sample to cap spend

    print("  Evaluating hallucinations (this may take a moment)...")
    outputs = []
    for inp in inputs[:10]:
        try:
            outputs.append(str(agent_fn(inp)))
        except Exception:
            outputs.append("")
    hallucination_results = evaluate_hallucination_rate(
        outputs,
        contexts[:len(outputs)],  # keep contexts aligned one-to-one with outputs
    )

    return AgentEvaluationReport(
        agent_name="My Agent v1",
        evaluation_date=datetime.now().strftime("%Y-%m-%d"),
        test_case_count=len(test_suite),
        task_completion_rate=completion_results["task_completion_rate"],
        tool_call_accuracy=0.85,  # Placeholder: compute with evaluate_tool_call_accuracy
        hallucination_rate=hallucination_results["hallucination_rate"],
        mean_latency_ms=latency_results.get("mean_latency_ms", 0),
        mean_cost_usd=cost_results.get("mean_cost_usd", 0),
        p95_latency_ms=latency_results.get("p95_latency_ms", 0),
        cost_per_1000_tasks_usd=cost_results.get("cost_per_1000_tasks_usd", 0),
    )


# Example usage
test_suite = [
    {"input": "What is the largest planet in the solar system?", "context": "Jupiter is the largest planet."},
    {"input": "Summarize the key benefits of TypeScript", "context": "TypeScript provides static typing, better IDE support..."},
    {"input": "How do I reverse a list in Python?", "context": "Use list[::-1] or list.reverse() in Python"},
]


def simple_agent(inp: str) -> str:
    """Thin wrapper around the LLM client defined in the LangSmith section."""
    return eval_llm.invoke([HumanMessage(content=inp)]).content


report = run_full_evaluation(simple_agent, test_suite)
report.print_report()
```
## Setting Up Continuous Evaluation in CI/CD
```python
# eval_pipeline.py - run in CI/CD
import sys


def run_ci_evaluation():
    """Run the evaluation and fail CI if metrics fall below thresholds."""
    # agent_fn and load_test_suite() are assumed to be provided by your project
    report = run_full_evaluation(agent_fn, load_test_suite())

    thresholds = {
        "task_completion_rate": 0.85,  # minimum allowed
        "hallucination_rate": 0.15,    # maximum allowed
        "mean_latency_ms": 8000,       # maximum allowed
    }

    failures = []
    if report.task_completion_rate < thresholds["task_completion_rate"]:
        failures.append(
            f"Task completion {report.task_completion_rate:.1%} below "
            f"{thresholds['task_completion_rate']:.1%}"
        )
    if report.hallucination_rate > thresholds["hallucination_rate"]:
        failures.append(
            f"Hallucination rate {report.hallucination_rate:.1%} above "
            f"{thresholds['hallucination_rate']:.1%}"
        )
    if report.mean_latency_ms > thresholds["mean_latency_ms"]:
        failures.append(
            f"Latency {report.mean_latency_ms:.0f}ms above "
            f"{thresholds['mean_latency_ms']}ms"
        )

    report.print_report()

    if failures:
        print("\nCI FAILURES:")
        for failure in failures:
            print(f"  FAILED: {failure}")
        sys.exit(1)  # fail the CI pipeline

    print("\nAll evaluation thresholds passed.")
    sys.exit(0)


if __name__ == "__main__":
    run_ci_evaluation()
```
## Key Takeaways
Building a robust agent evaluation system requires:
- Task completion rate — start here, it is the most important metric for production readiness
- Tool call accuracy — validate not just that tools are called, but that arguments are correct
- Hallucination rate — use LLM-as-judge for subjective assessment of factual grounding
- Latency — measure P95 and P99, not just mean — tail latency kills user experience
- Cost per task — project at scale; $0.05/task sounds small until it is $5,000/day at 100K tasks
- Continuous evaluation — integrate your eval suite into CI/CD so regressions are caught before deployment
The evaluation framework built in this tutorial is not complete — it is a foundation. As you understand your agent's specific failure modes, add custom evaluators that test for the specific ways your agent breaks. An evaluation suite that specifically captures your agent's failure modes is far more valuable than a generic benchmark score.
For complementary topics, see our guides on AI Agent Testing Tools, Agent Observability with Langfuse, and Agentic RAG Evaluation.