# What Is Agent Tracing?

## Quick Definition
Agent tracing is the collection of structured telemetry from AI agent executions — capturing LLM calls, tool invocations, latency measurements, token usage, and reasoning chains in a hierarchical structure that enables debugging, performance optimization, and production monitoring. Tracing gives developers visibility into what their agents actually do at runtime, making it possible to diagnose failures, identify bottlenecks, and continuously improve agent behavior.
Browse all AI agent terms in the AI Agent Glossary. For the compliance-focused counterpart to tracing, see Agent Audit Trail. For the execution infrastructure being traced, see Agent Runtime.
## Why Agent Tracing Is Different from Traditional Observability
Traditional application tracing captures function calls, HTTP requests, and database queries. Agent tracing captures something more complex: a probabilistic reasoning process in which the path from input to output is non-deterministic and involves multiple model calls, tool invocations, and accumulated context.
Key differences:
| Dimension | Traditional Tracing | Agent Tracing |
|---|---|---|
| Path | Deterministic call stack | Probabilistic reasoning chain |
| Primary data | Method calls, HTTP requests | LLM calls + tool invocations |
| Debugging | Stacktrace → root cause | Conversation history → reasoning failure |
| Performance | Request latency | LLM call latency + token cost |
| Quality | Error rate | Response quality + hallucination rate |
| Tools | Datadog, Jaeger | LangSmith, Langfuse, Arize Phoenix |
The fundamental difference: agent debugging requires understanding why the agent reasoned in a particular way, which requires seeing the conversation context at each step — not just the technical call sequence.
## The Trace Hierarchy
Agent traces are hierarchical: a session contains runs, runs contain LLM calls and tool calls, which may spawn child runs:
```
Session (user_id: u123, session_id: s456)
├── Run: "research quarterly report"
│   ├── LLM Call: Planning (120ms, 450 tokens)
│   │   └── Tool Decision: web_search
│   ├── Tool Call: web_search("Q3 earnings tech companies")
│   │   └── Duration: 1.2s, results: 5 items
│   ├── LLM Call: Analysis (340ms, 820 tokens)
│   │   └── Tool Decision: web_search (second query)
│   ├── Tool Call: web_search("AI company earnings Q3 2026")
│   │   └── Duration: 0.9s, results: 3 items
│   └── LLM Call: Synthesis (520ms, 1250 tokens)
│       └── Stop: end_turn → Final output
└── Metrics: Total 3.2s, 2520 tokens, 2 tool calls
```
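In memory, this hierarchy is just a tree of spans that aggregates bottom-up. A minimal sketch (the `Span` class and its field names are illustrative, not any particular SDK's API):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    """One node in a trace tree: an LLM call, tool call, or child run."""
    name: str
    kind: str                      # "llm", "tool", or "run"
    duration_ms: int = 0
    tokens: int = 0
    children: List["Span"] = field(default_factory=list)

    def total_tokens(self) -> int:
        # Aggregate token usage bottom-up through the tree
        return self.tokens + sum(c.total_tokens() for c in self.children)

# The example trace above, as a tree
run = Span("research quarterly report", "run", children=[
    Span("planning", "llm", duration_ms=120, tokens=450),
    Span("web_search", "tool", duration_ms=1200),
    Span("analysis", "llm", duration_ms=340, tokens=820),
    Span("web_search", "tool", duration_ms=900),
    Span("synthesis", "llm", duration_ms=520, tokens=1250),
])

print(run.total_tokens())  # 2520
```

Session-level metrics like the 2520-token total fall out of a single recursive pass over the tree, which is why tracing backends store spans with parent pointers rather than as flat logs.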
## Implementing Agent Tracing

### Manual Tracing with Langfuse
```python
from langfuse import Langfuse
from anthropic import Anthropic
import time

langfuse = Langfuse()
anthropic_client = Anthropic()

def run_traced_agent(user_message: str,
                     session_id: str | None = None) -> str:
    """Run agent with Langfuse tracing."""
    # Create a trace for the full session
    trace = langfuse.trace(
        name="agent-run",
        input={"user_message": user_message},
        session_id=session_id,
        tags=["production", "v1"]
    )
    messages = [{"role": "user", "content": user_message}]
    tools = [web_search_tool_schema]  # Defined elsewhere
    step = 0

    try:
        for _ in range(10):
            step += 1
            # Create span for the LLM call
            llm_span = trace.span(
                name=f"llm-call-{step}",
                input={"messages": messages, "step": step}
            )
            start_time = time.time()
            response = anthropic_client.messages.create(
                model="claude-opus-4-6",
                max_tokens=2048,
                messages=messages,
                tools=tools
            )
            latency_ms = int((time.time() - start_time) * 1000)

            # Update span with LLM call results
            response_text = next(
                (b.text for b in response.content if hasattr(b, "text")), ""
            )
            llm_span.update(
                output={"response": response_text, "stop_reason": response.stop_reason},
                metadata={
                    "model": "claude-opus-4-6",
                    "input_tokens": response.usage.input_tokens,
                    "output_tokens": response.usage.output_tokens,
                    "latency_ms": latency_ms
                }
            )
            llm_span.end()

            if response.stop_reason == "end_turn":
                # Final output — score it
                trace.score(
                    name="agent-completion",
                    value=1,
                    comment="Agent reached final answer"
                )
                trace.update(output={"final_answer": response_text})
                return response_text

            # Trace tool calls
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    tool_span = trace.span(
                        name=f"tool-{block.name}-{step}",
                        input={"tool": block.name, "input": block.input}
                    )
                    tool_start = time.time()
                    try:
                        result = execute_tool(block.name, block.input)
                        tool_duration = int((time.time() - tool_start) * 1000)
                        tool_span.update(
                            output={"result": str(result)[:500]},  # Truncate for storage
                            metadata={"duration_ms": tool_duration, "status": "success"}
                        )
                    except Exception as e:
                        tool_span.update(
                            output={"error": str(e)},
                            metadata={"status": "error"}
                        )
                        result = f"Error: {e}"
                    finally:
                        tool_span.end()
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result)
                    })
            messages.append({"role": "user", "content": tool_results})
    except Exception as e:
        trace.update(output={"error": str(e)}, metadata={"status": "failed"})
        raise

    return "Max iterations reached"
```
### OpenTelemetry-Based Tracing
For teams with existing OTel infrastructure:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from anthropic import Anthropic

# Set up the OTel tracer
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-agent", "1.0.0")

def otel_traced_llm_call(model: str, messages: list,
                         tools: list | None = None) -> dict:
    """Make an LLM call wrapped in an OpenTelemetry span."""
    with tracer.start_as_current_span("llm.call") as span:
        # Standard GenAI semantic conventions
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.max_tokens", 2048)
        span.set_attribute("llm.request.messages_count", len(messages))

        client = Anthropic()
        response = client.messages.create(
            model=model,
            max_tokens=2048,
            messages=messages,
            tools=tools or []
        )

        # Record response metadata
        span.set_attribute("gen_ai.usage.input_tokens",
                           response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens",
                           response.usage.output_tokens)
        span.set_attribute("gen_ai.response.stop_reason",
                           response.stop_reason)
        return response
```
## Key Metrics to Track
```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AgentSessionMetrics:
    """Aggregate metrics for an agent session."""
    session_id: str
    # LLM costs
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    estimated_cost_usd: float = 0.0
    # Latency
    total_latency_ms: int = 0
    llm_latency_ms: int = 0
    tool_latency_ms: int = 0
    # Quality
    num_tool_calls: int = 0
    num_llm_calls: int = 0
    num_errors: int = 0
    reached_final_answer: bool = False
    # Model usage
    models_used: List[str] = field(default_factory=list)
    tools_called: List[str] = field(default_factory=list)

    def token_cost_estimate(self, input_cost_per_1m: float = 15.0,
                            output_cost_per_1m: float = 75.0) -> float:
        """Estimate cost in USD (claude-opus-4-6 rates)."""
        return (self.total_input_tokens / 1_000_000 * input_cost_per_1m +
                self.total_output_tokens / 1_000_000 * output_cost_per_1m)
```
## Debugging with Traces
The primary value of tracing in development is debugging unexpected agent behavior. When an agent produces a wrong result, the trace reveals:
- At which step the reasoning diverged from expectation
- What context the agent had at that step
- What tool calls returned and whether results were valid
- Whether token limits caused context truncation
- Which LLM call produced incorrect reasoning
Without tracing, reproducing and diagnosing agent failures requires guesswork. With full traces, failures become inspectable.
## Common Misconceptions
**Misconception: tracing is only needed in production.** Development tracing catches agent behavior issues before they reach production. Traces reveal when agents loop unnecessarily, miss relevant tool results, or produce low-quality outputs due to prompt issues. Many teams use tracing from the first prototype.
**Misconception: tracing adds significant overhead.** Well-implemented tracing adds roughly 1-5ms per event to agent runtime. For agents with 2-10 second LLM call latencies, this is negligible. Asynchronous trace export (standard in LangSmith and Langfuse) ensures tracing doesn't block agent execution.
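The async-export pattern itself is simple: enqueue on the hot path, ship from a background thread. A minimal stdlib sketch (real SDKs batch events and send them over HTTP; `AsyncTraceExporter` here is illustrative, not a real library class):

```python
import queue
import threading

class AsyncTraceExporter:
    """Buffered background exporter: record() never blocks on I/O."""

    def __init__(self):
        self._queue: queue.Queue = queue.Queue()
        self.exported = []
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def record(self, event: dict) -> None:
        # Hot path: just enqueue, microseconds of overhead
        self._queue.put(event)

    def _drain(self) -> None:
        while True:
            event = self._queue.get()
            # A real exporter would do a batched HTTP send here
            self.exported.append(event)
            self._queue.task_done()

exporter = AsyncTraceExporter()
exporter.record({"name": "llm-call-1", "latency_ms": 120})
exporter._queue.join()  # Flush before reading (demo only)
print(len(exporter.exported))  # 1
```

The agent thread only ever pays for the `put()`, which is why per-event overhead stays in the low-millisecond range even when the collector endpoint is slow.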
**Misconception: agents can be debugged from their outputs alone.** Agent failures are often invisible in the output itself: the agent might produce a plausible but incorrect answer without any error signals. Tracing reveals the reasoning path that led to the output, which is often where the problem actually lies.
## Related Terms
- Agent Audit Trail — Compliance-focused counterpart to operational tracing
- Agent Runtime — The execution engine being traced
- Agent Loop — The reasoning cycle that generates trace events
- Agent State — The state data captured in traces
- Agentic Workflow — Multi-step workflows requiring comprehensive trace coverage
- Build Your First AI Agent — Tutorial covering observability setup for agents
- LangChain vs AutoGen — How frameworks support native tracing integration
## Frequently Asked Questions

### What is agent tracing?
Agent tracing is the systematic collection of telemetry from AI agent executions — capturing LLM API calls, tool invocations, token usage, and latency in hierarchical spans. It enables developers to understand exactly what happened in any agent run, debug failures, optimize performance, and monitor agents in production.
### How is agent tracing different from an audit trail?
Tracing is primarily an observability tool for developers: debugging, performance optimization, and monitoring. Audit trails are compliance records: immutable logs for regulatory and accountability purposes. Good tracing systems like LangSmith and Langfuse often serve both needs, but the emphasis differs — tracing focuses on real-time analysis; audit trails focus on completeness and immutability.
### What tools are available for agent tracing?
LangSmith (deep LangChain/LangGraph integration), Langfuse (open-source, framework-agnostic), Arize Phoenix (AI observability), and OpenTelemetry with GenAI semantic conventions (for existing OTel infrastructure). Most work by wrapping LLM calls and tool executions to capture structured telemetry.
### What should I trace in an AI agent?
Start with: every LLM call (model, input/output tokens, latency, inputs, outputs), every tool call (name, inputs, outputs, execution time), errors with context, and session-level aggregates (total tokens, total latency, number of turns). Add custom attributes relevant to your application's specific debugging and monitoring needs.
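As an illustrative starting point, a minimal per-call record might look like this (the field names are assumptions for the sketch, not a standard schema):

```python
from dataclasses import dataclass, asdict

@dataclass
class LLMCallRecord:
    """Minimal fields worth capturing for every LLM call."""
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    stop_reason: str

record = LLMCallRecord(
    model="claude-opus-4-6",
    input_tokens=450,
    output_tokens=120,
    latency_ms=980,
    stop_reason="end_turn",
)
print(asdict(record)["model"])  # claude-opus-4-6
```

Keeping records as plain dataclasses or dicts makes it easy to ship them to whichever backend you adopt later, since every tracing tool ultimately ingests key-value span attributes.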