# What Is Agent Tracing?

## Quick Definition
Agent tracing is the collection of structured telemetry from AI agent executions — capturing LLM calls, tool invocations, latency measurements, token usage, and reasoning chains in a hierarchical structure that enables debugging, performance optimization, and production monitoring. Tracing gives developers visibility into what their agents actually do at runtime, making it possible to diagnose failures, identify bottlenecks, and continuously improve agent behavior.
Browse all AI agent terms in the AI Agent Glossary. For the compliance-focused counterpart to tracing, see Agent Audit Trail. For the execution infrastructure being traced, see Agent Runtime.
## Why Agent Tracing Is Different from Traditional Observability
Traditional application tracing captures function calls, HTTP requests, and database queries. Agent tracing captures something more complex: a probabilistic reasoning process in which the path from input to output is non-deterministic and involves multiple model calls, tool invocations, and accumulated context.
Key differences:
| Dimension | Traditional Tracing | Agent Tracing |
|---|---|---|
| Path | Deterministic call stack | Probabilistic reasoning chain |
| Primary data | Method calls, HTTP requests | LLM calls + tool invocations |
| Debugging | Stacktrace → root cause | Conversation history → reasoning failure |
| Performance | Request latency | LLM call latency + token cost |
| Quality | Error rate | Response quality + hallucination rate |
| Tools | Datadog, Jaeger | LangSmith, Langfuse, Arize Phoenix |
The fundamental difference: agent debugging requires understanding why the agent reasoned in a particular way, which requires seeing the conversation context at each step — not just the technical call sequence.
## The Trace Hierarchy
Agent traces are hierarchical: a session contains runs, runs contain LLM calls and tool calls, which may spawn child runs:
```
Session (user_id: u123, session_id: s456)
├── Run: "research quarterly report"
│   ├── LLM Call: Planning (120ms, 450 tokens)
│   │   └── Tool Decision: web_search
│   ├── Tool Call: web_search("Q3 earnings tech companies")
│   │   └── Duration: 1.2s, results: 5 items
│   ├── LLM Call: Analysis (340ms, 820 tokens)
│   │   └── Tool Decision: web_search (second query)
│   ├── Tool Call: web_search("AI company earnings Q3 2026")
│   │   └── Duration: 0.9s, results: 3 items
│   └── LLM Call: Synthesis (520ms, 1250 tokens)
│       └── Stop: end_turn → Final output
└── Metrics: Total 3.2s, 2520 tokens, 2 tool calls
```
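In memory, this hierarchy is just a tree of spans that aggregates bottom-up. A minimal sketch (the `Span` class and its field names are illustrative, not any particular SDK's API):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    """One node in a trace tree: an LLM call, tool call, or child run."""
    name: str
    kind: str                      # "llm", "tool", or "run"
    duration_ms: int = 0
    tokens: int = 0
    children: List["Span"] = field(default_factory=list)

    def total_tokens(self) -> int:
        # Aggregate token usage bottom-up through the tree
        return self.tokens + sum(c.total_tokens() for c in self.children)

# The example trace above, as a tree
run = Span("research quarterly report", "run", children=[
    Span("planning", "llm", duration_ms=120, tokens=450),
    Span("web_search", "tool", duration_ms=1200),
    Span("analysis", "llm", duration_ms=340, tokens=820),
    Span("web_search", "tool", duration_ms=900),
    Span("synthesis", "llm", duration_ms=520, tokens=1250),
])

print(run.total_tokens())  # 2520
```

Session-level metrics like the 2520-token total fall out of a single recursive pass over the tree, which is why tracing backends store spans with parent pointers rather than as flat logs.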
## Implementing Agent Tracing

### Manual Tracing with Langfuse
```python
from langfuse import Langfuse
from anthropic import Anthropic
import time

langfuse = Langfuse()
anthropic_client = Anthropic()

def run_traced_agent(user_message: str,
                     session_id: str | None = None) -> str:
    """Run agent with Langfuse tracing."""
    # Create a trace for the full session
    trace = langfuse.trace(
        name="agent-run",
        input={"user_message": user_message},
        session_id=session_id,
        tags=["production", "v1"]
    )
    messages = [{"role": "user", "content": user_message}]
    tools = [web_search_tool_schema]  # Defined elsewhere
    step = 0

    try:
        for _ in range(10):
            step += 1
            # Create span for the LLM call
            llm_span = trace.span(
                name=f"llm-call-{step}",
                input={"messages": messages, "step": step}
            )
            start_time = time.time()
            response = anthropic_client.messages.create(
                model="claude-opus-4-6",
                max_tokens=2048,
                messages=messages,
                tools=tools
            )
            latency_ms = int((time.time() - start_time) * 1000)

            # Update span with LLM call results
            response_text = next(
                (b.text for b in response.content if hasattr(b, "text")), ""
            )
            llm_span.update(
                output={"response": response_text, "stop_reason": response.stop_reason},
                metadata={
                    "model": "claude-opus-4-6",
                    "input_tokens": response.usage.input_tokens,
                    "output_tokens": response.usage.output_tokens,
                    "latency_ms": latency_ms
                }
            )
            llm_span.end()

            if response.stop_reason == "end_turn":
                # Final output — score it
                trace.score(
                    name="agent-completion",
                    value=1,
                    comment="Agent reached final answer"
                )
                trace.update(output={"final_answer": response_text})
                return response_text

            # Trace tool calls
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    tool_span = trace.span(
                        name=f"tool-{block.name}-{step}",
                        input={"tool": block.name, "input": block.input}
                    )
                    tool_start = time.time()
                    try:
                        result = execute_tool(block.name, block.input)
                        tool_duration = int((time.time() - tool_start) * 1000)
                        tool_span.update(
                            output={"result": str(result)[:500]},  # Truncate for storage
                            metadata={"duration_ms": tool_duration, "status": "success"}
                        )
                    except Exception as e:
                        tool_span.update(
                            output={"error": str(e)},
                            metadata={"status": "error"}
                        )
                        result = f"Error: {e}"
                    finally:
                        tool_span.end()
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result)
                    })
            messages.append({"role": "user", "content": tool_results})
    except Exception as e:
        trace.update(output={"error": str(e)}, metadata={"status": "failed"})
        raise

    return "Max iterations reached"
```
### OpenTelemetry-Based Tracing
For teams with existing OTel infrastructure:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from anthropic import Anthropic

# Set up the OTel tracer
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-agent", "1.0.0")

def otel_traced_llm_call(model: str, messages: list,
                         tools: list | None = None) -> dict:
    """Make an LLM call wrapped in an OpenTelemetry span."""
    with tracer.start_as_current_span("llm.call") as span:
        # Standard GenAI semantic conventions
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.max_tokens", 2048)
        span.set_attribute("llm.request.messages_count", len(messages))

        client = Anthropic()
        response = client.messages.create(
            model=model,
            max_tokens=2048,
            messages=messages,
            tools=tools or []
        )

        # Record response metadata
        span.set_attribute("gen_ai.usage.input_tokens",
                           response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens",
                           response.usage.output_tokens)
        span.set_attribute("gen_ai.response.stop_reason",
                           response.stop_reason)
        return response
```
## Key Metrics to Track
```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AgentSessionMetrics:
    """Aggregate metrics for an agent session."""
    session_id: str
    # LLM costs
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    estimated_cost_usd: float = 0.0
    # Latency
    total_latency_ms: int = 0
    llm_latency_ms: int = 0
    tool_latency_ms: int = 0
    # Quality
    num_tool_calls: int = 0
    num_llm_calls: int = 0
    num_errors: int = 0
    reached_final_answer: bool = False
    # Model usage
    models_used: List[str] = field(default_factory=list)
    tools_called: List[str] = field(default_factory=list)

    def token_cost_estimate(self, input_cost_per_1m: float = 15.0,
                            output_cost_per_1m: float = 75.0) -> float:
        """Estimate cost in USD (claude-opus-4-6 rates)."""
        return (self.total_input_tokens / 1_000_000 * input_cost_per_1m +
                self.total_output_tokens / 1_000_000 * output_cost_per_1m)
```
## Debugging with Traces
The primary value of tracing in development is debugging unexpected agent behavior. When an agent produces a wrong result, the trace reveals:
- At which step the reasoning diverged from expectation
- What context the agent had at that step
- What tool calls returned and whether results were valid
- Whether token limits caused context truncation
- Which LLM call produced incorrect reasoning
Without tracing, reproducing and diagnosing agent failures requires guesswork. With full traces, failures become inspectable.
## Common Misconceptions
**Misconception: tracing is only needed in production.** Development tracing catches agent behavior issues before they reach production. Traces reveal when agents loop unnecessarily, miss relevant tool results, or produce low-quality outputs due to prompt issues. Many teams use tracing from the first prototype.
**Misconception: tracing adds significant overhead.** Well-implemented tracing adds roughly 1-5ms per event to agent runtime. For agents with 2-10 second LLM call latencies, this is negligible. Asynchronous trace export (standard in LangSmith and Langfuse) ensures tracing doesn't block agent execution.
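The async-export pattern itself is simple: enqueue on the hot path, ship from a background thread. A minimal stdlib sketch (real SDKs batch events and send them over HTTP; `AsyncTraceExporter` here is illustrative, not a real library class):

```python
import queue
import threading

class AsyncTraceExporter:
    """Buffered background exporter: record() never blocks on I/O."""

    def __init__(self):
        self._queue: queue.Queue = queue.Queue()
        self.exported = []
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def record(self, event: dict) -> None:
        # Hot path: just enqueue, microseconds of overhead
        self._queue.put(event)

    def _drain(self) -> None:
        while True:
            event = self._queue.get()
            # A real exporter would do a batched HTTP send here
            self.exported.append(event)
            self._queue.task_done()

exporter = AsyncTraceExporter()
exporter.record({"name": "llm-call-1", "latency_ms": 120})
exporter._queue.join()  # Flush before reading (demo only)
print(len(exporter.exported))  # 1
```

The agent thread only ever pays for the `put()`, which is why per-event overhead stays in the low-millisecond range even when the collector endpoint is slow.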
**Misconception: agents can be debugged from their outputs alone.** Agent failures are often invisible in the output itself: the agent might produce a plausible but incorrect answer without any error signals. Tracing reveals the reasoning path that led to the output, which is often where the problem actually lies.
## Related Terms
- Agent Audit Trail — Compliance-focused counterpart to operational tracing
- Agent Runtime — The execution engine being traced
- Agent Loop — The reasoning cycle that generates trace events
- Agent State — The state data captured in traces
- Agentic Workflow — Multi-step workflows requiring comprehensive trace coverage
- Build Your First AI Agent — Tutorial covering observability setup for agents
- LangChain vs AutoGen — How frameworks support native tracing integration
## Frequently Asked Questions

### What is agent tracing?
Agent tracing is the systematic collection of telemetry from AI agent executions — capturing LLM API calls, tool invocations, token usage, and latency in hierarchical spans. It enables developers to understand exactly what happened in any agent run, debug failures, optimize performance, and monitor agents in production.
### How is agent tracing different from an audit trail?
Tracing is primarily an observability tool for developers: debugging, performance optimization, and monitoring. Audit trails are compliance records: immutable logs for regulatory and accountability purposes. Good tracing systems like LangSmith and Langfuse often serve both needs, but the emphasis differs — tracing focuses on real-time analysis; audit trails focus on completeness and immutability.
### What tools are available for agent tracing?
LangSmith (deep LangChain/LangGraph integration), Langfuse (open-source, framework-agnostic), Arize Phoenix (AI observability), and OpenTelemetry with GenAI semantic conventions (for existing OTel infrastructure). Most work by wrapping LLM calls and tool executions to capture structured telemetry.
### What should I trace in an AI agent?
Start with: every LLM call (model, input/output tokens, latency, inputs, outputs), every tool call (name, inputs, outputs, execution time), errors with context, and session-level aggregates (total tokens, total latency, number of turns). Add custom attributes relevant to your application's specific debugging and monitoring needs.
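As an illustrative starting point, a minimal per-call record might look like this (the field names are assumptions for the sketch, not a standard schema):

```python
from dataclasses import dataclass, asdict

@dataclass
class LLMCallRecord:
    """Minimal fields worth capturing for every LLM call."""
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    stop_reason: str

record = LLMCallRecord(
    model="claude-opus-4-6",
    input_tokens=450,
    output_tokens=120,
    latency_ms=980,
    stop_reason="end_turn",
)
print(asdict(record)["model"])  # claude-opus-4-6
```

Keeping records as plain dataclasses or dicts makes it easy to ship them to whichever backend you adopt later, since every tracing tool ultimately ingests key-value span attributes.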