What Is Agent Observability?

A practical explanation of AI agent observability — how to trace agent execution paths, tool calls, and token usage. Covers LangSmith, Langfuse, and Arize, span and trace concepts, and why observability is critical for production agents.

Term Snapshot

Also known as: Agent Monitoring, Agent Tracing, LLM Observability

Related terms: What Is AI Agent Evaluation?, What Is Human-in-the-Loop AI?, What Is the Agent Loop?, What Are AI Agents?

Quick Definition#

Agent observability is the practice of capturing, storing, and analyzing detailed information about every action an AI agent takes during execution. This includes the inputs and outputs of each LLM call, every tool invocation and its result, token consumption at each step, latency measurements, and the reasoning chains that connect decisions. Without this data, production agents are effectively black boxes — when something goes wrong, there is no way to know why.

Before going deeper, review the foundational concepts at What Are AI Agents? and The Agent Loop. Browse the full AI Agents Glossary for all operations and monitoring terms.

Why Agent Observability Is Critical#

An AI agent in production may execute hundreds of steps for a single task: multiple LLM calls, several tool invocations, memory retrievals, and routing decisions. Any one of these steps can fail, produce unexpected output, or behave differently in production than it did during testing.

Without observability:

  • A failing agent produces no actionable error information
  • Debugging requires guessing which step caused the problem
  • Regressions go undetected until users report failures
  • Cost overruns from runaway token consumption are invisible until a billing surprise
  • Teams cannot confidently release changes because they cannot measure their impact

With observability:

  • Every failure has a traceable root cause
  • Performance regressions are detected automatically
  • Cost trends are visible per workflow and per step
  • Quality metrics are measurable over time
  • Teams can confidently optimize and iterate

For a connected view of agent quality measurement, see Agent Evaluation.

Traces and Spans#

The two fundamental concepts in agent observability are traces and spans, borrowed from distributed systems monitoring.

Trace#

A trace is the complete record of an agent's execution for a single task. It captures the full lifecycle from initial input to final output, including every intermediate step. A trace answers: "What did the agent do to complete this task?"

Span#

A span is a single, discrete unit of work within a trace. In agent contexts, spans represent individual operations:

  • LLM call span: one inference request to a language model, with input prompt, output, token counts, latency, and model name
  • Tool call span: one function execution, with arguments, result, and execution time
  • Retrieval span: one vector search or database lookup, with query, results, and similarity scores
  • Chain span: a logical grouping of multiple spans that represent one stage of the workflow

Spans nest within traces, giving you both a high-level view of the task execution and the ability to drill down into any specific step.
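The trace/span relationship can be sketched as a simple nested data model. This is illustrative only — the class and field names below are assumptions, not any vendor's API:

```python
from dataclasses import dataclass, field

# Illustrative only: a minimal trace/span model, not a real SDK's schema.
@dataclass
class Span:
    name: str
    kind: str                                  # "llm", "tool", "retrieval", "chain"
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

@dataclass
class Trace:
    task: str
    root: Span

# One task: a chain span containing an LLM call span and a tool call span.
root = Span("answer_question", "chain")
root.children.append(Span("plan", "llm", {"input_tokens": 412, "output_tokens": 58}))
root.children.append(Span("search_docs", "tool", {"args": {"query": "refund policy"}}))
trace = Trace(task="answer_question", root=root)

def count_spans(span):
    """Recursively count a span and all of its nested children."""
    return 1 + sum(count_spans(c) for c in span.children)

print(count_spans(trace.root))  # → 3
```

Walking the tree from the root gives the high-level view; inspecting one span's attributes is the drill-down.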

Key Data Points to Capture#

Effective observability requires capturing the right data at each span. Core data points:

For LLM call spans:

  • Input prompt (full text)
  • Output (full text)
  • Token counts (input, output, total)
  • Model name and version
  • Latency
  • Temperature and sampling parameters
  • Any error or refusal indicators

For tool call spans:

  • Tool name
  • Arguments passed
  • Return value or error
  • Execution latency
  • Success or failure status

For retrieval spans:

  • Query text
  • Top-k results returned
  • Similarity scores
  • Source metadata (document ID, chunk ID)
  • Latency

At the trace level:

  • Total wall time
  • Total token cost
  • Task success or failure
  • Final output
  • Any human intervention events
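Trace-level totals are typically rolled up from the span-level data above. A minimal sketch, assuming sequential spans and placeholder per-token prices (not real model pricing):

```python
# Illustrative rollup of trace-level totals from captured spans.
# Field names and the per-token prices are placeholders.
spans = [
    {"kind": "llm", "input_tokens": 900, "output_tokens": 150, "latency_s": 1.2},
    {"kind": "tool", "latency_s": 0.3},
    {"kind": "llm", "input_tokens": 1200, "output_tokens": 400, "latency_s": 2.1},
]

PRICE_IN, PRICE_OUT = 3e-06, 15e-06  # placeholder $/token, not real pricing

total_tokens = sum(s.get("input_tokens", 0) + s.get("output_tokens", 0) for s in spans)
total_cost = sum(
    s.get("input_tokens", 0) * PRICE_IN + s.get("output_tokens", 0) * PRICE_OUT
    for s in spans
)
wall_time = sum(s["latency_s"] for s in spans)  # assumes spans ran sequentially

print(total_tokens)  # → 2650
```

Real platforms compute these rollups for you, but the arithmetic is the same: trace metrics are aggregates over span metrics.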

Observability Tools for AI Agents#

LangSmith#

LangSmith is LangChain's first-party observability platform. It integrates automatically with LangChain and LangGraph applications with minimal setup — typically a few environment variables. Features include trace visualization, automatic span capture, evaluation tools, and a dataset builder for creating test suites from production traces.

For teams already using LangChain, LangSmith provides the lowest-friction path to observability. See Build an AI Agent with LangChain for integration details.
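The "few environment variables" setup looks roughly like this. The variable names below match LangSmith's documented configuration at the time of writing, but may differ by SDK version, and the key value is a placeholder:

```python
import os

# Hedged sketch: enabling LangSmith tracing for a LangChain/LangGraph app
# is typically just environment configuration. Check the LangSmith docs
# for the variable names your SDK version expects.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "YOUR_LANGSMITH_API_KEY"  # placeholder
os.environ["LANGCHAIN_PROJECT"] = "my-agent-prod"           # optional: groups traces

# Any LangChain/LangGraph code run after this point is traced automatically;
# no per-call instrumentation is required.
print(os.environ["LANGCHAIN_TRACING_V2"])  # → true
```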

Langfuse#

Langfuse is an open-source, framework-agnostic observability platform. It can be self-hosted or used as a cloud service, supports any LLM provider, and integrates via an SDK with manual instrumentation or through LangChain's callback system. Langfuse also includes evaluation workflows, prompt management, and user feedback capture.

Teams that need control over their data infrastructure or use custom frameworks often choose Langfuse.
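Manual SDK instrumentation of the kind Langfuse supports usually follows a wrap-each-unit-of-work pattern. The stdlib sketch below illustrates that pattern only — it is not Langfuse's actual API:

```python
import time
from contextlib import contextmanager

# Generic sketch of SDK-style manual instrumentation (not a real SDK's API):
# wrap each unit of work in a context manager that records a span.
captured = []  # stand-in for an SDK buffer that ships spans to a backend

@contextmanager
def span(name, kind, **attributes):
    start = time.perf_counter()
    record = {"name": name, "kind": kind, "status": "ok", **attributes}
    try:
        yield record
    except Exception as exc:
        record["status"] = "error"      # failures are captured, not swallowed
        record["error"] = repr(exc)
        raise
    finally:
        record["latency_s"] = time.perf_counter() - start
        captured.append(record)

with span("lookup_order", "tool", args={"order_id": "A-123"}) as s:
    s["result"] = {"status": "shipped"}  # the tool's return value

print(captured[0]["name"], captured[0]["status"])  # → lookup_order ok
```

The same wrapper applies to LLM calls and retrievals by changing `kind` and the attributes recorded.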

Arize#

Arize is a broader ML observability platform that includes LLM and agent tracing capabilities alongside traditional ML monitoring features. It provides drift detection, performance analytics, and compliance tooling that goes beyond standard agent tracing. Teams with existing Arize deployments for other ML systems can extend coverage to agents.

Connecting Observability to Evaluation#

Observability data is the raw material for Agent Evaluation. Production traces can be:

  • Exported as test cases for regression test suites
  • Analyzed to identify the most common failure modes
  • Used to track accuracy and task completion rates over time
  • Reviewed by human annotators to generate evaluation labels

The feedback loop between observability and evaluation is essential for continuous agent improvement. For structured human review of agent traces, see Human-in-the-Loop AI.
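The export step above can be sketched as a simple transform over trace records. The field names here are assumptions, not a real platform's export schema:

```python
# Illustrative sketch: turning production trace records into regression test
# cases for an evaluation suite. Field names are placeholders.
traces = [
    {"input": "What is your refund window?", "final_output": "30 days.", "success": True},
    {"input": "Cancel my subscription", "final_output": None, "success": False},
]

# Successful traces become regression cases (expected output = observed output);
# failed traces are queued for human review and labeling instead.
test_cases = [
    {"input": t["input"], "expected": t["final_output"]}
    for t in traces if t["success"]
]
needs_review = [t for t in traces if not t["success"]]

print(len(test_cases), len(needs_review))  # → 1 1
```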

What to Monitor in Production#

Beyond capturing raw trace data, production observability requires active monitoring:

Availability metrics:

  • Task completion rate
  • Error rate by error type
  • Tool call failure rate

Performance metrics:

  • P50, P95, P99 latency per task type
  • Token consumption per task
  • Cost per task over time

Quality metrics:

  • Hallucination rate (if using grounding checks)
  • Step accuracy on critical decision points
  • User satisfaction or feedback scores where available

Operational metrics:

  • Tokens per minute across the agent fleet
  • Concurrent agent runs
  • Queue depth and throughput

Set alerts on task failure rate, latency thresholds, and cost anomalies so issues are caught before they affect large volumes of users.

Implementation Checklist#

  1. Instrument every LLM call with span capture before going to production.
  2. Add tool call tracing with argument and result logging.
  3. Capture retrieval spans if using RAG components.
  4. Set up alerts on task failure rate and latency thresholds.
  5. Export production traces as evaluation test cases monthly.
  6. Track cost per task weekly to detect unexpected increases.
  7. Connect observability data to your evaluation pipeline for continuous quality monitoring.

Frequently Asked Questions#

What is agent observability and why does it matter?#

Agent observability is the ability to trace and understand everything an AI agent does during execution. Without it, debugging production failures is nearly impossible because failures could occur anywhere across dozens of steps.

What is the difference between a trace and a span?#

A trace is the complete record of an agent completing one task. A span is a single unit of work within that trace — one LLM call, one tool invocation, or one retrieval step.

What are the best observability tools for AI agents?#

LangSmith, Langfuse, and Arize are the most widely used. LangSmith integrates tightly with LangChain. Langfuse is open-source and framework-agnostic. Arize provides broader ML monitoring capabilities.