
Profile · LLM Observability Platform · LangChain Inc. · 12 min read

LangSmith: Complete Platform Profile

LangSmith is LangChain's production observability and evaluation platform for LLM applications and AI agents. It provides end-to-end tracing, dataset management, evaluation pipelines, prompt versioning, and production monitoring with deep native integration into the LangChain and LangGraph ecosystem, making it the default choice for teams already invested in LangChain tooling.

By AI Agents Guide Editorial • February 28, 2026

Table of Contents

  1. Overview
  2. Core Features
  3. End-to-End Tracing
  4. Evaluation and Testing
  5. Prompt Hub
  6. Production Monitoring and Feedback
  7. Dataset Management
  8. Pricing and Plans
  9. Strengths
  10. Limitations
  11. Ideal Use Cases
  12. Getting Started
  13. How It Compares
  14. Bottom Line
  15. Frequently Asked Questions


LangSmith is the production observability and evaluation platform built by LangChain Inc. for LLM applications and AI agents. Released in 2023 alongside the broader LangChain ecosystem, it was designed to address the observability vacuum that most teams encounter when they first try to run LLM applications in production: without proper tooling, understanding why an agent behaved incorrectly, where latency is accumulating, or whether a prompt change improved quality is extremely difficult. LangSmith fills this gap with structured trace collection, evaluation pipelines, prompt management, and team collaboration features, all with deep native integration into the LangChain and LangGraph frameworks.

Browse the AI agent tools directory to compare LangSmith against other LLM observability platforms, and see the LangChain profile for context on the broader LangChain ecosystem.


Overview

LangSmith was developed by LangChain Inc., the company founded by Harrison Chase after LangChain became the most widely adopted open-source LLM framework. As LangChain's user base grew, the company recognized that debugging and improving production LLM applications required specialized tooling beyond what general observability platforms could provide.

The core insight driving LangSmith's design is that LLM applications fail differently from traditional software. In traditional software, failures are usually deterministic: a bug produces the same incorrect output every time. In LLM applications, failures are often probabilistic, context-dependent, and require understanding the full chain of prompts, contexts, and model outputs that led to a particular response. A trace viewer that understands the structure of LangChain applications — chains, agents, retrievers, tools — is fundamentally more useful than a generic distributed tracing tool when debugging LLM issues.

LangSmith is available as a managed cloud service with a freemium pricing model. It is not open-source, distinguishing it from Langfuse, which is MIT-licensed and self-hostable. This has implications for teams with data sovereignty requirements: LangSmith processes trace data in LangChain's cloud infrastructure, while Langfuse can be self-hosted.

The platform has grown rapidly alongside the LangChain ecosystem. Teams building LangGraph agents for complex multi-step workflows benefit particularly from LangSmith's ability to visualize the graph execution across agent nodes, making it the default observability choice for the LangGraph community.


Core Features

End-to-End Tracing

LangSmith's tracing captures every step in a LangChain or LangGraph execution: chain invocations, LLM calls with full prompt inputs and outputs, tool calls with their arguments and results, retrieval steps showing what documents were returned, and the full latency breakdown across all components.

Traces are organized in a hierarchical tree structure that mirrors the nesting of LangChain components. A root trace represents the full end-to-end operation, with child spans for each chain, retrieval step, or tool call nested beneath it. This structure makes it straightforward to understand how a complex agent decomposed a task and which components contributed to the final output.
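The rollup behaves like a span tree. As a framework-free sketch (not LangSmith's actual data model), nested span latencies can be aggregated up to the root trace:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Illustrative span: one chain, LLM call, retrieval, or tool step."""
    name: str
    latency_ms: float = 0.0           # time spent in this span itself
    children: list["Span"] = field(default_factory=list)

    def total_latency(self) -> float:
        """Roll latency up across this span and everything nested under it."""
        return self.latency_ms + sum(c.total_latency() for c in self.children)

# A root trace for one agent run, with nested retrieval/LLM/tool spans
root = Span("agent_run", 5.0, [
    Span("retriever", 120.0),
    Span("llm_call", 900.0),
    Span("tool:search", 300.0, [Span("llm_rerank", 150.0)]),
])
print(root.total_latency())  # 1475.0
```

Walking such a tree top-down is what makes it easy to see which component dominated latency in a given run.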

For LangGraph applications, LangSmith renders the graph execution with node-by-node visibility, showing which nodes were executed, in which order, what state was passed at each edge, and where the graph branched or looped. This graph visualization is unique to LangSmith and significantly reduces the time required to debug complex LangGraph workflows.

Each trace includes the full token usage and cost breakdown for every LLM call, allowing teams to understand the cost composition of their application and identify expensive operations that might be worth optimizing. Aggregated cost reporting across all traces provides a production cost monitoring capability.
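As a toy illustration of that per-call cost arithmetic (the per-million-token prices below are placeholders, not current provider rates):

```python
# Hypothetical per-million-token prices in USD; real prices vary by model and date.
PRICES = {"gpt-4o": {"input": 2.50, "output": 10.00}}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one LLM call — the per-call figure a trace would show."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Summing across the calls in a trace gives the trace-level cost rollup
calls = [("gpt-4o", 1200, 300), ("gpt-4o", 400, 150)]
total = sum(call_cost(m, i, o) for m, i, o in calls)
print(round(total, 6))
```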

Evaluation and Testing

LangSmith's evaluation system enables teams to assess the quality of LLM application outputs systematically. Evaluations can be run against datasets of input/output pairs, producing quality scores that can be tracked over time and compared across different prompt versions or model configurations.

LangSmith provides several built-in evaluators for common quality dimensions: correctness (comparing output to a reference answer), context precision and recall (for RAG evaluation), toxicity detection, and custom criteria defined in natural language. Teams can also implement custom evaluators as Python functions or call external evaluation services.
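A custom evaluator is conceptually just a function that maps an output and a reference to a named score. A minimal framework-free sketch (LangSmith's real evaluator signature takes run and example objects; this only shows the shape):

```python
def exact_match(outputs: dict, reference: dict) -> dict:
    """Toy correctness evaluator: 1.0 when the answer matches the reference,
    ignoring case and surrounding whitespace."""
    got = outputs.get("answer", "").strip().lower()
    want = reference.get("answer", "").strip().lower()
    return {"key": "exact_match", "score": 1.0 if got == want else 0.0}

print(exact_match({"answer": "Paris"}, {"answer": "paris"}))
# {'key': 'exact_match', 'score': 1.0}
```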

The evaluation workflow integrates directly with the tracing system. Runs (executions of a chain or agent against a dataset) are recorded as traces in LangSmith, so evaluation results are linkable to the specific prompts, contexts, and model outputs that produced them. This traceability is essential for understanding why evaluations succeed or fail on specific examples.

Regression testing is a particularly valuable use case: before deploying a prompt or model change, run an evaluation against a curated dataset of representative inputs and compare scores to the current production baseline. This prevents quality regressions from reaching production without detection.
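The gate logic behind that comparison is simple to sketch (illustrative only; the metric names and tolerance are made up):

```python
def gate(baseline: dict, candidate: dict, tolerance: float = 0.02) -> bool:
    """Pass only if no metric drops more than `tolerance` below its baseline."""
    return all(candidate[m] >= baseline[m] - tolerance for m in baseline)

baseline  = {"correctness": 0.91, "faithfulness": 0.88}
candidate = {"correctness": 0.93, "faithfulness": 0.84}  # faithfulness regressed
print(gate(baseline, candidate))  # False: block the deploy
```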

Prompt Hub

LangSmith's Prompt Hub is a registry for prompts — a place to version, share, and reuse prompt templates within and across teams. Prompts stored in the Hub are versioned, can be tagged and commented, and can be pulled directly into LangChain applications at runtime using the hub.pull() function.

The Hub serves both organizational and discovery purposes. Teams can maintain a canonical set of production-tested prompts that are shared across projects, preventing the proliferation of slightly different prompt variants that solve the same problem. The public Hub also contains community-shared prompts, which developers can use as starting points.

Prompt versioning in the Hub links back to LangSmith traces: when a trace is created with a Hub prompt, the trace records exactly which version of which prompt was used. This makes it possible to analyze the performance impact of prompt changes by comparing evaluation scores across prompt versions. See the LangChain profile for how LangSmith integrates into the broader LangChain development workflow.

Production Monitoring and Feedback

Beyond development-time debugging and evaluation, LangSmith supports production monitoring with aggregated metrics over time: trace volume, latency percentiles, error rates, token costs, and evaluation scores. Dashboard views allow teams to track application health and quality over time, identifying regressions or unusual patterns.

User feedback collection integrates with LangSmith's scoring system. Applications can programmatically submit thumbs-up/down or numeric feedback from end users, attaching it to specific traces. This creates a feedback loop where real user quality signals augment automated evaluation scores, providing ground truth from actual users rather than only from evaluators.
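The feedback loop reduces to attaching scores to trace IDs and aggregating them over time. A toy sketch with hypothetical trace IDs (score 1 for thumbs-up, 0 for thumbs-down):

```python
from collections import defaultdict

# Feedback events as an application might submit them: (trace_id, score)
events = [("t1", 1), ("t2", 0), ("t3", 1), ("t1", 1), ("t4", 0)]

# Group feedback by trace so individual traces can be inspected later
by_trace = defaultdict(list)
for trace_id, score in events:
    by_trace[trace_id].append(score)

# Aggregate approval rate — the kind of signal a monitoring dashboard plots
approval = sum(s for _, s in events) / len(events)
print(f"{approval:.0%} thumbs-up across {len(by_trace)} traces")
```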

Human annotation workflows allow team members to review flagged traces, add quality scores, and leave comments. Annotation queues can be filtered by criteria — low automated evaluation scores, specific user feedback, high cost traces — to focus human review time on the most valuable examples.

Dataset Management

LangSmith's dataset functionality allows teams to build, curate, and manage evaluation datasets from multiple sources: manually constructed examples, traces selected from production, and imported CSV or JSON files. Datasets support input/output pairs with optional reference outputs and metadata.

Building datasets from production traces is a core workflow: identify traces where the application performed well or poorly, add them to a dataset with appropriate annotations, and use the dataset for regression testing and evaluation. Over time, this produces a dataset that genuinely represents the distribution of real-world inputs and failure modes.
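A minimal sketch of that curation step, using toy trace records and a made-up score threshold rather than the LangSmith SDK:

```python
# Toy trace records standing in for production runs: (inputs, outputs, score)
traces = [
    ({"question": "Capital of France?"}, {"answer": "Paris"}, 0.95),
    ({"question": "Capital of Peru?"},   {"answer": "Quito"}, 0.10),  # failure
    ({"question": "2 + 2?"},             {"answer": "4"},     0.99),
]

# Curate: keep clear successes and clear failures, skip the ambiguous middle
dataset = [
    {"inputs": t[0], "outputs": t[1], "label": "pass" if t[2] >= 0.9 else "fail"}
    for t in traces
    if t[2] >= 0.9 or t[2] <= 0.2
]
print(len(dataset))  # 3
```

In practice the same shape of workflow runs against real traces queried from LangSmith, with human annotation replacing the hard-coded threshold.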

Datasets can be used both for one-time evaluation runs and for continuous evaluation — new traces are automatically evaluated against the dataset as they arrive in production, surfacing quality changes as soon as they occur rather than requiring manual evaluation runs.



Pricing and Plans

LangSmith uses a freemium pricing model:

  • Developer (Free): 5,000 traces/month, single user, 14-day trace retention
  • Plus: $39/seat/month — higher trace limits, longer retention, team features
  • Enterprise: Custom pricing — SSO, on-premises deployment options, SLA, advanced security

The free tier is sufficient for development and small-scale testing. The Plus tier is appropriate for teams actively using LangSmith for staging and production monitoring. Enterprise pricing addresses organizations with compliance requirements or very high trace volumes.

Note that trace data is processed on LangChain Inc.'s infrastructure in all tiers. For organizations that cannot send trace data off-premises, the enterprise tier includes options for on-premises deployment, or Langfuse (self-hostable) may be a better fit.


Strengths

Native LangChain and LangGraph integration. No other observability tool integrates as deeply with the LangChain ecosystem. Tracing is automatic when LangSmith is configured — no manual instrumentation is required for standard LangChain components — and the LangGraph graph execution visualization is a unique capability.

Prompt Hub for team collaboration. The centralized prompt registry with versioning and sharing capabilities directly addresses a real team coordination problem: maintaining consistent, tested prompts across a codebase and team. This is a feature that independent observability tools do not offer natively.

Evaluation tightly integrated with tracing. The connection between traces, datasets, evaluation runs, and prompt versions in a single system enables iterative improvement workflows that are difficult to replicate with separate tools.

Actively developed alongside LangChain ecosystem. New LangChain and LangGraph features get LangSmith support quickly since both are developed by the same team. Users of newer LangChain capabilities are likely to find LangSmith support before it appears in third-party observability tools.


Limitations

Not open-source or self-hostable on standard tiers. Unlike Langfuse, LangSmith is a managed cloud service on all standard plans. Organizations with strict data sovereignty requirements must negotiate enterprise terms for on-premises options or choose a self-hostable alternative.

Primary value is for LangChain users. LangSmith's deepest value proposition is its integration with LangChain. Teams not using LangChain will find that other observability platforms offer equivalent or better support for their stack with less friction.

Free tier limitations constrain real-world use. 5,000 traces/month is quickly exhausted in any meaningful production application. Teams that want to use LangSmith for production monitoring will typically need the Plus tier.

Vendor lock-in considerations. Using LangSmith deeply — for prompt management, evaluation datasets, and production monitoring — creates meaningful switching costs if the team later wants to move to a different observability tool. This is worth considering during initial platform selection.


Ideal Use Cases

  • Teams fully invested in LangChain or LangGraph: For organizations where LangChain is the primary development framework, LangSmith's native integration and automatic tracing make it the natural default.
  • LangGraph agent development and debugging: The graph execution visualization is uniquely valuable for understanding and debugging complex multi-node LangGraph workflows that would be difficult to trace with generic observability tools.
  • Teams building rigorous evaluation pipelines: Organizations that want to invest in systematic evaluation — regression testing, continuous monitoring, human annotation workflows — benefit from LangSmith's integrated dataset and evaluation system.
  • Collaborative prompt engineering: Teams where multiple people work on prompt development and need a shared, versioned registry for tested prompts will benefit from the Prompt Hub.

Getting Started#

LangSmith requires only a few environment variables to start collecting traces from any LangChain application:

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="ls__..."
export LANGCHAIN_PROJECT="my-project"  # optional, defaults to "default"

With these variables set, any LangChain chain or agent automatically sends traces to LangSmith. No additional instrumentation is required for standard LangChain components.

For LangGraph, tracing works identically — set the environment variables and all graph executions are captured with node-level visibility.

To evaluate a chain against a dataset:

from langsmith import Client
from langchain_openai import ChatOpenAI  # ChatOpenAI now lives in the langchain-openai package

client = Client()

# Create a dataset
dataset = client.create_dataset("my-eval-dataset")
client.create_example(
    inputs={"question": "What is the capital of France?"},
    outputs={"answer": "Paris"},
    dataset_id=dataset.id
)

# Run evaluation
from langchain.smith import RunEvalConfig, run_on_dataset

eval_config = RunEvalConfig(
    evaluators=["qa"],  # built-in Q&A correctness evaluator
)

results = run_on_dataset(
    dataset_name="my-eval-dataset",
    llm_or_chain_factory=lambda: ChatOpenAI(model="gpt-4o"),
    evaluation=eval_config,
)

To use the Prompt Hub:

from langchain import hub

# Pull a prompt from the Hub
prompt = hub.pull("my-org/my-prompt:v2")

How It Compares

LangSmith vs Langfuse: Langfuse is the primary alternative, with framework-agnostic support, MIT open-source license, and self-hostable deployment. For teams not on LangChain, Langfuse is generally preferred. For teams fully committed to LangChain, LangSmith's native integration and LangGraph visualization provide advantages. For teams with data sovereignty requirements, Langfuse's self-hostable architecture is often decisive.

LangSmith vs Arize Phoenix: Arize Phoenix is an open-source alternative with strong evaluation capabilities and good framework coverage. It has stronger heritage in ML explainability and model monitoring. LangSmith has better LangChain-specific integration and the Prompt Hub. Phoenix is more relevant for teams with broad ML observability needs beyond LLM applications.

LangSmith vs custom OpenTelemetry instrumentation: Teams with mature observability infrastructure may prefer implementing OpenTelemetry instrumentation for their LLM applications and routing traces to existing backends (Jaeger, Honeycomb, Grafana Tempo). This approach avoids LangSmith's managed service dependency but requires more engineering investment and lacks LLM-specific features like prompt versioning and automated LLM evaluation.

For broader context on LLM infrastructure and tooling choices, browse the AI agent tools directory.


Bottom Line

LangSmith is the most natural observability choice for teams building on LangChain and LangGraph. The zero-friction tracing setup, LangGraph execution visualization, and Prompt Hub provide real value that is difficult to replicate with generic observability tooling. The evaluation system's integration with tracing enables improvement workflows that would take multiple separate tools to replicate elsewhere.

The platform's limitations — managed cloud only, LangChain-centric value, free tier constraints — are worth weighing carefully. Teams with data sovereignty requirements or multi-framework stacks should evaluate Langfuse as the primary alternative. But for the many teams for whom LangChain is their primary AI development framework, LangSmith is the observability platform with the best fit.

Best for: Development teams using LangChain or LangGraph as their primary AI framework who need production-grade observability, evaluation pipelines, and collaborative prompt management tightly integrated with their existing stack.


Frequently Asked Questions

Is LangSmith required to use LangChain?

No — LangSmith is an optional observability add-on, not a requirement. LangChain applications run perfectly well without LangSmith. However, operating LangChain agents in production without any observability tool makes debugging very difficult. LangSmith is the most convenient option for LangChain users, but alternatives like Langfuse also integrate with LangChain through its callback system.

How does LangSmith differ from LangChain?

LangChain is the open-source framework for building LLM applications — the code library you import and use to build chains, agents, and retrievers. LangSmith is the observability and evaluation platform — the cloud service where your LangChain applications send their execution traces. They are separate products from the same company. You use LangChain to build AI applications; you use LangSmith to understand, debug, and improve those applications in production.

Can LangSmith work with non-LangChain applications?

Yes, to a degree. LangSmith exposes a REST API and Python/JavaScript SDKs for manual trace submission, so any application can send traces to LangSmith without using LangChain. However, the integration is most seamless for LangChain applications where tracing is fully automatic. For non-LangChain applications, Langfuse typically offers better framework-agnostic support with equivalent features and the option to self-host.

What is the LangSmith Prompt Hub?

The Prompt Hub is LangSmith's registry for sharing and versioning prompt templates. Teams use it to maintain a canonical set of production-tested prompts that are shared across projects and accessible by all team members. Prompts in the Hub are versioned, can be tagged, and can be pulled directly into LangChain applications at runtime with hub.pull("org/prompt-name"). This prevents the common problem of having many slightly different versions of the same prompt scattered across a codebase.

How does LangSmith handle evaluation for RAG applications?

LangSmith includes built-in evaluators designed for RAG quality assessment, including context precision (how much of the retrieved context was actually used in the answer), context recall (how much of the relevant information was retrieved), answer faithfulness (whether the answer is grounded in the retrieved context), and answer relevance (whether the answer addresses the question). These evaluators use LLM-as-judge patterns and can be supplemented with custom evaluators for domain-specific quality criteria. The evaluation results are linked to the specific traces, so you can see exactly which retrieval steps and model calls produced high or low quality outputs. Learn more about RAG in the AI agent glossary.
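A toy, set-based illustration of what the first two ratios measure (real evaluators use LLM-as-judge scoring over text, not set arithmetic; document IDs here are made up):

```python
def context_metrics(retrieved: set, relevant: set, used: set) -> dict:
    """Set-based sketch of two RAG retrieval metrics."""
    return {
        # precision: how much of what we retrieved was actually used in the answer
        "context_precision": len(used & retrieved) / len(retrieved),
        # recall: how much of the relevant information we managed to retrieve
        "context_recall": len(retrieved & relevant) / len(relevant),
    }

m = context_metrics(retrieved={"d1", "d2", "d3"},
                    relevant={"d1", "d4"},
                    used={"d1"})
print(m)  # context_precision ≈ 0.33, context_recall = 0.5
```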
