Intermediate • 28 min read

Monitor AI Agents with Langfuse (2026)

Learn how to instrument your Python AI agents with Langfuse for full distributed tracing, LLM evaluation, cost tracking, and production monitoring — so you know exactly what your agents are doing at all times.

By AI Agents Guide Team • February 28, 2026

Table of Contents

  1. What You'll Learn
  2. Prerequisites
  3. Architecture Overview
  4. Step 1: Install Langfuse and Configure Credentials
  5. Step 2: Add the LangChain Callback Handler
  6. Step 3: Manual Spans for Custom Logic
  7. Step 4: LLM-as-a-Judge Evaluations
  8. Step 5: Cost Tracking and Dashboards
  9. Step 6: Production Setup with Self-Hosting
  10. What's Next

Add Observability to AI Agents with Langfuse

An AI agent that works on your laptop but misbehaves in production is one of the most frustrating engineering problems to debug. Without visibility into what the agent decided, what tools it called, what the LLM returned, and how long each step took, you are flying blind.

Langfuse is an open-source LLM observability platform that gives you distributed tracing, evaluation scores, cost tracking, and session replay for every agent run. This tutorial shows you how to add Langfuse instrumentation to a Python LangChain agent from scratch, interpret the traces, and use evaluations to catch regressions before they hit users.

Before diving in, it helps to understand agent tracing as a concept — in particular, why trace hierarchies matter for multi-step agent workflows.

What You'll Learn

  • How to install and configure the Langfuse Python SDK
  • How to add automatic tracing to LangChain agents with a single callback
  • How to create manual spans for custom logic outside LangChain
  • How to run LLM-as-a-judge evaluations on agent outputs
  • How to set up cost tracking and usage dashboards in production

Prerequisites

  • Python 3.10+
  • A Langfuse account (free at langfuse.com ↗) or self-hosted instance
  • OpenAI API key
  • Familiarity with LangChain agents

Architecture Overview

Langfuse works through a trace → span hierarchy. When your agent runs:

  1. A trace is created representing the entire agent interaction
  2. Each LLM call becomes a generation span with input/output/token counts
  3. Each tool call becomes a span with input/output/duration
  4. The whole trace is linked to a session if you pass a session_id

The Langfuse dashboard lets you search, filter, and replay any trace. Evaluations attach scores to traces so you can measure quality over time.

Step 1: Install Langfuse and Configure Credentials

pip install langfuse==2.57.0 langchain==0.3.0 langchain-openai==0.2.0 \
    langchain-community==0.3.0 python-dotenv==1.0.1

Get your keys from the Langfuse dashboard under Settings → API Keys:

# .env
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com  # or your self-hosted URL
OPENAI_API_KEY=sk-proj-...

# config.py
import os
from dotenv import load_dotenv

load_dotenv()

LANGFUSE_PUBLIC_KEY = os.getenv("LANGFUSE_PUBLIC_KEY")
LANGFUSE_SECRET_KEY = os.getenv("LANGFUSE_SECRET_KEY")
LANGFUSE_HOST = os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com")

Step 2: Add the LangChain Callback Handler

The fastest way to get tracing is the CallbackHandler. Add it to any LangChain call and every LLM call, chain step, and tool invocation is automatically traced.

# tracing.py
from langfuse.callback import CallbackHandler
import os

def get_langfuse_handler(session_id: str | None = None, user_id: str | None = None) -> CallbackHandler:
    """Create a Langfuse callback handler for a single agent run."""
    return CallbackHandler(
        public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
        secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
        host=os.getenv("LANGFUSE_HOST"),
        session_id=session_id,
        user_id=user_id,
    )

# agent.py
from langchain_openai import ChatOpenAI
from langchain.agents import create_react_agent, AgentExecutor
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_core.prompts import PromptTemplate
from tracing import get_langfuse_handler
import uuid

def build_agent() -> AgentExecutor:
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    tools = [DuckDuckGoSearchRun()]
    # create_react_agent requires {tools}, {tool_names}, and {agent_scratchpad}
    # in the prompt; the ReAct format instructions keep the output parseable.
    prompt = PromptTemplate.from_template(
        "Answer the question using the tools below as needed.\n\n"
        "Tools:\n{tools}\n\n"
        "Use this format:\n"
        "Thought: reason about what to do next\n"
        "Action: one of [{tool_names}]\n"
        "Action Input: the input to the action\n"
        "Observation: the action's result\n"
        "... (Thought/Action/Observation can repeat)\n"
        "Final Answer: the final answer to the question\n\n"
        "Question: {input}\n\nThought: {agent_scratchpad}"
    )
    agent = create_react_agent(llm, tools, prompt)
    return AgentExecutor(
        agent=agent, tools=tools, verbose=False,
        max_iterations=5, handle_parsing_errors=True,
    )

def run_agent(question: str, user_id: str = "anonymous") -> str:
    """Run the agent with Langfuse tracing enabled."""
    executor = build_agent()
    session_id = str(uuid.uuid4())
    handler = get_langfuse_handler(session_id=session_id, user_id=user_id)

    result = executor.invoke(
        {"input": question},
        config={"callbacks": [handler]},  # <-- single line to enable tracing
    )
    handler.flush()  # ensure all spans are sent before function returns
    return result["output"]

# Test it
if __name__ == "__main__":
    answer = run_agent("What is LangGraph used for?", user_id="dev-test")
    print(answer)

After running this, open the Langfuse dashboard and you will see a trace with all LLM calls, tool invocations, latencies, and token counts.

Step 3: Manual Spans for Custom Logic

When you have business logic outside LangChain — preprocessing, postprocessing, database lookups — you can create manual spans to capture them in the same trace.

# manual_tracing.py
from langfuse import Langfuse
import os
import time

langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host=os.getenv("LANGFUSE_HOST"),
)

def run_agent_with_pipeline(user_question: str, user_id: str) -> dict:
    # Create a root trace for the entire pipeline
    trace = langfuse.trace(
        name="full-agent-pipeline",
        user_id=user_id,
        input={"question": user_question},
    )

    # Span 1: input validation and preprocessing
    preprocess_span = trace.span(
        name="input-preprocessing",
        input={"raw_question": user_question},
    )
    cleaned_question = user_question.strip().lower()
    preprocess_span.end(output={"cleaned_question": cleaned_question})

    # Span 2: the LangChain agent, with a callback handler scoped to this trace
    handler = trace.get_langchain_handler()

    from agent import build_agent
    executor = build_agent()
    start = time.time()
    result = executor.invoke({"input": cleaned_question}, config={"callbacks": [handler]})
    latency_ms = (time.time() - start) * 1000

    # Span 3: postprocessing
    post_span = trace.span(
        name="output-postprocessing",
        input={"raw_output": result["output"]},
    )
    final_answer = result["output"].strip()
    post_span.end(output={"final_answer": final_answer})

    # Update the root trace with the final output and metadata
    trace.update(
        output={"answer": final_answer},
        metadata={"latency_ms": latency_ms},
    )
    langfuse.flush()

    return {"answer": final_answer, "trace_id": trace.id}

Langfuse trace view showing nested spans for preprocessing, LLM calls, tool use, and postprocessing

Step 4: LLM-as-a-Judge Evaluations

Tracing tells you what happened. Evaluations tell you how good it was. Langfuse supports posting scores to any trace. Here we use an LLM to score answer quality.

# evaluator.py
from langfuse import Langfuse
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
import os

langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
)

class QualityScore(BaseModel):
    score: float = Field(ge=0.0, le=1.0, description="Answer quality from 0 to 1")
    reasoning: str = Field(description="One sentence explaining the score")

eval_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
eval_chain = ChatPromptTemplate.from_messages([
    ("system", "Rate the quality of this AI agent answer from 0 to 1. "
               "1.0 = accurate, complete, concise. 0.0 = wrong or incoherent."),
    ("human", "Question: {question}\nAnswer: {answer}"),
]) | eval_llm.with_structured_output(QualityScore)

def evaluate_and_score(trace_id: str, question: str, answer: str):
    """Run evaluation and post the score back to the Langfuse trace."""
    result = eval_chain.invoke({"question": question, "answer": answer})

    langfuse.score(
        trace_id=trace_id,
        name="answer-quality",
        value=result.score,
        comment=result.reasoning,
    )
    return result

# After running your agent:
# result = run_agent_with_pipeline("What is LangGraph?", "user-123")
# evaluate_and_score(result["trace_id"], "What is LangGraph?", result["answer"])
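To catch regressions before they ship, you can run the evaluator over a fixed test set and gate your CI on the aggregate score. A minimal sketch — the `quality_gate` helper and the 0.7 threshold are illustrative choices, not part of Langfuse:

```python
from statistics import mean

def quality_gate(scores: list[float], threshold: float = 0.7) -> bool:
    """Return True if the average answer-quality score clears the threshold."""
    if not scores:
        raise ValueError("no scores collected")
    avg = mean(scores)
    print(f"avg quality: {avg:.2f} over {len(scores)} runs")
    return avg >= threshold

# Collect result.score from evaluate_and_score(...) for each test question,
# then fail the build when average quality drops.
assert quality_gate([0.9, 0.8, 0.75]) is True
assert quality_gate([0.5, 0.6, 0.4]) is False
```

Because the same scores are also posted to Langfuse, you get the historical trend in the dashboard alongside the pass/fail signal in CI.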

Step 5: Cost Tracking and Dashboards

Langfuse automatically parses token usage from OpenAI responses and calculates cost based on current model pricing. You can filter the dashboard by:

  • Date range — see daily cost trends
  • User ID — identify your most expensive users
  • Model — compare gpt-4o vs gpt-4o-mini cost profiles
  • Session — audit a specific conversation
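Langfuse computes these numbers for you, but the underlying arithmetic is worth understanding. A rough sketch with illustrative per-million-token prices — treat the numbers as assumptions and check your provider's current pricing:

```python
# Illustrative prices in USD per 1M tokens -- verify against current pricing
PRICING = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o":      {"input": 2.50, "output": 10.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: tokens times per-token price, summed over directions."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

cost = estimate_cost("gpt-4o-mini", input_tokens=450, output_tokens=120)
print(f"${cost:.6f}")  # a single small call costs fractions of a cent
```

Multiply that per-call figure by your request volume and the dashboard's daily cost trends stop being surprising.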

To add custom cost metadata when using non-OpenAI models:

# For models where Langfuse cannot auto-detect pricing
generation = trace.generation(
    name="custom-llm-call",
    model="my-fine-tuned-model",
    input={"prompt": "..."},
    usage={
        "input": 450,       # prompt tokens
        "output": 120,      # completion tokens
        "unit": "TOKENS",
    },
)
generation.end(output={"response": "..."})

Step 6: Production Setup with Self-Hosting

For production workloads with sensitive data, self-host Langfuse with Docker Compose:

git clone https://github.com/langfuse/langfuse.git
cd langfuse
cp .env.prod.example .env
# Edit .env with your DATABASE_URL, NEXTAUTH_SECRET, etc.
docker compose up -d

Point your SDK at the self-hosted instance:

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://langfuse.your-domain.com",
)

For containerized agents, review the Docker deployment guide to see how to pass Langfuse environment variables into your containers. You can also integrate Langfuse tracing into the Agentic RAG system to track retrieval quality alongside generation quality.

What's Next#

  • Apply these observability patterns to a more complex LangGraph multi-agent system
  • Explore agent tracing concepts for deeper background on why trace hierarchies matter
  • Review the AI agent testing guide to combine automated tests with Langfuse evaluation scores
  • See the Langfuse directory entry for a full feature comparison with other observability tools
  • Add human review workflows on top of Langfuse scores using human-in-the-loop patterns
