# Add Observability to AI Agents with Langfuse
An AI agent that works on your laptop but misbehaves in production is one of the most frustrating engineering problems to debug. Without visibility into what the agent decided, what tools it called, what the LLM returned, and how long each step took, you are flying blind.
Langfuse is an open-source LLM observability platform that gives you distributed tracing, evaluation scores, cost tracking, and session replay for every agent run. This tutorial shows you how to add Langfuse instrumentation to a Python LangChain agent from scratch, interpret the traces, and use evaluations to catch regressions before they hit users.
Before diving in, it helps to understand agent tracing and why trace hierarchies matter for multi-step agent workflows.
## What You'll Learn
- How to install and configure the Langfuse Python SDK
- How to add automatic tracing to LangChain agents with a single callback
- How to create manual spans for custom logic outside LangChain
- How to run LLM-as-a-judge evaluations on agent outputs
- How to set up cost tracking and usage dashboards in production
## Prerequisites
- Python 3.10+
- A Langfuse account (free at langfuse.com) or self-hosted instance
- OpenAI API key
- Familiarity with LangChain agents
## Architecture Overview
Langfuse works through a trace → span hierarchy. When your agent runs:
- A trace is created representing the entire agent interaction
- Each LLM call becomes a generation span with input/output/token counts
- Each tool call becomes a span with input/output/duration
- The whole trace is linked to a session if you pass a `session_id`
The Langfuse dashboard lets you search, filter, and replay any trace. Evaluations attach scores to traces so you can measure quality over time.
## Step 1: Install Langfuse and Configure Credentials
```bash
pip install langfuse==2.57.0 langchain==0.3.0 langchain-openai==0.2.0 \
    langchain-community==0.3.0 python-dotenv==1.0.1
```
Get your keys from the Langfuse dashboard under Settings → API Keys:
```bash
# .env
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com  # or your self-hosted URL
OPENAI_API_KEY=sk-proj-...
```
```python
# config.py
import os

from dotenv import load_dotenv

load_dotenv()

LANGFUSE_PUBLIC_KEY = os.getenv("LANGFUSE_PUBLIC_KEY")
LANGFUSE_SECRET_KEY = os.getenv("LANGFUSE_SECRET_KEY")
LANGFUSE_HOST = os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com")
```
## Step 2: Add the LangChain Callback Handler
The fastest way to get tracing is the `CallbackHandler`. Attach it to any LangChain invocation, and every LLM call, chain step, and tool invocation is traced automatically.
```python
# tracing.py
import os

from langfuse.callback import CallbackHandler


def get_langfuse_handler(
    session_id: str | None = None, user_id: str | None = None
) -> CallbackHandler:
    """Create a Langfuse callback handler for a single agent run."""
    return CallbackHandler(
        public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
        secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
        host=os.getenv("LANGFUSE_HOST"),
        session_id=session_id,
        user_id=user_id,
    )
```
```python
# agent.py
import uuid

from langchain.agents import AgentExecutor, create_react_agent
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

from tracing import get_langfuse_handler

# create_react_agent requires the prompt to reference {tools}, {tool_names},
# and {agent_scratchpad}; it raises a ValueError otherwise.
REACT_PROMPT = PromptTemplate.from_template(
    "Answer the question using tools as needed.\n\n"
    "You have access to the following tools:\n{tools}\n\n"
    "Use this format:\n"
    "Thought: reason about what to do next\n"
    "Action: one of [{tool_names}]\n"
    "Action Input: the input to the action\n"
    "Observation: the action result\n"
    "... (repeat Thought/Action/Action Input/Observation as needed)\n"
    "Final Answer: the answer to the original question\n\n"
    "Question: {input}\nThought: {agent_scratchpad}"
)


def build_agent() -> AgentExecutor:
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    tools = [DuckDuckGoSearchRun()]
    agent = create_react_agent(llm, tools, REACT_PROMPT)
    return AgentExecutor(
        agent=agent,
        tools=tools,
        verbose=False,
        max_iterations=5,
        handle_parsing_errors=True,
    )


def run_agent(question: str, user_id: str = "anonymous") -> str:
    """Run the agent with Langfuse tracing enabled."""
    executor = build_agent()
    session_id = str(uuid.uuid4())
    handler = get_langfuse_handler(session_id=session_id, user_id=user_id)
    result = executor.invoke(
        {"input": question},
        config={"callbacks": [handler]},  # <-- single line to enable tracing
    )
    handler.flush()  # ensure all spans are sent before the function returns
    return result["output"]


# Test it
if __name__ == "__main__":
    answer = run_agent("What is LangGraph used for?", user_id="dev-test")
    print(answer)
```
After running this, open the Langfuse dashboard and you will see a trace with all LLM calls, tool invocations, latencies, and token counts.
## Step 3: Manual Spans for Custom Logic
When you have business logic outside LangChain — preprocessing, postprocessing, database lookups — you can create manual spans to capture them in the same trace.
```python
# manual_tracing.py
import os
import time

from langfuse import Langfuse

from agent import build_agent

langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host=os.getenv("LANGFUSE_HOST"),
)


def run_agent_with_pipeline(user_question: str, user_id: str) -> dict:
    # Create a root trace for the entire pipeline
    trace = langfuse.trace(
        name="full-agent-pipeline",
        user_id=user_id,
        input={"question": user_question},
    )

    # Span 1: input validation and preprocessing
    preprocess_span = trace.span(
        name="input-preprocessing",
        input={"raw_question": user_question},
    )
    cleaned_question = user_question.strip().lower()
    preprocess_span.end(output={"cleaned_question": cleaned_question})

    # Span 2: the LangChain agent, with a callback handler scoped to this trace
    handler = trace.get_langchain_handler()
    executor = build_agent()
    start = time.time()
    result = executor.invoke(
        {"input": cleaned_question}, config={"callbacks": [handler]}
    )
    latency_ms = (time.time() - start) * 1000

    # Span 3: postprocessing
    post_span = trace.span(
        name="output-postprocessing",
        input={"raw_output": result["output"]},
    )
    final_answer = result["output"].strip()
    post_span.end(output={"final_answer": final_answer})

    # Update the root trace with the final output and metadata
    trace.update(
        output={"answer": final_answer},
        metadata={"latency_ms": latency_ms},
    )
    langfuse.flush()
    return {"answer": final_answer, "trace_id": trace.id}
```
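Operational metrics such as the `latency_ms` measured in the pipeline can be posted as scores too, so latency sits next to quality scores in the dashboard. A minimal sketch; the linear decay curve and the 2-second target are our own choices, not a Langfuse convention:

```python
def latency_score(latency_ms: float, target_ms: float = 2000.0) -> float:
    """Map latency onto 0-1: 1.0 at or below the target, then decaying
    linearly, reaching 0.0 at five times the target."""
    if latency_ms <= target_ms:
        return 1.0
    return max(0.0, 1.0 - (latency_ms - target_ms) / (4 * target_ms))

# Posting it on the trace (using the Langfuse client from above):
# langfuse.score(
#     trace_id=trace.id,
#     name="latency",
#     value=latency_score(latency_ms),
#     comment=f"{latency_ms:.0f} ms",
# )
```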
## Step 4: LLM-as-a-Judge Evaluations
Tracing tells you what happened. Evaluations tell you how good it was. Langfuse supports posting scores to any trace. Here we use an LLM to score answer quality.
```python
# evaluator.py
import os

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langfuse import Langfuse
from pydantic import BaseModel, Field

langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host=os.getenv("LANGFUSE_HOST"),
)


class QualityScore(BaseModel):
    score: float = Field(ge=0.0, le=1.0, description="Answer quality from 0 to 1")
    reasoning: str = Field(description="One sentence explaining the score")


eval_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
eval_chain = ChatPromptTemplate.from_messages([
    ("system", "Rate the quality of this AI agent answer from 0 to 1. "
               "1.0 = accurate, complete, concise. 0.0 = wrong or incoherent."),
    ("human", "Question: {question}\nAnswer: {answer}"),
]) | eval_llm.with_structured_output(QualityScore)


def evaluate_and_score(trace_id: str, question: str, answer: str) -> QualityScore:
    """Run the evaluation and post the score back to the Langfuse trace."""
    result = eval_chain.invoke({"question": question, "answer": answer})
    langfuse.score(
        trace_id=trace_id,
        name="answer-quality",
        value=result.score,
        comment=result.reasoning,
    )
    return result


# After running your agent:
# result = run_agent_with_pipeline("What is LangGraph?", "user-123")
# evaluate_and_score(result["trace_id"], "What is LangGraph?", result["answer"])
```
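To catch regressions before they hit users, you can run `evaluate_and_score` over a fixed golden set of questions in CI and gate the build on the aggregate. A minimal sketch; the function name and both thresholds are illustrative, not Langfuse defaults:

```python
def passes_quality_gate(
    scores: list[float],
    min_mean: float = 0.8,
    min_each: float = 0.5,
) -> bool:
    """Fail if average quality drops below min_mean, or if any single
    answer scores below min_each (one disaster can hide in a good mean)."""
    if not scores:
        return False
    mean = sum(scores) / len(scores)
    return mean >= min_mean and min(scores) >= min_each

# In a CI test, after collecting scores from evaluate_and_score():
# assert passes_quality_gate(scores), "Agent quality regression detected"
```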
## Step 5: Cost Tracking and Dashboards
Langfuse automatically parses token usage from OpenAI responses and calculates cost based on current model pricing. You can filter the dashboard by:
- Date range — see daily cost trends
- User ID — identify your most expensive users
- Model — compare gpt-4o vs gpt-4o-mini cost profiles
- Session — audit a specific conversation
To add custom cost metadata when using non-OpenAI models:
```python
# For models where Langfuse cannot auto-detect pricing
generation = trace.generation(
    name="custom-llm-call",
    model="my-fine-tuned-model",
    input={"prompt": "..."},
    usage={
        "input": 450,   # prompt tokens
        "output": 120,  # completion tokens
        "unit": "TOKENS",
    },
)
generation.end(output={"response": "..."})
```
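It can be useful to sanity-check the dashboard's cost numbers by hand: tokens divided by one million, times the per-million-token price. A sketch with illustrative prices; the figures in `PRICING` are assumptions to verify against the provider's current pricing page:

```python
# Illustrative per-million-token USD prices -- verify against current pricing.
PRICING = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate a single call's USD cost from its token counts."""
    prices = PRICING[model]
    return (
        (input_tokens / 1_000_000) * prices["input"]
        + (output_tokens / 1_000_000) * prices["output"]
    )
```

For the 450-input / 120-output call above, this yields about $0.00014 on gpt-4o-mini, which should match what the dashboard reports for an equivalent OpenAI call.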
## Step 6: Production Setup with Self-Hosting
For production workloads with sensitive data, self-host Langfuse with Docker Compose:
```bash
git clone https://github.com/langfuse/langfuse.git
cd langfuse
cp .env.prod.example .env
# Edit .env with your DATABASE_URL, NEXTAUTH_SECRET, etc.
docker compose up -d
```
Point your SDK at the self-hosted instance:
```python
langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://langfuse.your-domain.com",
)
```
For containerized agents, review the Docker deployment guide to see how to pass Langfuse environment variables into your containers. You can also integrate Langfuse tracing into the Agentic RAG system to track retrieval quality alongside generation quality.
## What's Next
- Apply these observability patterns to a more complex LangGraph multi-agent system
- Explore agent tracing concepts for deeper background on why trace hierarchies matter
- Review the AI agent testing guide to combine automated tests with Langfuse evaluation scores
- See the Langfuse directory entry for a full feature comparison with other observability tools
- Add human review workflows on top of Langfuse scores using human-in-the-loop patterns