# Add Observability to AI Agents with Langfuse
An AI agent that works on your laptop but misbehaves in production is one of the most frustrating engineering problems to debug. Without visibility into what the agent decided, what tools it called, what the LLM returned, and how long each step took, you are flying blind.
Langfuse is an open-source LLM observability platform that gives you distributed tracing, evaluation scores, cost tracking, and session replay for every agent run. This tutorial shows you how to add Langfuse instrumentation to a Python LangChain agent from scratch, interpret the traces, and use evaluations to catch regressions before they hit users.
Before diving in, it helps to understand agent tracing and why trace hierarchies matter for multi-step agent workflows.
## What You'll Learn
- How to install and configure the Langfuse Python SDK
- How to add automatic tracing to LangChain agents with a single callback
- How to create manual spans for custom logic outside LangChain
- How to run LLM-as-a-judge evaluations on agent outputs
- How to set up cost tracking and usage dashboards in production
## Prerequisites
- Python 3.10+
- A Langfuse account (free at langfuse.com) or self-hosted instance
- OpenAI API key
- Familiarity with LangChain agents
## Architecture Overview
Langfuse works through a trace → span hierarchy. When your agent runs:
- A trace is created representing the entire agent interaction
- Each LLM call becomes a generation span with input/output/token counts
- Each tool call becomes a span with input/output/duration
- The whole trace is linked to a session if you pass a `session_id`
The Langfuse dashboard lets you search, filter, and replay any trace. Evaluations attach scores to traces so you can measure quality over time.
## Step 1: Install Langfuse and Configure Credentials
```bash
pip install langfuse==2.57.0 langchain==0.3.0 langchain-openai==0.2.0 \
    langchain-community==0.3.0 python-dotenv==1.0.1
```
Get your keys from the Langfuse dashboard under Settings → API Keys:
```bash
# .env
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com  # or your self-hosted URL
OPENAI_API_KEY=sk-proj-...
```
```python
# config.py
import os

from dotenv import load_dotenv

load_dotenv()

LANGFUSE_PUBLIC_KEY = os.getenv("LANGFUSE_PUBLIC_KEY")
LANGFUSE_SECRET_KEY = os.getenv("LANGFUSE_SECRET_KEY")
LANGFUSE_HOST = os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com")
```
## Step 2: Add the LangChain Callback Handler
The fastest way to get tracing is the `CallbackHandler`. Attach it to any LangChain invocation, and every LLM call, chain step, and tool invocation is traced automatically.
```python
# tracing.py
import os

from langfuse.callback import CallbackHandler


def get_langfuse_handler(
    session_id: str | None = None, user_id: str | None = None
) -> CallbackHandler:
    """Create a Langfuse callback handler for a single agent run."""
    return CallbackHandler(
        public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
        secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
        host=os.getenv("LANGFUSE_HOST"),
        session_id=session_id,
        user_id=user_id,
    )
```
```python
# agent.py
import uuid

from langchain.agents import AgentExecutor, create_react_agent
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

from tracing import get_langfuse_handler

# create_react_agent requires the prompt to reference {tools}, {tool_names},
# and {agent_scratchpad}; it raises a ValueError otherwise.
REACT_PROMPT = PromptTemplate.from_template(
    "Answer the question using tools as needed.\n\n"
    "You have access to the following tools:\n{tools}\n\n"
    "Use this format:\n"
    "Thought: reason about what to do next\n"
    "Action: one of [{tool_names}]\n"
    "Action Input: the input to the action\n"
    "Observation: the action result\n"
    "... (repeat Thought/Action/Action Input/Observation as needed)\n"
    "Final Answer: the answer to the original question\n\n"
    "Question: {input}\nThought: {agent_scratchpad}"
)


def build_agent() -> AgentExecutor:
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    tools = [DuckDuckGoSearchRun()]
    agent = create_react_agent(llm, tools, REACT_PROMPT)
    return AgentExecutor(
        agent=agent,
        tools=tools,
        verbose=False,
        max_iterations=5,
        handle_parsing_errors=True,
    )


def run_agent(question: str, user_id: str = "anonymous") -> str:
    """Run the agent with Langfuse tracing enabled."""
    executor = build_agent()
    session_id = str(uuid.uuid4())
    handler = get_langfuse_handler(session_id=session_id, user_id=user_id)
    result = executor.invoke(
        {"input": question},
        config={"callbacks": [handler]},  # <-- single line to enable tracing
    )
    handler.flush()  # ensure all spans are sent before the function returns
    return result["output"]


# Test it
if __name__ == "__main__":
    answer = run_agent("What is LangGraph used for?", user_id="dev-test")
    print(answer)
```
After running this, open the Langfuse dashboard and you will see a trace with all LLM calls, tool invocations, latencies, and token counts.
## Step 3: Manual Spans for Custom Logic
When you have business logic outside LangChain — preprocessing, postprocessing, database lookups — you can create manual spans to capture them in the same trace.
```python
# manual_tracing.py
import os
import time

from langfuse import Langfuse

from agent import build_agent

langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host=os.getenv("LANGFUSE_HOST"),
)


def run_agent_with_pipeline(user_question: str, user_id: str) -> dict:
    # Create a root trace for the entire pipeline
    trace = langfuse.trace(
        name="full-agent-pipeline",
        user_id=user_id,
        input={"question": user_question},
    )

    # Span 1: input validation and preprocessing
    preprocess_span = trace.span(
        name="input-preprocessing",
        input={"raw_question": user_question},
    )
    cleaned_question = user_question.strip().lower()
    preprocess_span.end(output={"cleaned_question": cleaned_question})

    # Span 2: the LangChain agent, with a callback handler scoped to this trace
    handler = trace.get_langchain_handler()
    executor = build_agent()
    start = time.time()
    result = executor.invoke(
        {"input": cleaned_question}, config={"callbacks": [handler]}
    )
    latency_ms = (time.time() - start) * 1000

    # Span 3: postprocessing
    post_span = trace.span(
        name="output-postprocessing",
        input={"raw_output": result["output"]},
    )
    final_answer = result["output"].strip()
    post_span.end(output={"final_answer": final_answer})

    # Update the root trace with the final output and metadata
    trace.update(
        output={"answer": final_answer},
        metadata={"latency_ms": latency_ms},
    )
    langfuse.flush()
    return {"answer": final_answer, "trace_id": trace.id}
```
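Operational metrics such as the `latency_ms` measured in the pipeline can be posted as scores too, so latency sits next to quality scores in the dashboard. A minimal sketch; the linear decay curve and the 2-second target are our own choices, not a Langfuse convention:

```python
def latency_score(latency_ms: float, target_ms: float = 2000.0) -> float:
    """Map latency onto 0-1: 1.0 at or below the target, then decaying
    linearly, reaching 0.0 at five times the target."""
    if latency_ms <= target_ms:
        return 1.0
    return max(0.0, 1.0 - (latency_ms - target_ms) / (4 * target_ms))

# Posting it on the trace (using the Langfuse client from above):
# langfuse.score(
#     trace_id=trace.id,
#     name="latency",
#     value=latency_score(latency_ms),
#     comment=f"{latency_ms:.0f} ms",
# )
```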
## Step 4: LLM-as-a-Judge Evaluations
Tracing tells you what happened. Evaluations tell you how good it was. Langfuse supports posting scores to any trace. Here we use an LLM to score answer quality.
```python
# evaluator.py
import os

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langfuse import Langfuse
from pydantic import BaseModel, Field

langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host=os.getenv("LANGFUSE_HOST"),
)


class QualityScore(BaseModel):
    score: float = Field(ge=0.0, le=1.0, description="Answer quality from 0 to 1")
    reasoning: str = Field(description="One sentence explaining the score")


eval_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
eval_chain = ChatPromptTemplate.from_messages([
    ("system", "Rate the quality of this AI agent answer from 0 to 1. "
               "1.0 = accurate, complete, concise. 0.0 = wrong or incoherent."),
    ("human", "Question: {question}\nAnswer: {answer}"),
]) | eval_llm.with_structured_output(QualityScore)


def evaluate_and_score(trace_id: str, question: str, answer: str) -> QualityScore:
    """Run the evaluation and post the score back to the Langfuse trace."""
    result = eval_chain.invoke({"question": question, "answer": answer})
    langfuse.score(
        trace_id=trace_id,
        name="answer-quality",
        value=result.score,
        comment=result.reasoning,
    )
    return result


# After running your agent:
# result = run_agent_with_pipeline("What is LangGraph?", "user-123")
# evaluate_and_score(result["trace_id"], "What is LangGraph?", result["answer"])
```
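To catch regressions before they hit users, you can run `evaluate_and_score` over a fixed golden set of questions in CI and gate the build on the aggregate. A minimal sketch; the function name and both thresholds are illustrative, not Langfuse defaults:

```python
def passes_quality_gate(
    scores: list[float],
    min_mean: float = 0.8,
    min_each: float = 0.5,
) -> bool:
    """Fail if average quality drops below min_mean, or if any single
    answer scores below min_each (one disaster can hide in a good mean)."""
    if not scores:
        return False
    mean = sum(scores) / len(scores)
    return mean >= min_mean and min(scores) >= min_each

# In a CI test, after collecting scores from evaluate_and_score():
# assert passes_quality_gate(scores), "Agent quality regression detected"
```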
## Step 5: Cost Tracking and Dashboards
Langfuse automatically parses token usage from OpenAI responses and calculates cost based on current model pricing. You can filter the dashboard by:
- Date range — see daily cost trends
- User ID — identify your most expensive users
- Model — compare gpt-4o vs gpt-4o-mini cost profiles
- Session — audit a specific conversation
To add custom cost metadata when using non-OpenAI models:
```python
# For models where Langfuse cannot auto-detect pricing
generation = trace.generation(
    name="custom-llm-call",
    model="my-fine-tuned-model",
    input={"prompt": "..."},
    usage={
        "input": 450,   # prompt tokens
        "output": 120,  # completion tokens
        "unit": "TOKENS",
    },
)
generation.end(output={"response": "..."})
```
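It can be useful to sanity-check the dashboard's cost numbers by hand: tokens divided by one million, times the per-million-token price. A sketch with illustrative prices; the figures in `PRICING` are assumptions to verify against the provider's current pricing page:

```python
# Illustrative per-million-token USD prices -- verify against current pricing.
PRICING = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate a single call's USD cost from its token counts."""
    prices = PRICING[model]
    return (
        (input_tokens / 1_000_000) * prices["input"]
        + (output_tokens / 1_000_000) * prices["output"]
    )
```

For the 450-input / 120-output call above, this yields about $0.00014 on gpt-4o-mini, which should match what the dashboard reports for an equivalent OpenAI call.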
## Step 6: Production Setup with Self-Hosting
For production workloads with sensitive data, self-host Langfuse with Docker Compose:
```bash
git clone https://github.com/langfuse/langfuse.git
cd langfuse
cp .env.prod.example .env
# Edit .env with your DATABASE_URL, NEXTAUTH_SECRET, etc.
docker compose up -d
```
Point your SDK at the self-hosted instance:
```python
langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://langfuse.your-domain.com",
)
```
For containerized agents, review the Docker deployment guide to see how to pass Langfuse environment variables into your containers. You can also integrate Langfuse tracing into the Agentic RAG system to track retrieval quality alongside generation quality.
## What's Next
- Apply these observability patterns to a more complex LangGraph multi-agent system
- Explore agent tracing concepts for deeper background on why trace hierarchies matter
- Review the AI agent testing guide to combine automated tests with Langfuse evaluation scores
- See the Langfuse directory entry for a full feature comparison with other observability tools
- Add human review workflows on top of Langfuse scores using human-in-the-loop patterns