Advanced · 35 min read

Build an Agentic RAG with LangChain

Learn how to build an Agentic RAG system that goes beyond static retrieval with query routing, multi-step retrieval loops, and self-reflection to improve answer quality. Master the patterns that make RAG truly agentic.

By AI Agents Guide Team · February 28, 2026

Table of Contents

  1. What You'll Learn
  2. Prerequisites
  3. Architecture Overview
  4. Step 1: Setup and Dependencies
  5. Step 2: Build the Vector Store and Retriever
  6. Step 3: Query Router
  7. Step 4: Document Grader
  8. Step 5: Generator with Self-Reflection
  9. Step 6: Wire Everything into a LangGraph Workflow
  10. Step 7: Testing and Evaluation
  11. Production Considerations
  12. What's Next

Build an Agentic RAG System with LangChain and Python

Standard RAG pipelines follow a fixed path: embed the question, retrieve top-k chunks, stuff them into a prompt, and return an answer. This works for simple factual questions, but it breaks down fast when questions are ambiguous, require synthesizing information from multiple sources, or need verification. Agentic RAG solves this by treating retrieval as a decision — not a deterministic step.

In this tutorial you will build an Agentic RAG system that dynamically routes queries, iterates retrieval when the first pass falls short, grades its own retrieved documents, and reflects on whether the final answer actually addresses the original question. The result is a system that handles hard real-world queries with significantly higher reliability than naive RAG.

What You'll Learn#

  • How to implement query routing to direct questions to the right knowledge source
  • How to grade retrieved documents for relevance before generating an answer
  • How to build a self-reflection loop that detects hallucination and re-queries
  • How to connect all components into a coherent LangGraph workflow
  • How to evaluate the quality of your agentic RAG pipeline

Prerequisites#

  • Python 3.10+
  • OpenAI API key (or any LangChain-compatible LLM)
  • Basic familiarity with RAG concepts for AI agents
  • Understanding of what AI agents are and how they differ from pipelines

Architecture Overview#

The system has four stages that form a conditional loop:

  1. Router — Classifies the incoming question and directs it to a vectorstore, a web search tool, or a structured database query.
  2. Retriever — Fetches candidate documents from the selected source.
  3. Grader — Evaluates each retrieved document for relevance to the question. Irrelevant documents are filtered; if too few pass, the system re-routes.
  4. Generator with Self-Reflection — Produces an answer, then checks whether the answer is grounded in the retrieved context and actually resolves the question. If not, it loops.

This loop can run up to a configurable maximum number of iterations before falling back to a best-effort answer.
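Before wiring in any LLM calls, the control flow above can be sketched in plain Python. The stub functions below are illustrative placeholders for the LLM-backed components built in the steps that follow; only the loop structure (route, retrieve, grade, generate, retry up to a cap) reflects the architecture:

```python
# Minimal sketch of the agentic RAG control loop. All functions are stubs
# standing in for the LLM-backed components built in the following steps.
MAX_ITERATIONS = 3  # configurable cap before falling back

def route(question: str) -> str:
    return "vectorstore"  # stub: the real router is an LLM classifier

def retrieve(question: str, source: str) -> list[str]:
    return ["Agents combine an LLM with tools and a control loop."]  # stub

def grade(question: str, docs: list[str]) -> list[str]:
    return [d for d in docs if "agent" in d.lower()]  # stub relevance filter

def generate(question: str, docs: list[str]) -> tuple[str, bool]:
    answer = f"Based on {len(docs)} document(s): ..."
    grounded_and_resolved = bool(docs)  # stub self-reflection verdict
    return answer, grounded_and_resolved

def agentic_rag(question: str) -> str:
    answer = ""
    for _ in range(MAX_ITERATIONS):
        source = route(question)
        docs = grade(question, retrieve(question, source))
        answer, ok = generate(question, docs)
        if ok:
            return answer
    return answer  # best-effort fallback after the iteration cap

print(agentic_rag("What is an AI agent?"))
```

The real system replaces each stub with an LLM call, but the shape of the loop stays the same.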

Step 1: Setup and Dependencies#

pip install langchain==0.3.0 langchain-openai==0.2.0 langchain-community==0.3.0 \
    langgraph==0.2.0 chromadb==0.5.0 tiktoken==0.7.0 python-dotenv==1.0.1

Create a .env file:

OPENAI_API_KEY=sk-...
TAVILY_API_KEY=tvly-...  # optional, for web search fallback

Then add a small config module:

# config.py
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
MAX_RETRIEVAL_ITERATIONS = 3
TOP_K_DOCUMENTS = 5
RELEVANCE_THRESHOLD = 0.7

Step 2: Build the Vector Store and Retriever#

Start by indexing your documents. In production this would be a persistent Chroma or Pinecone instance. Here we index a small set of documents to demonstrate the flow.

# vectorstore.py
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader

def build_vectorstore(urls: list[str]) -> Chroma:
    """Load web pages, chunk them, and index into Chroma."""
    loader = WebBaseLoader(urls)
    docs = loader.load()

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
    )
    splits = splitter.split_documents(docs)

    embeddings = OpenAIEmbeddings()
    vectorstore = Chroma.from_documents(
        documents=splits,
        embedding=embeddings,
        collection_name="agentic_rag_demo",
    )
    return vectorstore

# Example usage
SAMPLE_URLS = [
    "https://ai-agents-guide.com/glossary/ai-agents/",
    "https://ai-agents-guide.com/tutorials/introduction-to-rag-for-ai-agents/",
]
vectorstore = build_vectorstore(SAMPLE_URLS)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
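The chunk_size and chunk_overlap parameters control how the splitter windows the text: consecutive chunks share chunk_overlap characters, so a sentence cut at one boundary still appears whole in a neighboring chunk. A stdlib sketch of the character-window idea (the real RecursiveCharacterTextSplitter additionally prefers breaking on separators like paragraph and sentence boundaries):

```python
def window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking with overlap. The real splitter also tries
    to break on separators such as paragraph breaks and sentence ends."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = window_chunks("x" * 1200, chunk_size=500, chunk_overlap=50)
print(len(chunks), [len(c) for c in chunks])  # 3 chunks of 500, 500, 300 characters
```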

Step 3: Query Router#

The router decides where a question should go. It uses an LLM with structured output to classify the query.

# router.py
from typing import Literal
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

class RouteDecision(BaseModel):
    source: Literal["vectorstore", "web_search", "direct_answer"] = Field(
        description="Which knowledge source to route the query to"
    )
    reasoning: str = Field(description="One sentence explaining the routing decision")

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_llm = llm.with_structured_output(RouteDecision)

ROUTER_PROMPT = ChatPromptTemplate.from_messages([
    ("system", """You are a query router for a RAG system about AI agents.
Route to:
- 'vectorstore': questions about AI agent concepts, frameworks, tutorials in the knowledge base
- 'web_search': questions requiring real-time or recent information
- 'direct_answer': simple factual questions the LLM can answer without retrieval

Return a JSON with 'source' and 'reasoning'."""),
    ("human", "Question: {question}"),
])

router_chain = ROUTER_PROMPT | structured_llm

def route_query(question: str) -> RouteDecision:
    return router_chain.invoke({"question": question})
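Under the hood, with_structured_output has the model emit JSON matching the RouteDecision schema and validates it before your code sees it. A stdlib sketch of that validation step, using a hypothetical raw model reply and a plain dataclass in place of the Pydantic model:

```python
import json
from dataclasses import dataclass

ALLOWED_SOURCES = {"vectorstore", "web_search", "direct_answer"}

@dataclass
class RouteDecision:
    source: str
    reasoning: str

def parse_route(raw: str) -> RouteDecision:
    """Validate a raw JSON routing reply, roughly as structured output would."""
    data = json.loads(raw)
    if data["source"] not in ALLOWED_SOURCES:
        raise ValueError(f"invalid route: {data['source']!r}")
    return RouteDecision(source=data["source"], reasoning=data["reasoning"])

decision = parse_route('{"source": "vectorstore", "reasoning": "Concept question."}')
print(decision.source)  # vectorstore
```

Constraining source to a closed set is what lets the graph branch on it safely later.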

Step 4: Document Grader#

Before generating an answer, grade each retrieved document. This prevents irrelevant context from polluting the generation step.

# grader.py
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents import Document

from config import RELEVANCE_THRESHOLD

class GradeResult(BaseModel):
    relevant: bool = Field(description="True if the document is relevant to the question")
    score: float = Field(ge=0.0, le=1.0, description="Relevance score from 0 to 1")

grader_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
structured_grader = grader_llm.with_structured_output(GradeResult)

GRADER_PROMPT = ChatPromptTemplate.from_messages([
    ("system", """You are grading the relevance of a retrieved document to a question.
Return relevant=True and a high score only if the document contains information
that directly helps answer the question. Be strict."""),
    ("human", "Question: {question}\n\nDocument:\n{document}"),
])

grader_chain = GRADER_PROMPT | structured_grader

def grade_documents(question: str, documents: list[Document]) -> list[Document]:
    """Keep only documents graded relevant at or above the configured threshold."""
    relevant_docs = []
    for doc in documents:
        result = grader_chain.invoke({
            "question": question,
            "document": doc.page_content,
        })
        if result.relevant and result.score >= RELEVANCE_THRESHOLD:
            doc.metadata["relevance_score"] = result.score
            relevant_docs.append(doc)
    return relevant_docs
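The filtering policy itself can be exercised without any LLM calls by stubbing the grader's verdicts. This sketch shows how the threshold drops weak documents (the documents and scores are made up for illustration):

```python
def filter_by_grade(graded: list[tuple[str, bool, float]], threshold: float = 0.7) -> list[str]:
    """Keep documents a (stubbed) grader marked relevant with score >= threshold."""
    return [doc for doc, relevant, score in graded if relevant and score >= threshold]

graded = [
    ("Agents use tools to act.", True, 0.9),
    ("Unrelated marketing copy.", False, 0.1),
    ("Tangentially mentions agents.", True, 0.5),
]
print(filter_by_grade(graded))  # only the first document passes
```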

[Diagram: the agentic RAG loop with routing, grading, and self-reflection stages]

Step 5: Generator with Self-Reflection#

The generator produces an answer and then runs two checks: a hallucination check (is the answer grounded in the retrieved context?) and a resolution check (does the answer actually answer the question?).

# generator.py
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents import Document

# --- Answer Generator ---
GENERATE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", """You are an assistant answering questions about AI agents.
Use only the provided context. If the context is insufficient, say so explicitly.

Context:
{context}"""),
    ("human", "{question}"),
])

gen_llm = ChatOpenAI(model="gpt-4o", temperature=0)
generate_chain = GENERATE_PROMPT | gen_llm

# --- Hallucination Checker ---
class HallucinationResult(BaseModel):
    grounded: bool = Field(description="True if the answer is supported by the context")

hallucination_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
hallucination_chain = ChatPromptTemplate.from_messages([
    ("system", "Does this answer rely only on the provided context? Answer grounded=True or False."),
    ("human", "Context:\n{context}\n\nAnswer:\n{answer}"),
]) | hallucination_llm.with_structured_output(HallucinationResult)

# --- Resolution Checker ---
class ResolutionResult(BaseModel):
    resolved: bool = Field(description="True if the answer fully addresses the question")

resolution_chain = ChatPromptTemplate.from_messages([
    ("system", "Does this answer fully and correctly resolve the original question?"),
    ("human", "Question: {question}\n\nAnswer: {answer}"),
]) | hallucination_llm.with_structured_output(ResolutionResult)

def generate_and_reflect(question: str, documents: list[Document]) -> dict:
    context = "\n\n".join(doc.page_content for doc in documents)
    answer = generate_chain.invoke({"question": question, "context": context})
    answer_text = answer.content

    hallucination = hallucination_chain.invoke({
        "context": context,
        "answer": answer_text,
    })
    resolution = resolution_chain.invoke({
        "question": question,
        "answer": answer_text,
    })

    return {
        "answer": answer_text,
        "grounded": hallucination.grounded,
        "resolved": resolution.resolved,
    }
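The two verdicts combine into a single retry decision: any failed check triggers another pass. A small truth-table helper makes the policy explicit (the function name is illustrative; the workflow in the next step applies the same rule inline):

```python
def reflection_verdict(grounded: bool, resolved: bool) -> str:
    """Accept only when the answer is grounded in context AND resolves the question."""
    return "accept" if grounded and resolved else "retry"

for g in (True, False):
    for r in (True, False):
        print(f"grounded={g} resolved={r} -> {reflection_verdict(g, r)}")
```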

Step 6: Wire Everything into a LangGraph Workflow#

# workflow.py
from typing import TypedDict
from langgraph.graph import StateGraph, END
from langchain_core.documents import Document
from config import MAX_RETRIEVAL_ITERATIONS
from router import route_query
from grader import grade_documents
from generator import generate_and_reflect
from vectorstore import retriever

class AgentState(TypedDict):
    question: str
    documents: list[Document]
    answer: str
    iterations: int
    route: str
    grounded: bool
    resolved: bool

def route_node(state: AgentState) -> AgentState:
    decision = route_query(state["question"])
    return {**state, "route": decision.source}

def retrieve_node(state: AgentState) -> AgentState:
    if state["route"] == "vectorstore":
        docs = retriever.invoke(state["question"])
    else:
        # Fallback to vectorstore for this demo; swap with Tavily in production
        docs = retriever.invoke(state["question"])
    return {**state, "documents": docs}

def grade_node(state: AgentState) -> AgentState:
    relevant = grade_documents(state["question"], state["documents"])
    return {**state, "documents": relevant}

def generate_node(state: AgentState) -> AgentState:
    result = generate_and_reflect(state["question"], state["documents"])
    return {
        **state,
        "answer": result["answer"],
        "iterations": state.get("iterations", 0) + 1,
        "grounded": result["grounded"],
        "resolved": result["resolved"],
    }

def should_retry(state: AgentState) -> str:
    # Stop at the configured cap; otherwise retry until grounded and resolved
    if state.get("iterations", 0) >= MAX_RETRIEVAL_ITERATIONS:
        return "end"
    if not state.get("grounded") or not state.get("resolved"):
        return "retry"
    return "end"

# Build graph
graph = StateGraph(AgentState)
graph.add_node("route", route_node)
graph.add_node("retrieve", retrieve_node)
graph.add_node("grade", grade_node)
graph.add_node("generate", generate_node)

graph.set_entry_point("route")
graph.add_edge("route", "retrieve")
graph.add_edge("retrieve", "grade")
graph.add_edge("grade", "generate")
graph.add_conditional_edges("generate", should_retry, {
    "retry": "retrieve",
    "end": END,
})

app = graph.compile()

# Run it
result = app.invoke({
    "question": "What is the difference between a ReAct agent and a RAG pipeline?",
    "documents": [],
    "answer": "",
    "iterations": 0,
    "route": "",
})
print(result["answer"])

Step 7: Testing and Evaluation#

Test your pipeline with queries that stress-test each branch:

# tests/test_agentic_rag.py
import pytest
from workflow import app

TEST_CASES = [
    {
        "question": "What is an AI agent?",
        "expected_keywords": ["agent", "tool", "action"],
    },
    {
        "question": "How does LangGraph differ from LangChain?",
        "expected_keywords": ["graph", "state", "workflow"],
    },
]

@pytest.mark.parametrize("case", TEST_CASES)
def test_pipeline_answers(case):
    result = app.invoke({
        "question": case["question"],
        "documents": [],
        "answer": "",
        "iterations": 0,
        "route": "",
    })
    assert result["answer"], "Pipeline should return a non-empty answer"
    answer_lower = result["answer"].lower()
    matched = sum(1 for kw in case["expected_keywords"] if kw in answer_lower)
    assert matched >= 1, f"Answer missing expected keywords: {case['expected_keywords']}"

Run with: pytest tests/test_agentic_rag.py -v

Production Considerations#

Before shipping to production, review the AI agent testing guide and the security best practices guide.

Key production checklist:

  • Rate limiting: wrap LLM calls in a retry decorator with exponential backoff
  • Observability: add tracing so you can see which nodes execute on each query (see the Langfuse observability tutorial)
  • Persistent vectorstore: replace in-memory Chroma with a persistent or cloud-hosted instance
  • Cost control: use gpt-4o-mini for routing and grading, reserve gpt-4o for final generation
  • Fallback answers: always return the best available answer after MAX_RETRIEVAL_ITERATIONS rather than raising an exception
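For the rate-limiting item, a retry decorator with exponential backoff can be written with the stdlib alone; the delays and jitter below are illustrative defaults, and libraries such as tenacity provide the same pattern off the shelf:

```python
import random
import time
from functools import wraps

def with_backoff(max_attempts: int = 5, base_delay: float = 1.0):
    """Retry the wrapped call on any exception, doubling the delay each attempt."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the error
                    delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
                    time.sleep(delay)
        return wrapper
    return decorator

# Demo with a function that fails twice before succeeding (simulated rate limit)
calls = {"n": 0}

@with_backoff(max_attempts=3, base_delay=0.01)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

print(flaky())  # succeeds on the third attempt
```

In the pipeline, you would wrap the LLM invocations in router.py, grader.py, and generator.py with such a decorator.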

What's Next#

  • Add memory to your system using LangGraph multi-agent patterns
  • Deploy this pipeline as a Docker service with the Docker deployment tutorial
  • Explore the Agentic RAG glossary entry to understand the theoretical background
  • Add structured tracing and evaluation with the Langfuse observability tutorial
  • Review the LangChain foundational tutorial if you need to reinforce core concepts
