Agentic RAG Examples: Multi-Step Retrieval Workflows
Standard retrieval-augmented generation (RAG) is a pipeline: embed the query, retrieve top-k documents, generate. Agentic RAG transforms retrieval into a decision loop — the agent decides what to retrieve, evaluates whether what it found is useful, and retries or routes differently when it is not. This produces significantly higher answer quality on complex questions while staying grounded in your actual knowledge base.
These six examples move from simple query routing to advanced self-correction and corrective retry. Each includes working Python code you can integrate into your own RAG stack. For foundational concepts, read the Agentic RAG Tutorial first, then come back here for concrete implementation patterns.
Example 1: Query Routing RAG#
Use Case: Route each incoming query to the most relevant knowledge base — product documentation, support tickets, or pricing tables — rather than broadcasting to all sources and merging noisy results.
Architecture: Query classifier LLM → routing decision → targeted retriever → generation. The routing step costs only a small model call and dramatically reduces retrieval noise for domain-specific queries.
Key Implementation:
from openai import OpenAI
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
client = OpenAI()
embeddings = OpenAIEmbeddings()
# Separate indices for different knowledge domains
product_docs_index = FAISS.load_local("indexes/product_docs", embeddings, allow_dangerous_deserialization=True)
support_tickets_index = FAISS.load_local("indexes/support_tickets", embeddings, allow_dangerous_deserialization=True)
pricing_index = FAISS.load_local("indexes/pricing", embeddings, allow_dangerous_deserialization=True)
RETRIEVERS = {
"product_docs": product_docs_index.as_retriever(search_kwargs={"k": 4}),
"support_tickets": support_tickets_index.as_retriever(search_kwargs={"k": 4}),
"pricing": pricing_index.as_retriever(search_kwargs={"k": 3}),
}
def route_query(query: str) -> str:
"""Classify query to determine which knowledge base to search."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": f"""Classify this query to exactly one knowledge base.
Options: product_docs, support_tickets, pricing
Query: {query}
Return only the knowledge base name, nothing else."""
}]
)
    # Normalize casing so minor model deviations still match the RETRIEVERS keys
    return response.choices[0].message.content.strip().lower()
def routed_rag(query: str) -> str:
target = route_query(query)
print(f"Routing to: {target}")
retriever = RETRIEVERS.get(target, RETRIEVERS["product_docs"])
docs = retriever.invoke(query)
context = "\n\n".join(d.page_content for d in docs)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": f"Answer based on this {target} context:\n\n{context}"},
{"role": "user", "content": query}
]
)
return response.choices[0].message.content
print(routed_rag("What is the enterprise pricing for 50 seats?"))
print(routed_rag("How do I reset my API key?"))
Outcome: Queries go to the right knowledge base rather than polluting context with irrelevant documents from other domains. Routing accuracy with a small model is typically above 90% for well-defined domain boundaries, and the added cost is negligible compared to the quality improvement.
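To check that routing accuracy claim against your own domain boundaries, a small labeled evaluation set goes a long way. The helper below is a minimal sketch; the `keyword_classify` stub and sample labels are illustrative only, and in practice you would pass `route_query` itself as the classifier.

```python
def routing_accuracy(labeled_queries: list[tuple[str, str]], classify) -> float:
    """Fraction of labeled queries that `classify` routes to the expected knowledge base."""
    correct = sum(1 for query, expected in labeled_queries if classify(query) == expected)
    return correct / len(labeled_queries)

# Stub classifier for illustration; swap in route_query for a real measurement.
def keyword_classify(query: str) -> str:
    q = query.lower()
    if "price" in q or "pricing" in q or "cost" in q:
        return "pricing"
    if "error" in q or "ticket" in q:
        return "support_tickets"
    return "product_docs"

labeled = [
    ("What does the enterprise plan cost?", "pricing"),
    ("How do I configure webhooks?", "product_docs"),
    ("Why am I seeing a 429 error?", "support_tickets"),
    ("Is there a discount for annual pricing?", "pricing"),
]
print(routing_accuracy(labeled, keyword_classify))  # 1.0 on this toy set
```

Re-run this whenever you add a knowledge base or change the routing prompt; a dip below your baseline is an early warning before users notice misrouted answers.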
Example 2: Self-Correcting RAG with Hallucination Detection#
Use Case: Automatically detect when the model's answer is not grounded in the retrieved documents and trigger a retry with a reformulated query — catching hallucinations before they reach the user.
Architecture: Retrieve → generate → grounding checker → retry loop. The grounding checker uses a separate LLM call to score whether every claim in the answer is supported by the retrieved context. Below threshold, the query is reformulated and retrieval repeats.
Key Implementation:
from anthropic import Anthropic
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
client = Anthropic()
vectorstore = FAISS.load_local("indexes/knowledge_base", OpenAIEmbeddings(), allow_dangerous_deserialization=True)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
def retrieve_and_generate(query: str) -> tuple[str, list]:
docs = retriever.invoke(query)
context = "\n\n".join(f"[Doc {i+1}]: {d.page_content}" for i, d in enumerate(docs))
response = client.messages.create(
model="claude-3-5-sonnet-20241022", max_tokens=800,
messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based only on the context above."}]
)
return response.content[0].text, docs
def check_grounding(answer: str, docs: list) -> tuple[bool, str]:
"""Verify every claim in the answer is supported by retrieved documents."""
context = "\n\n".join(d.page_content for d in docs)
response = client.messages.create(
model="claude-3-5-haiku-20241022", max_tokens=200,
messages=[{"role": "user", "content": f"""
Is this answer fully supported by the context? Answer YES or NO followed by one sentence.
Context: {context[:2000]}
Answer: {answer}
"""}]
)
text = response.content[0].text
is_grounded = text.strip().upper().startswith("YES")
return is_grounded, text
def reformulate_query(original: str, issue: str) -> str:
response = client.messages.create(
model="claude-3-5-haiku-20241022", max_tokens=100,
messages=[{"role": "user", "content": f"Rewrite this query to find more specific information.\nOriginal: {original}\nIssue: {issue}\nNew query (one sentence only):"}]
)
return response.content[0].text.strip()
def self_correcting_rag(query: str, max_retries: int = 2) -> str:
for attempt in range(max_retries + 1):
answer, docs = retrieve_and_generate(query)
is_grounded, explanation = check_grounding(answer, docs)
if is_grounded:
return answer
print(f"Attempt {attempt + 1}: Not grounded — {explanation}")
if attempt < max_retries:
query = reformulate_query(query, explanation)
print(f"Reformulated query: {query}")
    return answer  # Return the final attempt even if it never passed the grounding check
print(self_correcting_rag("What are the latency SLAs for the enterprise tier?"))
Outcome: Hallucination rate drops significantly because ungrounded answers trigger automatic retries rather than reaching the user. The grounding check adds one cheap model call per generation cycle — a worthwhile trade-off for factual accuracy in production RAG systems.
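To confirm the retry loop is actually paying for itself, it helps to track grounding outcomes over time. A minimal counter, sketched here as a plain dataclass (the field names are our own, not from any library); wiring it in assumes you modify `self_correcting_rag` to also report whether the final answer was grounded and how many attempts it took.

```python
from dataclasses import dataclass

@dataclass
class GroundingStats:
    total: int = 0
    grounded: int = 0
    retried: int = 0

    def record(self, is_grounded: bool, attempts: int) -> None:
        """Log one completed query: final grounding verdict and attempt count."""
        self.total += 1
        self.grounded += int(is_grounded)
        self.retried += int(attempts > 1)

    @property
    def grounded_rate(self) -> float:
        return self.grounded / self.total if self.total else 0.0

stats = GroundingStats()
stats.record(is_grounded=True, attempts=1)
stats.record(is_grounded=True, attempts=3)
stats.record(is_grounded=False, attempts=3)
print(f"{stats.grounded_rate:.2f} grounded, {stats.retried} needed retries")  # 0.67 grounded, 2 needed retries
```

If `retried` stays near zero, the checker may be too lenient; if `grounded_rate` stays low despite retries, the problem is likely the corpus, not the loop.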
Example 3: Multi-Document RAG with Reranking#
Use Case: Retrieve from multiple document collections simultaneously, then use cross-encoder reranking to select only the genuinely relevant passages before generating — avoiding the quality degradation that comes from stuffing irrelevant context into the prompt.
Architecture: Parallel multi-index retrieval → candidate pool → cross-encoder reranker → top-k selection → generation with source attribution.
Key Implementation:
from openai import OpenAI
from sentence_transformers import CrossEncoder
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
client = OpenAI()
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
embeddings = OpenAIEmbeddings()
indexes = {
"technical_docs": FAISS.load_local("indexes/technical", embeddings, allow_dangerous_deserialization=True),
"release_notes": FAISS.load_local("indexes/releases", embeddings, allow_dangerous_deserialization=True),
"api_reference": FAISS.load_local("indexes/api", embeddings, allow_dangerous_deserialization=True),
}
def retrieve_from_all(query: str, k_per_index: int = 5) -> list[tuple[str, str, str]]:
"""Retrieve from all indexes. Returns list of (source, content, doc_id) tuples."""
all_docs = []
for source_name, index in indexes.items():
docs = index.as_retriever(search_kwargs={"k": k_per_index}).invoke(query)
for doc in docs:
all_docs.append((source_name, doc.page_content, doc.metadata.get("source", "")))
return all_docs
def rerank_documents(query: str, docs: list[tuple], top_k: int = 5) -> list[tuple]:
"""Use cross-encoder to rerank candidates by actual relevance."""
pairs = [(query, doc[1]) for doc in docs]
scores = reranker.predict(pairs)
    # Sort on score only so tied scores never fall through to comparing document tuples
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
return [doc for _, doc in ranked[:top_k]]
def multi_doc_rag(query: str) -> str:
candidates = retrieve_from_all(query)
top_docs = rerank_documents(query, candidates, top_k=5)
context_blocks = [f"[Source: {source}]\n{content}" for source, content, _ in top_docs]
context = "\n\n---\n\n".join(context_blocks)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "Answer using the provided multi-source context. Cite each source as [Source: name]."},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
]
)
return response.choices[0].message.content
print(multi_doc_rag("What changed in the rate limiting behavior in the latest API version?"))
Outcome: Reranking cuts the irrelevant context that degrades generation quality. The cross-encoder scores actual query-document relevance rather than just embedding similarity, which is particularly valuable when queries involve technical terms that appear in many documents with different meanings.
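A fixed top-k can still admit weak passages when few candidates are genuinely relevant. One option is to add a score floor on top of the cutoff. The helper below is a pure-Python sketch; note that the ms-marco cross-encoders emit unbounded logit-like scores, so the `min_score=0.0` default is only a rough heuristic (roughly "more likely relevant than not") that you should calibrate on your own data.

```python
def select_reranked(scored_docs: list[tuple[float, object]], top_k: int = 5,
                    min_score: float = 0.0) -> list[object]:
    """Keep at most top_k documents, dropping any whose rerank score is below min_score."""
    ranked = sorted(scored_docs, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:top_k] if score >= min_score]

# Illustrative scores in the rough range a ms-marco cross-encoder produces.
scored = [(4.2, "doc-a"), (-1.3, "doc-b"), (0.8, "doc-c"), (2.1, "doc-d")]
print(select_reranked(scored, top_k=3, min_score=0.0))  # ['doc-a', 'doc-d', 'doc-c']
```

To use it in `multi_doc_rag`, zip the reranker's scores with the candidates and call this instead of slicing to a fixed `top_k`; sparse-result queries then shrink the context rather than padding it with noise.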
Example 4: Iterative RAG with Web Fallback#
Use Case: For questions the internal knowledge base cannot fully answer — because the information is too recent or simply not in the corpus — automatically fall back to a live web search and integrate both sources.
Architecture: Initial retrieval → relevance gate → if insufficient, trigger web search → combine internal and web context → generate with source attribution.
Key Implementation:
from anthropic import Anthropic
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_community.tools.tavily_search import TavilySearchResults
client = Anthropic()
vectorstore = FAISS.load_local("indexes/knowledge_base", OpenAIEmbeddings(), allow_dangerous_deserialization=True)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
web_search = TavilySearchResults(max_results=3)
def assess_retrieval_quality(query: str, docs: list) -> tuple[bool, str]:
"""Check whether retrieved docs sufficiently answer the query."""
context = "\n".join(d.page_content[:300] for d in docs[:3])
response = client.messages.create(
model="claude-3-5-haiku-20241022", max_tokens=150,
messages=[{"role": "user", "content": f"""
Does this context contain enough information to fully answer the query?
Query: {query}
Context preview: {context}
Answer with SUFFICIENT or INSUFFICIENT and one sentence why.
"""}]
)
text = response.content[0].text
return text.strip().upper().startswith("SUFFICIENT"), text
def iterative_rag_with_fallback(query: str) -> str:
# Step 1: Try internal knowledge base
internal_docs = retriever.invoke(query)
is_sufficient, assessment = assess_retrieval_quality(query, internal_docs)
print(f"Internal retrieval: {assessment}")
context_parts = []
if internal_docs:
internal_context = "\n\n".join(d.page_content for d in internal_docs)
context_parts.append(f"[Internal Knowledge Base]\n{internal_context}")
# Step 2: Fall back to web search if internal is insufficient
if not is_sufficient:
print("Falling back to web search...")
web_results = web_search.invoke(query)
web_context = "\n\n".join(f"[Web - {r['url']}]\n{r['content']}" for r in web_results)
context_parts.append(f"[Web Search Results]\n{web_context}")
combined_context = "\n\n===\n\n".join(context_parts)
response = client.messages.create(
model="claude-3-5-sonnet-20241022", max_tokens=1000,
messages=[{"role": "user", "content": f"""
Answer the question using the provided context. Indicate which source supports each key point.
Context:
{combined_context}
Question: {query}
"""}]
)
return response.content[0].text
print(iterative_rag_with_fallback("What are the new features in the 3.2 release announced last week?"))
Outcome: Queries about recent events or information gaps in your corpus get useful answers instead of "I don't have enough information." The relevance gate prevents unnecessary web calls for questions your knowledge base handles well, keeping latency and API costs low for the common case.
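Because the sufficiency gate runs on every query, a small cache keyed on the normalized query avoids re-assessing repeats. The sketch below is a minimal in-memory version under two assumptions worth stating: retrieval is treated as deterministic for a given query (the cache ignores `docs`), and a production system would want TTL-based expiry so index updates are eventually re-assessed. The `fake_assess` stub stands in for `assess_retrieval_quality`.

```python
def make_cached_gate(assess):
    """Wrap a sufficiency-check callable with an in-memory cache on the normalized query."""
    cache: dict[str, tuple[bool, str]] = {}

    def gated(query: str, docs: list) -> tuple[bool, str]:
        # Collapse whitespace and casing so trivially different phrasings share a key
        key = " ".join(query.lower().split())
        if key not in cache:
            cache[key] = assess(query, docs)
        return cache[key]

    return gated

calls = 0
def fake_assess(query, docs):
    global calls
    calls += 1
    return True, "SUFFICIENT"

gate = make_cached_gate(fake_assess)
gate("What is the SLA?", [])
gate("what is   the SLA?", [])  # normalizes to the same key, so no second call
print(calls)  # 1
```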
Example 5: Conversational RAG with Memory#
Use Case: Maintain conversation history across multiple turns so the agent can handle follow-up questions like "tell me more about the third point" or "compare that with what you said earlier" without losing context from prior retrieval steps.
Architecture: Chat history store + query contextualization (rewrite follow-up questions to be standalone) + retrieval + generation with full history. The critical step is query rewriting — naive follow-up questions like "what about the pricing?" fail vector search without the conversation context baked in.
Key Implementation:
from anthropic import Anthropic
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_core.messages import HumanMessage, AIMessage
client = Anthropic()
vectorstore = FAISS.load_local("indexes/knowledge_base", OpenAIEmbeddings(), allow_dangerous_deserialization=True)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
def contextualize_query(query: str, chat_history: list) -> str:
"""Rewrite a follow-up query as a standalone question for vector search."""
if not chat_history:
return query
history_text = "\n".join(
f"{'User' if isinstance(m, HumanMessage) else 'Assistant'}: {m.content}"
for m in chat_history[-6:] # Last 3 turns
)
response = client.messages.create(
model="claude-3-5-haiku-20241022", max_tokens=100,
messages=[{"role": "user", "content": f"""
Given this conversation history, rewrite the follow-up question as a standalone question.
History: {history_text}
Follow-up: {query}
Standalone question (one sentence only):
"""}]
)
return response.content[0].text.strip()
def conversational_rag(query: str, chat_history: list) -> tuple[str, list]:
# Rewrite query using conversation context for better retrieval
standalone_query = contextualize_query(query, chat_history)
print(f"Contextualized: {standalone_query}")
docs = retriever.invoke(standalone_query)
context = "\n\n".join(d.page_content for d in docs)
# Build full conversation for generation
history_messages = [
{"role": "user" if isinstance(m, HumanMessage) else "assistant", "content": m.content}
for m in chat_history
]
history_messages.append({
"role": "user",
"content": f"Context from knowledge base:\n{context}\n\nQuestion: {query}"
})
response = client.messages.create(
model="claude-3-5-sonnet-20241022", max_tokens=600,
system="You are a helpful assistant. Answer using the provided context and conversation history.",
messages=history_messages
)
answer = response.content[0].text
chat_history.append(HumanMessage(content=query))
chat_history.append(AIMessage(content=answer))
return answer, chat_history
history = []
answer, history = conversational_rag("What authentication methods does the API support?", history)
print(f"Turn 1: {answer}\n")
answer, history = conversational_rag("Which of those is recommended for production?", history)
print(f"Turn 2: {answer}")
Outcome: Follow-up questions that reference prior context work reliably because the standalone query rewrite makes the retrieval query specific enough for vector search. Without this step, follow-up questions like "what about pricing?" would return unrelated documents.
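One caveat with the code above: `chat_history` grows without bound, and a long session will eventually exceed the model's context window. A simple character-budget trim that drops whole messages from the oldest end is sketched below; the budget value is arbitrary, and a real token counter would be more precise. It works with any message object exposing a `.content` string, including the LangChain message classes used in the example.

```python
def trim_history(history: list, max_chars: int = 8000) -> list:
    """Return the most recent messages whose combined content fits within max_chars."""
    kept, used = [], 0
    for message in reversed(history):  # walk newest-first, keep until the budget is spent
        used += len(message.content)
        if used > max_chars:
            break
        kept.append(message)
    return list(reversed(kept))

# Demo with stand-in message objects, each carrying 50 characters of content.
from types import SimpleNamespace
msgs = [SimpleNamespace(content="x" * 50) for _ in range(10)]
print(len(trim_history(msgs, max_chars=120)))  # 2
```

Calling `chat_history = trim_history(chat_history)` before each `conversational_rag` turn keeps the prompt bounded; if your model requires strict user/assistant alternation, you may also want to drop an orphaned leading assistant message after trimming.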
Example 6: Corrective RAG with Grade-and-Retry#
Use Case: Build a fully automated quality gate that grades every retrieved document for relevance, discards low-quality ones, generates only from high-quality context, and retries with web search if not enough documents pass the grade threshold.
Architecture: Retrieve → grade each document individually → keep only grade A/B documents → if fewer than 3 pass, add web results → generate. This is the CRAG (Corrective RAG) pattern applied end to end.
Key Implementation:
from openai import OpenAI
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_community.tools.tavily_search import TavilySearchResults
from pydantic import BaseModel
import json
client = OpenAI()
vectorstore = FAISS.load_local("indexes/knowledge_base", OpenAIEmbeddings(), allow_dangerous_deserialization=True)
retriever = vectorstore.as_retriever(search_kwargs={"k": 6})
web_search = TavilySearchResults(max_results=3)
class DocumentGrade(BaseModel):
grade: str # "A" = highly relevant, "B" = useful background, "C" = not relevant
reason: str
def grade_document(query: str, document: str) -> DocumentGrade:
response = client.chat.completions.create(
model="gpt-4o-mini",
response_format={"type": "json_object"},
messages=[{"role": "user", "content": f"""
Grade this document's relevance to the query.
A = directly answers the query
B = provides useful background context
C = not relevant
Query: {query}
Document: {document[:500]}
Return JSON: {{"grade": "A"|"B"|"C", "reason": "..."}}
"""}]
)
return DocumentGrade(**json.loads(response.choices[0].message.content))
def corrective_rag(query: str) -> str:
# Step 1: Retrieve and grade each document
raw_docs = retriever.invoke(query)
graded_docs = []
for doc in raw_docs:
grade = grade_document(query, doc.page_content)
print(f"Grade {grade.grade}: {doc.page_content[:80]}...")
if grade.grade in ("A", "B"):
graded_docs.append(doc)
# Step 2: Supplement with web search if insufficient high-quality docs
if len(graded_docs) < 3:
print(f"Only {len(graded_docs)} quality docs — supplementing with web search")
web_results = web_search.invoke(query)
web_texts = [r["content"] for r in web_results]
else:
web_texts = []
# Step 3: Build final context from graded docs + optional web results
context_parts = [doc.page_content for doc in graded_docs] + web_texts
context = "\n\n---\n\n".join(context_parts)
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "Answer concisely using only information from the provided context."},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
]
)
return response.choices[0].message.content
print(corrective_rag("How does the rate limiting work for burst traffic?"))
Outcome: Generation happens only from documents that passed the relevance grade, eliminating the garbage-in-garbage-out problem that plagues naive RAG. The grade-and-retry loop consistently produces higher answer quality than fixed top-k retrieval, especially for precise factual questions where most retrieved documents are only tangentially related.
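Grading six documents sequentially means six model round-trips before generation even starts. Since the grades are independent, they parallelize cleanly with a thread pool; the sketch below uses a stub grader so it is self-contained, but `lambda q, d: grade_document(q, d).grade` from the example plugs in directly.

```python
from concurrent.futures import ThreadPoolExecutor

def grade_all(query: str, documents: list[str], grade_fn, max_workers: int = 6) -> list[str]:
    """Grade every document concurrently; grades come back in the original document order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda doc: grade_fn(query, doc), documents))

# Stub grader for illustration only; it checks for the query's first word.
def stub_grade(query: str, doc: str) -> str:
    return "A" if query.split()[0].lower() in doc.lower() else "C"

docs = ["Burst traffic is rate limited per key...", "Unrelated billing FAQ"]
grades = grade_all("burst rate limits", docs, stub_grade)
kept = [d for d, g in zip(docs, grades) if g in ("A", "B")]
print(grades, len(kept))  # ['A', 'C'] 1
```

With `max_workers` matched to the retriever's `k`, grading latency collapses to roughly one call's worth; just keep the pool size within your API rate limits.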
Choosing the Right Agentic RAG Pattern#
Start with query routing (Example 1) if you have multiple distinct knowledge bases — it is the highest-leverage, lowest-complexity improvement over basic RAG. Add the grounding checker from Example 2 if hallucination rate is your primary concern. Implement corrective RAG (Example 6) for production systems where answer quality must be consistent and measurable across diverse query types.
Conversational RAG (Example 5) is essential for any chat interface — without query contextualization, multi-turn conversations fail retrieval silently and users experience a noticeable quality drop on follow-up questions.
Getting Started#
The Agentic RAG Tutorial walks through setting up the retrieval infrastructure for these patterns end to end. For framework-specific implementations, see LangChain Agent Examples and OpenAI Agents SDK Examples. For a detailed framework comparison, review LangChain vs LlamaIndex.
For production RAG deployments, the AI Agent Evaluation Guide covers how to measure retrieval quality, answer grounding, and end-to-end pipeline performance so you can track improvements as you add each pattern from this guide.