Introduction to RAG for AI Agents: Build Knowledge-Grounded Agents
Large language models know a lot, but they don't know your data. Retrieval-Augmented Generation (RAG) bridges that gap by connecting AI agents to your private knowledge bases, documents, and databases. In this tutorial, you'll learn the RAG pipeline end to end and build a knowledge-grounded agent.
What You'll Learn#
- What RAG is and why agents need it
- How vector databases and embeddings work
- Chunking strategies for different document types
- Building a complete RAG pipeline step by step
- Evaluation techniques to measure RAG quality
Prerequisites#
- Understanding of AI agent architecture
- Basic Python knowledge
- Familiarity with what AI agents are
Why Agents Need RAG#
LLMs have three critical limitations:
- Knowledge cutoff: They don't know about events after their training date
- No private data access: They can't read your company's internal docs
- Hallucination risk: Without grounding, they may generate plausible but incorrect answers
RAG solves all three by fetching relevant documents before the LLM generates a response.
User Query
    │
    ▼
┌──────────┐     ┌───────────────┐     ┌────────────┐
│  Embed   │ ──▶ │ Search Vector │ ──▶ │  Retrieve  │
│  Query   │     │   Database    │     │ Top K Docs │
└──────────┘     └───────────────┘     └─────┬──────┘
                                             │
                                             ▼
                                  ┌─────────────────┐
                                  │  LLM generates  │
                                  │  answer using   │
                                  │  retrieved docs │
                                  └─────────────────┘
Step 1: Understand the RAG Pipeline#
The RAG pipeline has two phases:
Indexing Phase (Offline)#
Run once to prepare your knowledge base:
- Load documents: PDFs, web pages, databases, Notion, Google Docs
- Split into chunks: Break large docs into smaller pieces
- Generate embeddings: Convert each chunk to a vector
- Store in vector database: Index vectors for fast retrieval
Query Phase (Online)#
Runs every time the agent needs information:
- Embed the query: Convert the user question to a vector
- Similarity search: Find the most relevant document chunks
- Augment the prompt: Add retrieved chunks to the LLM context
- Generate response: LLM answers grounded in the retrieved data
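The two phases above can be sketched end to end with a toy in-memory index. The "embeddings" here are just word sets compared with Jaccard overlap, a stand-in for a real embedding model used only to make the control flow concrete (the sample documents and helper names are hypothetical):

```python
# Toy stand-in for an embedding model: represent text as a set of words
def embed(text):
    return set(text.lower().split())

# Toy stand-in for vector similarity: Jaccard overlap between word sets
def similarity(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

# Indexing phase (offline): chunk, embed, store
documents = [
    "Refunds are issued within 14 days of purchase.",
    "Shipping takes 3-5 business days within the EU.",
]
index = [(embed(doc), doc) for doc in documents]

# Query phase (online): embed the query, search, take the top K
def retrieve_top_k(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda item: -similarity(q, item[0]))
    return [doc for _, doc in ranked[:k]]

print(retrieve_top_k("how long do refunds take"))
# ['Refunds are issued within 14 days of purchase.']
```

In a real system, `embed` calls an embedding model and the index lives in a vector database; Steps 3 and 4 below swap in those real components.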
Step 2: Chunking Strategies#
How you split documents dramatically affects retrieval quality.
Fixed-Size Chunking#
def fixed_size_chunk(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks
Pros: Simple, predictable chunk sizes
Cons: May split mid-sentence or mid-paragraph
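To see the overlap behavior concretely, here is a quick check (the function above is repeated so the snippet runs standalone, with small sizes chosen for illustration):

```python
def fixed_size_chunk(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks

text = "abcdefghij" * 10  # 100 characters
chunks = fixed_size_chunk(text, chunk_size=30, overlap=5)

# Each chunk starts 25 characters after the previous one,
# so consecutive chunks share a 5-character overlap
print(len(chunks))                      # 4
print(chunks[0][-5:] == chunks[1][:5])  # True
```

The overlap is what keeps a sentence split at a chunk boundary recoverable from at least one of the two chunks.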
Semantic Chunking#
Split at natural boundaries (paragraphs, sections, headings):
def semantic_chunk(text):
    # Split by paragraphs first
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ""
    for para in paragraphs:
        if len(current_chunk) + len(para) < 1000:
            current_chunk += para + "\n\n"
        else:
            if current_chunk.strip():  # Avoid emitting an empty first chunk
                chunks.append(current_chunk.strip())
            current_chunk = para + "\n\n"
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks
Pros: Preserves context and meaning
Cons: Variable chunk sizes, more complex
Recommended Settings by Document Type#
| Document Type | Chunk Size | Overlap | Strategy |
|---------------|------------|---------|----------|
| Knowledge base articles | 500-800 tokens | 50 tokens | Semantic |
| Legal documents | 1000-1500 tokens | 100 tokens | Section-based |
| Code documentation | 300-500 tokens | 30 tokens | Function/class-based |
| Chat transcripts | 200-400 tokens | 20 tokens | Message-based |
| Research papers | 800-1200 tokens | 80 tokens | Paragraph-based |
Step 3: Embeddings and Vector Databases#
Embedding Models#
An embedding model converts text into a dense vector (a list of numbers). Similar texts produce similar vectors.
| Model | Dimensions | Best For | Cost |
|-------|------------|----------|------|
| OpenAI text-embedding-3-small | 1536 | General purpose | $0.02/1M tokens |
| OpenAI text-embedding-3-large | 3072 | High accuracy | $0.13/1M tokens |
| Cohere embed-v3 | 1024 | Multilingual | $0.10/1M tokens |
| BAAI/bge-large-en-v1.5 | 1024 | Self-hosted, free | Free (compute costs) |
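Under the hood, "similar vectors" usually means high cosine similarity. A minimal sketch with hypothetical 3-dimensional vectors (real embeddings have the hundreds or thousands of dimensions listed above):

```python
import math

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical tiny "embeddings" for illustration only
v_cat = [0.9, 0.1, 0.0]
v_kitten = [0.85, 0.2, 0.05]
v_car = [0.0, 0.1, 0.95]

# Related concepts score near 1, unrelated ones near 0
print(cosine_similarity(v_cat, v_kitten) > cosine_similarity(v_cat, v_car))  # True
```

Vector databases compute exactly this kind of comparison (cosine, dot product, or Euclidean distance) at scale, using approximate nearest-neighbor indexes rather than brute force.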
Vector Databases#
| Database | Type | Best For | Pricing |
|----------|------|----------|---------|
| Pinecone | Managed cloud | Production apps | Free tier, then per-vector |
| Chroma | Open source, local | Prototyping | Free |
| Weaviate | Open source + cloud | Hybrid search | Free tier available |
| Qdrant | Open source + cloud | High performance | Free tier available |
| pgvector | Postgres extension | Existing Postgres users | Free (extension) |
Generating and Storing Embeddings#
from openai import OpenAI
import chromadb

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("knowledge_base")

# Generate embeddings and store
def index_documents(chunks):
    for i, chunk in enumerate(chunks):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunk
        )
        embedding = response.data[0].embedding
        collection.add(
            ids=[f"chunk_{i}"],
            embeddings=[embedding],
            documents=[chunk],
            metadatas=[{"source": "docs", "index": i}]
        )

# Query the knowledge base
def retrieve(query, top_k=5):
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results["documents"][0]
Step 4: Build the RAG Agent#
Combine retrieval with the LLM to create a knowledge-grounded agent:
from openai import OpenAI

client = OpenAI()

def rag_agent(user_query):
    # Step 1: Retrieve relevant documents
    relevant_docs = retrieve(user_query, top_k=5)
    context = "\n\n---\n\n".join(relevant_docs)

    # Step 2: Build augmented prompt
    system_prompt = """You are a knowledgeable support agent.
Answer questions using ONLY the provided context documents.
If the context doesn't contain the answer, say:
"I don't have enough information to answer that question."
Never make up information not found in the context."""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"""Context documents:

{context}

---

User question: {user_query}

Answer based on the context above:"""}
    ]

    # Step 3: Generate grounded response
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.1  # Low temperature for factual accuracy
    )
    return response.choices[0].message.content
Step 5: Evaluate RAG Quality#
RAG quality depends on two factors: retrieval quality and generation quality.
Retrieval Metrics#
| Metric | What it measures | Target |
|--------|------------------|--------|
| Recall@K | % of relevant docs in top K results | > 80% |
| Precision@K | % of top K results that are relevant | > 60% |
| MRR (Mean Reciprocal Rank) | Position of first relevant result | > 0.7 |
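These metrics are straightforward to compute from a labeled test set. A sketch, where `relevant` is the set of document IDs a human marked relevant for a query and `retrieved` is the ranked list your retriever returned (the IDs shown are hypothetical):

```python
def recall_at_k(relevant, retrieved, k):
    # Fraction of all relevant docs that appear in the top K results
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def precision_at_k(relevant, retrieved, k):
    # Fraction of the top K results that are relevant
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k

def reciprocal_rank(relevant, retrieved):
    # 1/rank of the first relevant result; 0 if none was retrieved
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0

relevant = {"doc_1", "doc_3"}
retrieved = ["doc_2", "doc_1", "doc_4"]
print(recall_at_k(relevant, retrieved, 3))   # 0.5
print(reciprocal_rank(relevant, retrieved))  # 0.5
```

MRR is the mean of `reciprocal_rank` across all queries in your test set; average the per-query values the same way for Recall@K and Precision@K.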
Generation Metrics#
| Metric | What it measures | How to check |
|--------|------------------|--------------|
| Faithfulness | Does the answer match retrieved docs? | LLM-as-judge or human review |
| Relevance | Does the answer address the question? | LLM-as-judge scoring |
| Completeness | Does the answer cover all relevant info? | Checklist comparison |
Quick Evaluation Script#
def evaluate_rag(test_cases):
    results = []
    for case in test_cases:
        query = case["query"]

        # Get RAG response
        response = rag_agent(query)

        # Check if key facts are present
        facts_found = sum(
            1 for fact in case["key_facts"]
            if fact.lower() in response.lower()
        )
        accuracy = facts_found / len(case["key_facts"])
        results.append({
            "query": query,
            "accuracy": accuracy,
            "response": response
        })
    avg_accuracy = sum(r["accuracy"] for r in results) / len(results)
    print(f"Average accuracy: {avg_accuracy:.1%}")
    return results
Advanced RAG Techniques#
Hybrid Search#
Combine vector similarity with keyword search for better results:
# Pseudo-code for hybrid retrieval
def hybrid_retrieve(query, top_k=5, alpha=0.7):
    vector_results = vector_search(query, top_k=top_k * 2)
    keyword_results = bm25_search(query, top_k=top_k * 2)

    # Weighted combination
    combined = {}
    for doc, score in vector_results:
        combined[doc] = alpha * score
    for doc, score in keyword_results:
        combined[doc] = combined.get(doc, 0) + (1 - alpha) * score

    # Return top K by combined score
    sorted_docs = sorted(combined.items(), key=lambda x: -x[1])
    return [doc for doc, score in sorted_docs[:top_k]]
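One practical wrinkle: cosine scores and BM25 scores live on very different numeric ranges, so combining them raw lets one side dominate regardless of `alpha`. A simple min-max normalization, applied to each result list before the weighted sum, keeps the combination meaningful (a sketch over `(doc, score)` pairs):

```python
def normalize_scores(results):
    # Min-max scale a list of (doc, score) pairs so scores fall in [0, 1]
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every doc as equally strong
        return [(doc, 1.0) for doc, _ in results]
    return [(doc, (score - lo) / (hi - lo)) for doc, score in results]

print(normalize_scores([("a", 2.0), ("b", 4.0), ("c", 3.0)]))
# [('a', 0.0), ('b', 1.0), ('c', 0.5)]
```

Rank-based fusion schemes avoid the scaling question entirely by combining result positions instead of raw scores, at the cost of discarding score magnitudes.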
Re-ranking#
After initial retrieval, use a cross-encoder to re-rank results for higher precision:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, documents, top_k=3):
    pairs = [(query, doc) for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(
        zip(documents, scores),
        key=lambda x: -x[1]
    )
    return [doc for doc, score in ranked[:top_k]]
Metadata Filtering#
Use document metadata to narrow search scope before vector retrieval:
# Only search in specific categories
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={"category": "product-docs"}  # Metadata filter
)
Common Mistakes to Avoid#
- Chunks too large: The LLM's attention dilutes with too much context; keep chunks focused
- No overlap between chunks: Critical information at chunk boundaries gets lost
- Using the wrong embedding model: Match the model to your domain and language
- Ignoring retrieval evaluation: A great LLM can't compensate for bad retrieval
- No fallback for missing knowledge: Always tell users when the answer isn't in your knowledge base
Next Steps#
- Build an AI Agent with LangChain: implement RAG agents with a framework
- AI Agent for Customer Service: apply RAG to support use cases
- Prompt Engineering for AI Agents: optimize your RAG prompts
Frequently Asked Questions#
How many documents can RAG handle?#
Vector databases can scale to millions of documents efficiently. The bottleneck is usually the initial indexing time and embedding costs, not query-time performance. Most vector databases handle 10M+ vectors with sub-second query times.
How often should I re-index my knowledge base?#
It depends on how frequently your data changes. For static knowledge bases (product docs, policies), re-index weekly or on content updates. For dynamic data (support tickets, news), consider real-time or hourly indexing pipelines.
RAG vs. fine-tuning: when should I use each?#
Use RAG when you need the agent to access frequently changing information or large document collections. Use fine-tuning when you need the agent to learn a specific writing style, domain vocabulary, or reasoning pattern. Many production systems use both.
What chunk size should I start with?#
Start with 500 tokens and 50 token overlap. Test with your actual queries and adjust. If your answers seem incomplete, try larger chunks. If retrieval precision is low, try smaller chunks. There's no universal optimal size; it depends on your documents and queries.