Introduction to RAG for AI Agents: Build Knowledge-Grounded Agents
Large language models know a lot, but they don't know your data. Retrieval-Augmented Generation (RAG) bridges that gap by connecting AI agents to your private knowledge bases, documents, and databases. In this tutorial, you'll learn the RAG pipeline end to end and build a knowledge-grounded agent.
What You'll Learn#
- What RAG is and why agents need it
- How vector databases and embeddings work
- Chunking strategies for different document types
- Building a complete RAG pipeline step by step
- Evaluation techniques to measure RAG quality
Prerequisites#
- Understanding of AI agent architecture
- Basic Python knowledge
- Familiarity with what AI agents are
Why Agents Need RAG#
LLMs have three critical limitations:
- Knowledge cutoff: They don't know about events after their training date
- No private data access: They can't read your company's internal docs
- Hallucination risk: Without grounding, they may generate plausible but incorrect answers
RAG solves all three by fetching relevant documents before the LLM generates a response.
User Query
    │
    ▼
┌──────────┐     ┌───────────────┐     ┌────────────┐
│  Embed   │ ──▶ │ Search Vector │ ──▶ │  Retrieve  │
│  Query   │     │   Database    │     │ Top K Docs │
└──────────┘     └───────────────┘     └─────┬──────┘
                                             │
                                             ▼
                                  ┌─────────────────┐
                                  │  LLM generates  │
                                  │  answer using   │
                                  │  retrieved docs │
                                  └─────────────────┘
Step 1: Understand the RAG Pipeline#
The RAG pipeline has two phases:
Indexing Phase (Offline)#
Run once to prepare your knowledge base:
- Load documents: PDFs, web pages, databases, Notion, Google Docs
- Split into chunks: Break large docs into smaller pieces
- Generate embeddings: Convert each chunk to a vector
- Store in vector database: Index vectors for fast retrieval
Query Phase (Online)#
Runs every time the agent needs information:
- Embed the query: Convert the user question to a vector
- Similarity search: Find the most relevant document chunks
- Augment the prompt: Add retrieved chunks to the LLM context
- Generate response: LLM answers grounded in the retrieved data
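The two phases above can be sketched end to end with a toy in-memory index. The "embeddings" here are just word sets compared with Jaccard overlap, a stand-in for a real embedding model used only to make the control flow concrete (the sample documents and helper names are hypothetical):

```python
# Toy stand-in for an embedding model: represent text as a set of words
def embed(text):
    return set(text.lower().split())

# Toy stand-in for vector similarity: Jaccard overlap between word sets
def similarity(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

# Indexing phase (offline): chunk, embed, store
documents = [
    "Refunds are issued within 14 days of purchase.",
    "Shipping takes 3-5 business days within the EU.",
]
index = [(embed(doc), doc) for doc in documents]

# Query phase (online): embed the query, search, take the top K
def retrieve_top_k(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda item: -similarity(q, item[0]))
    return [doc for _, doc in ranked[:k]]

print(retrieve_top_k("how long do refunds take"))
# ['Refunds are issued within 14 days of purchase.']
```

In a real system, `embed` calls an embedding model and the index lives in a vector database; Steps 3 and 4 below swap in those real components.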
Step 2: Chunking Strategies#
How you split documents dramatically affects retrieval quality.
Fixed-Size Chunking#
def fixed_size_chunk(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks
Pros: Simple, predictable chunk sizes
Cons: May split mid-sentence or mid-paragraph
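To see the overlap behavior concretely, here is a quick check (the function above is repeated so the snippet runs standalone, with small sizes chosen for illustration):

```python
def fixed_size_chunk(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks

text = "abcdefghij" * 10  # 100 characters
chunks = fixed_size_chunk(text, chunk_size=30, overlap=5)

# Each chunk starts 25 characters after the previous one,
# so consecutive chunks share a 5-character overlap
print(len(chunks))                      # 4
print(chunks[0][-5:] == chunks[1][:5])  # True
```

The overlap is what keeps a sentence split at a chunk boundary recoverable from at least one of the two chunks.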
Semantic Chunking#
Split at natural boundaries (paragraphs, sections, headings):
def semantic_chunk(text):
    # Split by paragraphs first
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ""
    for para in paragraphs:
        if len(current_chunk) + len(para) < 1000:
            current_chunk += para + "\n\n"
        else:
            if current_chunk.strip():  # Avoid emitting an empty first chunk
                chunks.append(current_chunk.strip())
            current_chunk = para + "\n\n"
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks
Pros: Preserves context and meaning
Cons: Variable chunk sizes, more complex
Recommended Settings by Document Type#
| Document Type | Chunk Size | Overlap | Strategy |
|---------------|------------|---------|----------|
| Knowledge base articles | 500-800 tokens | 50 tokens | Semantic |
| Legal documents | 1000-1500 tokens | 100 tokens | Section-based |
| Code documentation | 300-500 tokens | 30 tokens | Function/class-based |
| Chat transcripts | 200-400 tokens | 20 tokens | Message-based |
| Research papers | 800-1200 tokens | 80 tokens | Paragraph-based |
Step 3: Embeddings and Vector Databases#
Embedding Models#
An embedding model converts text into a dense vector (a list of numbers). Similar texts produce similar vectors.
| Model | Dimensions | Best For | Cost |
|-------|------------|----------|------|
| OpenAI text-embedding-3-small | 1536 | General purpose | $0.02/1M tokens |
| OpenAI text-embedding-3-large | 3072 | High accuracy | $0.13/1M tokens |
| Cohere embed-v3 | 1024 | Multilingual | $0.10/1M tokens |
| BAAI/bge-large-en-v1.5 | 1024 | Self-hosted, free | Free (compute costs) |
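Under the hood, "similar vectors" usually means high cosine similarity. A minimal sketch with hypothetical 3-dimensional vectors (real embeddings have the hundreds or thousands of dimensions listed above):

```python
import math

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical tiny "embeddings" for illustration only
v_cat = [0.9, 0.1, 0.0]
v_kitten = [0.85, 0.2, 0.05]
v_car = [0.0, 0.1, 0.95]

# Related concepts score near 1, unrelated ones near 0
print(cosine_similarity(v_cat, v_kitten) > cosine_similarity(v_cat, v_car))  # True
```

Vector databases compute exactly this kind of comparison (cosine, dot product, or Euclidean distance) at scale, using approximate nearest-neighbor indexes rather than brute force.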
Vector Databases#
| Database | Type | Best For | Pricing |
|----------|------|----------|---------|
| Pinecone | Managed cloud | Production apps | Free tier, then per-vector |
| Chroma | Open source, local | Prototyping | Free |
| Weaviate | Open source + cloud | Hybrid search | Free tier available |
| Qdrant | Open source + cloud | High performance | Free tier available |
| pgvector | Postgres extension | Existing Postgres users | Free (extension) |
Generating and Storing Embeddings#
from openai import OpenAI
import chromadb

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("knowledge_base")

# Generate embeddings and store
def index_documents(chunks):
    for i, chunk in enumerate(chunks):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunk
        )
        embedding = response.data[0].embedding
        collection.add(
            ids=[f"chunk_{i}"],
            embeddings=[embedding],
            documents=[chunk],
            metadatas=[{"source": "docs", "index": i}]
        )

# Query the knowledge base
def retrieve(query, top_k=5):
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results["documents"][0]
Step 4: Build the RAG Agent#
Combine retrieval with the LLM to create a knowledge-grounded agent:
from openai import OpenAI

client = OpenAI()

def rag_agent(user_query):
    # Step 1: Retrieve relevant documents
    relevant_docs = retrieve(user_query, top_k=5)
    context = "\n\n---\n\n".join(relevant_docs)

    # Step 2: Build augmented prompt
    system_prompt = """You are a knowledgeable support agent.
Answer questions using ONLY the provided context documents.
If the context doesn't contain the answer, say:
"I don't have enough information to answer that question."
Never make up information not found in the context."""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"""Context documents:

{context}

---

User question: {user_query}

Answer based on the context above:"""}
    ]

    # Step 3: Generate grounded response
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.1  # Low temperature for factual accuracy
    )
    return response.choices[0].message.content
Step 5: Evaluate RAG Quality#
RAG quality depends on two factors: retrieval quality and generation quality.
Retrieval Metrics#
| Metric | What it measures | Target |
|--------|------------------|--------|
| Recall@K | % of relevant docs in top K results | > 80% |
| Precision@K | % of top K results that are relevant | > 60% |
| MRR (Mean Reciprocal Rank) | Position of first relevant result | > 0.7 |
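These metrics are straightforward to compute from a labeled test set. A sketch, where `relevant` is the set of document IDs a human marked relevant for a query and `retrieved` is the ranked list your retriever returned (the IDs shown are hypothetical):

```python
def recall_at_k(relevant, retrieved, k):
    # Fraction of all relevant docs that appear in the top K results
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def precision_at_k(relevant, retrieved, k):
    # Fraction of the top K results that are relevant
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k

def reciprocal_rank(relevant, retrieved):
    # 1/rank of the first relevant result; 0 if none was retrieved
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1 / rank
    return 0.0

relevant = {"doc_1", "doc_3"}
retrieved = ["doc_2", "doc_1", "doc_4"]
print(recall_at_k(relevant, retrieved, 3))   # 0.5
print(reciprocal_rank(relevant, retrieved))  # 0.5
```

MRR is the mean of `reciprocal_rank` across all queries in your test set; average the per-query values the same way for Recall@K and Precision@K.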
Generation Metrics#
| Metric | What it measures | How to check |
|--------|------------------|--------------|
| Faithfulness | Does the answer match retrieved docs? | LLM-as-judge or human review |
| Relevance | Does the answer address the question? | LLM-as-judge scoring |
| Completeness | Does the answer cover all relevant info? | Checklist comparison |
Quick Evaluation Script#
def evaluate_rag(test_cases):
    results = []
    for case in test_cases:
        query = case["query"]

        # Get RAG response
        response = rag_agent(query)

        # Check if key facts are present
        facts_found = sum(
            1 for fact in case["key_facts"]
            if fact.lower() in response.lower()
        )
        accuracy = facts_found / len(case["key_facts"])
        results.append({
            "query": query,
            "accuracy": accuracy,
            "response": response
        })
    avg_accuracy = sum(r["accuracy"] for r in results) / len(results)
    print(f"Average accuracy: {avg_accuracy:.1%}")
    return results
Advanced RAG Techniques#
Hybrid Search#
Combine vector similarity with keyword search for better results:
# Pseudo-code for hybrid retrieval
def hybrid_retrieve(query, top_k=5, alpha=0.7):
    vector_results = vector_search(query, top_k=top_k * 2)
    keyword_results = bm25_search(query, top_k=top_k * 2)

    # Weighted combination
    combined = {}
    for doc, score in vector_results:
        combined[doc] = alpha * score
    for doc, score in keyword_results:
        combined[doc] = combined.get(doc, 0) + (1 - alpha) * score

    # Return top K by combined score
    sorted_docs = sorted(combined.items(), key=lambda x: -x[1])
    return [doc for doc, score in sorted_docs[:top_k]]
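One practical wrinkle: cosine scores and BM25 scores live on very different numeric ranges, so combining them raw lets one side dominate regardless of `alpha`. A simple min-max normalization, applied to each result list before the weighted sum, keeps the combination meaningful (a sketch over `(doc, score)` pairs):

```python
def normalize_scores(results):
    # Min-max scale a list of (doc, score) pairs so scores fall in [0, 1]
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every doc as equally strong
        return [(doc, 1.0) for doc, _ in results]
    return [(doc, (score - lo) / (hi - lo)) for doc, score in results]

print(normalize_scores([("a", 2.0), ("b", 4.0), ("c", 3.0)]))
# [('a', 0.0), ('b', 1.0), ('c', 0.5)]
```

Rank-based fusion schemes avoid the scaling question entirely by combining result positions instead of raw scores, at the cost of discarding score magnitudes.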
Re-ranking#
After initial retrieval, use a cross-encoder to re-rank results for higher precision:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, documents, top_k=3):
    pairs = [(query, doc) for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(
        zip(documents, scores),
        key=lambda x: -x[1]
    )
    return [doc for doc, score in ranked[:top_k]]
Metadata Filtering#
Use document metadata to narrow search scope before vector retrieval:
# Only search in specific categories
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={"category": "product-docs"}  # Metadata filter
)
Common Mistakes to Avoid#
- Chunks too large: The LLM's attention dilutes with too much context; keep chunks focused
- No overlap between chunks: Critical information at chunk boundaries gets lost
- Using the wrong embedding model: Match the model to your domain and language
- Ignoring retrieval evaluation: A great LLM can't compensate for bad retrieval
- No fallback for missing knowledge: Always tell users when the answer isn't in your knowledge base
Next Steps#
- Build an AI Agent with LangChain: implement RAG agents with a framework
- AI Agent for Customer Service: apply RAG to support use cases
- Prompt Engineering for AI Agents: optimize your RAG prompts
Frequently Asked Questions#
How many documents can RAG handle?#
Vector databases can scale to millions of documents efficiently. The bottleneck is usually the initial indexing time and embedding costs, not query-time performance. Most vector databases handle 10M+ vectors with sub-second query times.
How often should I re-index my knowledge base?#
It depends on how frequently your data changes. For static knowledge bases (product docs, policies), re-index weekly or on content updates. For dynamic data (support tickets, news), consider real-time or hourly indexing pipelines.
RAG vs. fine-tuning: when should I use each?#
Use RAG when you need the agent to access frequently changing information or large document collections. Use fine-tuning when you need the agent to learn a specific writing style, domain vocabulary, or reasoning pattern. Many production systems use both.
What chunk size should I start with?#
Start with 500 tokens and 50 token overlap. Test with your actual queries and adjust. If your answers seem incomplete, try larger chunks. If retrieval precision is low, try smaller chunks. There's no universal optimal size; it depends on your documents and queries.