How to Train an AI Agent on Your Own Data

A practical guide to the three approaches for teaching an AI agent about your data: RAG with vector databases, OpenAI fine-tuning, and context injection. Includes full working Python code for each approach.


When people say they want to "train an AI agent on their own data," they usually mean one of three very different things. Conflating them is the most common mistake teams make when starting this journey — and it leads to weeks of wasted effort.

This tutorial covers all three approaches clearly, tells you which one is right for your situation, and provides complete working code for each.

For background on how AI agents use memory and context, see AI Agent Memory and Retrieval-Augmented Generation (RAG).


The Critical Distinction: RAG vs. Fine-Tuning vs. Context Injection

Before writing any code, understand the difference:

| Approach | What changes | Best for | Cost |
|----------|-------------|----------|------|
| RAG | Nothing (model unchanged) | Factual knowledge, large document sets | Low — just embedding and storage |
| Fine-tuning | Model weights | Style, format, specialized vocabulary | Medium — training + higher inference cost |
| Context injection | Nothing (prompt-level only) | Small datasets, < 20 pages of content | Near-zero |

The rule of thumb: If you want the agent to know facts from your documents, use RAG. If you want the agent to respond in a specific way (tone, format, domain jargon), use fine-tuning. If you have a small, stable reference document that fits in a context window, use context injection.

For nearly every team asking "how do I train an AI agent on my data," the correct answer is RAG.


Approach 1: RAG (Retrieval-Augmented Generation)

RAG is the standard architecture for giving an AI agent access to large, domain-specific knowledge bases. The agent doesn't memorize your data — it looks it up at runtime, just like a human researcher using a reference library.

How RAG Works

  1. Your documents are chunked into small passages
  2. Each chunk is converted to a vector embedding (a numerical representation of meaning)
  3. Embeddings are stored in a vector database
  4. At query time, the user's question is also converted to an embedding
  5. The most semantically similar chunks are retrieved from the database
  6. Those chunks are injected into the LLM prompt as context
  7. The LLM generates an answer grounded in the retrieved content
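Steps 4 and 5 above hinge on semantic similarity between embedding vectors, usually measured as cosine similarity. A toy sketch with made-up 3-dimensional vectors (real embeddings have on the order of 1,536 dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings", for illustration only
query = [0.9, 0.1, 0.0]           # "How do refunds work?"
refund_chunk = [0.8, 0.2, 0.1]    # chunk about the refund policy
logging_chunk = [0.0, 0.3, 0.9]   # chunk about log configuration

print(f"{cosine_similarity(query, refund_chunk):.2f}")   # high score: retrieved
print(f"{cosine_similarity(query, logging_chunk):.2f}")  # low score: skipped
```

The vector database performs exactly this ranking, just at scale and with indexing tricks to avoid comparing against every stored vector.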

Install Dependencies

pip install langchain langchain-openai langchain-community chromadb \
            pypdf python-dotenv tiktoken
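All of the scripts below call `load_dotenv()`, which reads environment variables from a `.env` file in the working directory. A minimal `.env` (the key value shown is a placeholder):

```
OPENAI_API_KEY=sk-your-key-here
```

Both the `openai` and `langchain-openai` clients pick up `OPENAI_API_KEY` from the environment automatically.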

Step 1: Load and Chunk Documents

# rag_pipeline.py
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import (
    PyPDFLoader,
    CSVLoader,
    WebBaseLoader,
    TextLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter

load_dotenv()

def load_documents(source_type: str, source_path: str):
    """Load documents from various sources."""

    loaders = {
        "pdf": PyPDFLoader,
        "csv": CSVLoader,
        "web": WebBaseLoader,
        "text": TextLoader
    }

    loader_class = loaders.get(source_type)
    if not loader_class:
        raise ValueError(f"Unsupported source type: {source_type}")

    loader = loader_class(source_path)
    documents = loader.load()
    print(f"Loaded {len(documents)} document(s) from {source_path}")
    return documents


def chunk_documents(documents, chunk_size=800, chunk_overlap=100):
    """
    Split documents into chunks for embedding.

    chunk_size=800: characters per chunk (length_function=len counts characters,
    not tokens; roughly 200 tokens of dense reference content)
    chunk_overlap=100: characters of overlap between chunks to preserve context at boundaries
    RecursiveCharacterTextSplitter tries to split on natural boundaries:
    paragraphs → sentences → words → characters
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""]
    )

    chunks = splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks from {len(documents)} documents")
    return chunks


# Example: load a PDF and a web page together
if __name__ == "__main__":
    # Load a PDF (replace with your actual file path)
    pdf_docs = load_documents("pdf", "data/product_documentation.pdf")

    # Load a web page
    web_docs = load_documents("web", "https://your-internal-wiki.com/policies")

    # Combine all documents
    all_docs = pdf_docs + web_docs

    # Chunk them
    chunks = chunk_documents(all_docs, chunk_size=800, chunk_overlap=100)
    print(f"\nFirst chunk preview:\n{chunks[0].page_content[:300]}...")
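Before embedding a large corpus, it is worth estimating the cost. A rough sketch using the ~4-characters-per-token heuristic for English text (use tiktoken, already in the dependency list, when you need exact counts; the $0.02/1M figure is the text-embedding-3-small rate quoted below):

```python
def estimate_embedding_cost(chunks: list[str],
                            price_per_million_tokens: float = 0.02) -> tuple[int, float]:
    """Rough embedding-cost estimate using the ~4 chars/token heuristic."""
    approx_tokens = sum(len(text) for text in chunks) // 4
    cost = approx_tokens / 1_000_000 * price_per_million_tokens
    return approx_tokens, cost

# 1,000 chunks of ~800 characters each is about 200K tokens
tokens, cost = estimate_embedding_cost(["x" * 800] * 1000)
print(f"~{tokens:,} tokens, ~${cost:.4f}")
```

Even generously sized document sets usually embed for well under a dollar, which is why RAG is the low-cost column in the comparison table.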

Step 2: Embed and Store in Vector Database

# vector_store.py
import os
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

load_dotenv()

def create_vector_store(chunks, persist_directory="./chroma_db"):
    """
    Embed document chunks and store in Chroma vector database.

    OpenAI text-embedding-3-small is the recommended embedding model:
    - 1536 dimensions, excellent quality
    - Cost: $0.02 per 1M tokens — cheap for typical document sets
    """
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    # Create vector store from documents
    # Chroma stores embeddings locally in persist_directory
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_directory
    )

    print(f"Vector store created with {vector_store._collection.count()} vectors")
    print(f"Stored at: {persist_directory}")
    return vector_store


def load_vector_store(persist_directory="./chroma_db"):
    """Load an existing vector store from disk."""
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    vector_store = Chroma(
        persist_directory=persist_directory,
        embedding_function=embeddings
    )
    print(f"Loaded vector store: {vector_store._collection.count()} vectors")
    return vector_store


def test_retrieval(vector_store, query: str, k: int = 4):
    """Test that retrieval is working correctly."""
    results = vector_store.similarity_search(query, k=k)
    print(f"\nQuery: '{query}'")
    print(f"Retrieved {len(results)} chunks:")
    for i, doc in enumerate(results, 1):
        print(f"\n  Chunk {i} (source: {doc.metadata.get('source', 'unknown')}):")
        print(f"  {doc.page_content[:200]}...")
    return results

Step 3: Build the RAG Agent

# rag_agent.py
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

load_dotenv()

def build_rag_agent(vector_store):
    """
    Build a RAG agent using LangChain's RetrievalQA chain.

    This agent will:
    1. Receive a user question
    2. Retrieve relevant chunks from the vector store
    3. Inject them into the prompt as context
    4. Generate an answer grounded in the retrieved content
    """
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)

    # Custom prompt that instructs the LLM to use context
    rag_prompt = PromptTemplate(
        input_variables=["context", "question"],
        template=(
            "You are a knowledgeable assistant with access to our internal documentation. "
            "Use the following retrieved context to answer the question accurately.\n\n"
            "If the context does not contain enough information to answer confidently, "
            "say so explicitly rather than guessing.\n\n"
            "CONTEXT:\n{context}\n\n"
            "QUESTION: {question}\n\n"
            "ANSWER:"
        )
    )

    # Build RetrievalQA chain
    rag_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # "stuff" puts all chunks into one prompt
        retriever=vector_store.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 5}  # Retrieve top 5 most relevant chunks
        ),
        return_source_documents=True,  # Always show what was retrieved
        chain_type_kwargs={"prompt": rag_prompt}
    )

    return rag_chain


def ask_agent(rag_chain, question: str) -> dict:
    """Query the RAG agent and display results with source attribution."""
    result = rag_chain.invoke({"query": question})

    print(f"\nQ: {question}")
    print(f"\nA: {result['result']}")
    print(f"\nSources used:")
    for doc in result.get("source_documents", []):
        source = doc.metadata.get("source", "unknown")
        print(f"  - {source}: {doc.page_content[:100]}...")

    return result


# Example usage
if __name__ == "__main__":
    from rag_pipeline import load_documents, chunk_documents
    from vector_store import create_vector_store

    # Load your documents (replace with real paths)
    docs = load_documents("text", "data/company_faq.txt")
    chunks = chunk_documents(docs)
    vector_store = create_vector_store(chunks)

    # Build agent
    agent = build_rag_agent(vector_store)

    # Test it
    questions = [
        "What is our refund policy for enterprise customers?",
        "How do I reset my API key?",
        "What are the data retention settings?"
    ]

    for q in questions:
        ask_agent(agent, q)
        print("-" * 60)

Approach 2: Fine-Tuning

Fine-tuning changes the model's weights by training on your examples. Use it when you need the agent to adopt a specific style, format, or domain vocabulary — not when you need it to know facts.

When fine-tuning is worth it:

  • Your agent needs a very specific response format (structured JSON, a proprietary template)
  • Extensive domain jargon that the base model doesn't know well (medical, legal, highly specialized engineering)
  • Consistent tone and persona that is hard to enforce via prompting
  • Reducing prompt length at high inference volume (fine-tuned models need less in-context instruction)

When fine-tuning is NOT worth it:

  • You want the agent to know your product documentation — use RAG
  • You want the agent to answer from your latest data — RAG is always more current
  • You're on a tight timeline — fine-tuning requires data preparation and iteration
If fine-tuning is the right fit, the workflow has four steps: prepare JSONL training data, upload it, create the job, and poll until it finishes.

# fine_tuning_example.py
import os
import json
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI()

# ── Step 1: Prepare Training Data ─────────────────────────────────────────
# Fine-tuning requires JSONL format: one training example per line
# Each example is a conversation with system, user, and assistant turns

def create_training_examples():
    """
    Create fine-tuning training examples.

    For a customer support agent, each example shows:
    - The system role behavior
    - A user message
    - The ideal assistant response

    Minimum 10 examples recommended; 50-100+ for meaningful behavior change.
    """
    examples = [
        {
            "messages": [
                {"role": "system", "content": "You are Aria, a support agent for Acme SaaS. Always respond in a professional, concise tone. Lead with the solution, not the empathy."},
                {"role": "user", "content": "I can't log into my account. It says my password is wrong but I just reset it."},
                {"role": "assistant", "content": "Your new password may take 2-3 minutes to propagate across our systems. Please wait 3 minutes and try again. If the issue persists, clear your browser cache (Ctrl+Shift+Delete) before logging in. Still locked out? I can escalate this to our auth team with a 1-hour SLA."}
            ]
        },
        {
            "messages": [
                {"role": "system", "content": "You are Aria, a support agent for Acme SaaS. Always respond in a professional, concise tone. Lead with the solution, not the empathy."},
                {"role": "user", "content": "How do I export my data?"},
                {"role": "assistant", "content": "Go to Settings → Data Management → Export. Choose your date range and format (CSV or JSON), then click 'Request Export'. You'll receive a download link by email within 15 minutes. Exports are retained for 48 hours. Need a custom date range or format? I can open a data request ticket."}
            ]
        },
        # Add 48+ more examples following the same pattern...
    ]

    # Write to JSONL file
    output_path = "training_data.jsonl"
    with open(output_path, "w") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")

    print(f"Created {len(examples)} training examples in {output_path}")
    return output_path


def upload_and_fine_tune(training_file_path: str, model: str = "gpt-4o-mini-2024-07-18"):
    """Upload training data and start a fine-tuning job."""

    # Step 1: Upload training file
    print("Uploading training file...")
    with open(training_file_path, "rb") as f:
        upload_response = client.files.create(file=f, purpose="fine-tune")

    file_id = upload_response.id
    print(f"Uploaded file ID: {file_id}")

    # Step 2: Create fine-tuning job
    print("Starting fine-tuning job...")
    job = client.fine_tuning.jobs.create(
        training_file=file_id,
        model=model,
        hyperparameters={
            "n_epochs": 3,  # 3-5 epochs is typical; more risks overfitting
        }
    )

    print(f"Fine-tuning job created: {job.id}")
    print(f"Status: {job.status}")
    print(f"Estimated finish: check job status in ~15-60 minutes depending on dataset size")
    return job.id


def check_fine_tuning_status(job_id: str):
    """Check the status of a fine-tuning job."""
    job = client.fine_tuning.jobs.retrieve(job_id)
    print(f"Job {job_id}: {job.status}")

    if job.status == "succeeded":
        print(f"Fine-tuned model ID: {job.fine_tuned_model}")
        return job.fine_tuned_model

    if job.status == "failed":
        print(f"Error: {job.error}")

    return None


def use_fine_tuned_model(model_id: str, user_message: str):
    """Use the fine-tuned model for inference."""
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model=model_id, temperature=0.2)
    # Use exactly as you would use any other model — the fine-tuned behavior is baked in
    response = llm.invoke(user_message)
    return response.content


# Fine-tuning cost estimate function
def estimate_fine_tuning_cost(num_examples: int, avg_tokens_per_example: int, n_epochs: int = 3):
    """Estimate fine-tuning cost for gpt-4o-mini."""
    total_training_tokens = num_examples * avg_tokens_per_example * n_epochs
    cost_per_1k_tokens = 0.003  # gpt-4o-mini fine-tuning rate; check OpenAI's pricing page for current rates
    estimated_cost = (total_training_tokens / 1000) * cost_per_1k_tokens

    print(f"Training tokens: {total_training_tokens:,}")
    print(f"Estimated training cost: ${estimated_cost:.2f}")
    print(f"Note: This excludes ongoing inference cost of the fine-tuned model")
    return estimated_cost


if __name__ == "__main__":
    # Estimate cost before committing
    estimate_fine_tuning_cost(
        num_examples=100,
        avg_tokens_per_example=250,
        n_epochs=3
    )
    # Output: Training tokens: 75,000 | Estimated training cost: $0.23
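OpenAI rejects malformed training files, so it pays to validate the JSONL locally before uploading. A minimal validator sketch; it only checks JSON parsing and the presence of the three roles, while OpenAI's fine-tuning guide describes stricter checks:

```python
import json

REQUIRED_ROLES = {"system", "user", "assistant"}

def validate_training_file(path: str) -> list[str]:
    """Return a list of problems found in a fine-tuning JSONL file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {lineno}: invalid JSON")
                continue
            roles = {m.get("role") for m in example.get("messages", [])}
            missing = REQUIRED_ROLES - roles
            if missing:
                problems.append(f"line {lineno}: missing roles {sorted(missing)}")
    return problems
```

Run it against training_data.jsonl and fix anything it reports before calling upload_and_fine_tune.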

Approach 3: Context Injection

For small, stable reference documents that fit in the context window, just inject them directly into the system prompt. No vector database, no embedding, no training required.

# context_injection.py
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

load_dotenv()

def load_context_document(file_path: str) -> str:
    """Load a small document to inject as context."""
    with open(file_path, "r") as f:
        return f.read()


def build_context_injection_agent(context_document: str):
    """
    Build an agent that answers from injected context.

    Appropriate for:
    - Documents under ~20 pages (comfortably within gpt-4o's 128K-token context window)
    - Static reference content (doesn't change daily)
    - High-precision tasks where you want the full document in view

    NOT appropriate for:
    - Large document sets (100+ documents)
    - Documents that update frequently
    - Cost-sensitive deployments (full context injected on every call)
    """
    llm = ChatOpenAI(model="gpt-4o", temperature=0.1)

    system_prompt = (
        "You are a knowledgeable assistant. Answer questions using ONLY the reference "
        "document provided below. Do not use general knowledge or make assumptions "
        "beyond what is explicitly stated in the document. If the answer is not "
        "in the document, say 'This is not covered in our documentation.'\n\n"
        f"REFERENCE DOCUMENT:\n{'-'*40}\n{context_document}\n{'-'*40}"
    )

    def answer(question: str) -> str:
        response = llm.invoke([
            SystemMessage(content=system_prompt),
            HumanMessage(content=question)
        ])
        return response.content

    return answer


if __name__ == "__main__":
    # Example: small internal FAQ document
    sample_context = """
    ACME ENTERPRISE: FREQUENTLY ASKED QUESTIONS

    Q: What payment methods do you accept?
    A: We accept ACH bank transfer, wire transfer, and major credit cards (Visa, Mastercard, Amex).
    Annual contracts over $50,000 are invoiced quarterly. Credit card payments incur a 2.9% processing fee.

    Q: What is your data retention policy?
    A: Customer data is retained for 7 years from contract end date per SOC 2 requirements.
    Customers can request earlier deletion via a Data Deletion Request form.

    Q: Do you offer a free trial?
    A: Yes. 14-day full-access trial for teams under 25 seats. Enterprise trials (25+ seats)
    require a brief qualification call with our team.
    """

    agent = build_context_injection_agent(sample_context)

    print(agent("What credit cards do you accept?"))
    print(agent("Can I get a free trial for 50 people?"))
    print(agent("What is your pricing for the starter plan?"))
    # Last question: "This is not covered in our documentation."
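Before committing to context injection, sanity-check that the document actually fits in the window with room left over for the question and answer. A rough sketch using the same ~4-characters-per-token heuristic (swap in tiktoken for exact counts):

```python
def fits_in_context(document: str,
                    context_window: int = 128_000,
                    reserve_tokens: int = 8_000) -> bool:
    """Rough check that a document fits, reserving room for question + answer."""
    approx_tokens = len(document) // 4  # ~4 characters per token for English text
    return approx_tokens <= context_window - reserve_tokens

print(fits_in_context("word " * 10_000))  # ~12.5K tokens: fits easily
```

If this check fails, that is a strong signal to move to RAG rather than trying to trim the document.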

Choosing the Right Approach

Use this decision tree:

How large is your data?
├── Under 20 pages, rarely changes → CONTEXT INJECTION (simplest)
├── 20+ pages OR changes frequently → Continue...
│
Does the agent need to know FACTS from documents, or BEHAVE differently?
├── Know facts (product docs, policies, support articles) → RAG
├── Behave differently (tone, format, specialized style) → FINE-TUNING
│
For RAG: How often does your data change?
├── Daily/weekly (live docs, support tickets) → RAG with regular re-ingestion pipeline
├── Monthly/quarterly (policies, product docs) → RAG with scheduled refresh
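The tree above can also be written down as a small helper, useful as a starting point in planning docs (the thresholds mirror the rules of thumb in this article; behavior change is checked first because neither RAG nor context injection addresses it):

```python
def choose_approach(pages: int, changes_frequently: bool,
                    needs_behavior_change: bool) -> str:
    """Map the decision tree to one of the three approaches (illustrative sketch)."""
    if needs_behavior_change:
        return "fine-tuning"          # tone, format, specialized style
    if pages < 20 and not changes_frequently:
        return "context injection"    # small, stable reference document
    return "RAG"                      # factual knowledge at any scale

print(choose_approach(pages=300, changes_frequently=True, needs_behavior_change=False))  # RAG
```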

For most product and business applications, RAG is the correct answer. Reserve fine-tuning for the minority of cases with genuine behavioral customization needs.


Testing and Evaluating Your Agent

Never ship a RAG agent without evaluation:

# evaluate_rag.py
def evaluate_rag_agent(agent_fn, eval_set: list[dict]) -> dict:
    """
    Run an evaluation set against your RAG agent.

    eval_set format:
    [{"question": "...", "expected_answer": "...", "expected_source": "..."}]
    """
    results = []

    for item in eval_set:
        result = agent_fn(item["question"])
        # Simple keyword overlap metric (replace with LLM-as-judge for production)
        expected_keywords = set(item["expected_answer"].lower().split())
        actual_keywords = set(result["result"].lower().split())
        overlap = len(expected_keywords & actual_keywords) / len(expected_keywords)

        results.append({
            "question": item["question"],
            "pass": overlap > 0.4,
            "overlap_score": overlap
        })

    pass_rate = sum(r["pass"] for r in results) / len(results)
    print(f"Evaluation: {pass_rate:.0%} pass rate on {len(eval_set)} questions")
    return {"pass_rate": pass_rate, "results": results}
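The keyword-overlap metric above is deliberately crude: it measures recall only and ignores how much irrelevant text the agent produced. Token-level F1, standard in extractive QA evaluation, is a small step up and a drop-in replacement for the overlap computation (sketch):

```python
def token_f1(expected: str, actual: str) -> float:
    """Harmonic mean of token precision and recall (case-insensitive)."""
    expected_tokens = set(expected.lower().split())
    actual_tokens = set(actual.lower().split())
    common = expected_tokens & actual_tokens
    if not common:
        return 0.0
    precision = len(common) / len(actual_tokens)
    recall = len(common) / len(expected_tokens)
    return 2 * precision * recall / (precision + recall)

print(f"{token_f1('refunds within 30 days', 'we offer refunds within 30 days'):.2f}")
```

For production systems, graduate from lexical metrics to an LLM-as-judge setup, which tolerates paraphrase and checks grounding against the retrieved sources.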

For more on evaluation approaches and production deployment, see Build an AI Agent with LangChain, which covers the agent infrastructure foundation. For multi-agent systems that combine RAG agents with other specialized agents, see LangGraph Tutorial: Build Multi-Agent Workflows and AI Agent Orchestration.

The Introduction to RAG for AI Agents tutorial covers vector search and retrieval optimization in greater depth if you want to go deeper on the RAG approach.