# How to Build a Research AI Agent with Python
Research is one of the highest-value tasks you can automate with an AI agent. A research agent can autonomously search the web, read multiple sources, cross-reference facts, and produce a structured report with citations — compressing work that would take a human analyst several hours into a few minutes.
In this tutorial, you'll build a fully functional research AI agent using LangChain, the Tavily search API, and GPT-4o. By the end, you'll have a working agent that takes a research question, gathers information from multiple web sources, and returns a formatted markdown report with citations.
## What a Research AI Agent Does
A research agent follows this autonomous reasoning loop:
- Plans the research approach (what to search for, what aspects to cover)
- Searches the web using a search API to find relevant sources
- Extracts content from pages that look promising
- Synthesizes information across multiple sources, resolving contradictions
- Formats a structured output with proper citations
This is distinct from a simple search wrapper. The agent makes decisions about what to search for next based on what it has already found, and it evaluates source quality before including information in its report. See what AI agents are for more on the autonomous decision-making that makes this possible.
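Stripped of any framework, the loop above can be sketched in a few lines of plain Python. This is only an illustration: `search`, `extract`, and `synthesize` are placeholder callables standing in for the real tools built later, and the fixed policy here is exactly what the LLM replaces with its own decisions.

```python
def research(question, search, extract, synthesize, max_rounds=3):
    """Minimal sketch of the plan -> search -> extract -> synthesize loop.

    search, extract, and synthesize are placeholder callables; in the
    real agent an LLM decides each step instead of this fixed policy.
    """
    notes = []
    query = question
    for round_num in range(max_rounds):
        for result in search(query)[:2]:       # read only the top hits
            notes.append((result["url"], extract(result["url"])))
        # A real agent refines the next query based on what it just read
        query = f"{question} (aspect {round_num + 2})"
    return synthesize(question, notes)         # cross-source synthesis
```

The interesting part of a real agent is precisely what this sketch hard-codes: deciding how many rounds to run, which hits are worth reading, and when enough has been gathered.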
## Architecture Overview
```
User query
    │
    ▼
Research Agent (ReAct loop)
    │
    ├── Tavily Search Tool ──────► Search results (URLs + snippets)
    │
    ├── Web Extraction Tool ─────► Full page text
    │
    ├── Citation Formatter Tool ─► Formatted reference list
    │
    └── LLM Reasoning (GPT-4o) ──► Decision: search more, extract, or synthesize
    │
    ▼
Structured Markdown Report (with citations)
```
The AI agent framework here is LangChain's ReAct agent, which alternates between reasoning ("what should I do next?") and acting ("invoke the search tool with this query").
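Under the hood, a ReAct agent is just an LLM emitting text in a fixed `Thought:` / `Action:` / `Action Input:` format, which the framework parses into tool calls. A minimal parser for that format (a simplified stand-in for what LangChain does internally, not its actual implementation) looks like this:

```python
import re


def parse_react_step(llm_output: str):
    """Parse one ReAct-formatted step into (action, action_input).

    Simplified sketch of what an agent framework does internally;
    returns ("final", answer) when the model signals it is done.
    """
    final = re.search(r"Final Answer:\s*(.*)", llm_output, re.DOTALL)
    if final:
        return "final", final.group(1).strip()
    action = re.search(r"Action:\s*(.+)", llm_output)
    action_input = re.search(r"Action Input:\s*(.+)", llm_output)
    if action and action_input:
        return action.group(1).strip(), action_input.group(1).strip()
    raise ValueError("Could not parse ReAct step")
```

When the model's output drifts from this format, parsing fails; that is what the `handle_parsing_errors=True` flag in Step 5 guards against.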
## Prerequisites
- Python 3.10 or higher
- An OpenAI API key
- A Tavily API key (free tier at tavily.com)
- Basic understanding of LangChain (see the LangChain tutorial if you're new to it)
## Step 1: Install Dependencies
Create a virtual environment and install the required packages:
```bash
python -m venv research-agent
source research-agent/bin/activate  # Windows: research-agent\Scripts\activate

pip install langchain langchain-openai langchain-community \
    tavily-python requests beautifulsoup4 \
    python-dotenv
```
Create a `.env` file with your API keys:

```
OPENAI_API_KEY=your_openai_key_here
TAVILY_API_KEY=your_tavily_key_here
```
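Missing keys tend to surface later as confusing authentication errors, so it's worth failing fast. A small helper like this (illustrative, not part of any library) can run right after `load_dotenv()`:

```python
import os

REQUIRED_KEYS = ["OPENAI_API_KEY", "TAVILY_API_KEY"]


def missing_keys(environ=None) -> list:
    """Return the names of required API keys that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [k for k in REQUIRED_KEYS if not environ.get(k)]
```

Call it at startup and raise `SystemExit` with the missing names if the list is non-empty.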
## Step 2: Build the Search Tool
The search tool uses Tavily to find relevant sources. Tavily returns clean, structured results including URL, title, content snippet, and relevance score — making it ideal for LLM consumption:
```python
import os

from dotenv import load_dotenv
from langchain_community.tools.tavily_search import TavilySearchResults

load_dotenv()


def create_search_tool():
    """Create a Tavily search tool configured for research tasks."""
    search_tool = TavilySearchResults(
        max_results=6,
        search_depth="advanced",     # More thorough search
        include_answer=True,         # Include Tavily's AI-generated answer
        include_raw_content=False,   # We'll fetch full content separately
        include_images=False,
        name="web_search",
        description=(
            "Search the web for current information. "
            "Use this tool when you need to find facts, data, or sources on a topic. "
            "Input should be a specific search query string."
        ),
    )
    return search_tool


# Test it
if __name__ == "__main__":
    tool = create_search_tool()
    results = tool.invoke("AI agent deployment enterprise 2025")
    for r in results[:2]:
        print(r["url"])
        print(r["content"][:200])
        print("---")
```
## Step 3: Build the Web Content Extraction Tool
For sources that need deeper reading, the extraction tool fetches full page content:
```python
import requests
from bs4 import BeautifulSoup
from langchain.tools import tool


@tool
def extract_page_content(url: str) -> str:
    """
    Extract the main text content from a web page URL.

    Use this after finding a promising source in search results to read
    its full content. Input must be a valid URL string.
    """
    try:
        headers = {
            "User-Agent": (
                "Mozilla/5.0 (compatible; ResearchBot/1.0; "
                "+https://example.com/research-bot)"
            )
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, "html.parser")

        # Remove non-content elements
        for tag in soup(["script", "style", "nav", "footer",
                         "header", "aside", "form", "iframe"]):
            tag.decompose()

        # Extract main content
        main = soup.find("main") or soup.find("article") or soup.find("body")
        if not main:
            return f"ERROR: Could not find main content at {url}"

        text = main.get_text(separator="\n", strip=True)

        # Truncate to avoid excessive tokens
        max_chars = 4000
        if len(text) > max_chars:
            text = text[:max_chars] + f"\n\n[Content truncated at {max_chars} chars]"

        return f"SOURCE: {url}\n\n{text}"

    except requests.exceptions.Timeout:
        return f"ERROR: Timeout fetching {url}"
    except requests.exceptions.RequestException as e:
        return f"ERROR: Could not fetch {url}: {str(e)}"
```
## Step 4: Build the Citation Formatter Tool
This tool structures collected sources into a consistent citation format for the final report:
```python
from langchain.tools import tool

# Session-level citation registry (reset per research run)
_citation_registry = []


@tool
def add_citation(url: str, title: str, key_fact: str) -> str:
    """
    Register a source as a citation for the research report.
    Call this whenever you use information from a source.

    Args:
        url: The full URL of the source
        title: The title or headline of the source page
        key_fact: One sentence describing what this source contributed

    Returns a citation ID to use in the report text like [1], [2], etc.
    """
    # Deduplicate by URL
    existing_urls = [c["url"] for c in _citation_registry]
    if url in existing_urls:
        idx = existing_urls.index(url) + 1
        return f"Citation already registered as [{idx}]"

    citation_id = len(_citation_registry) + 1
    _citation_registry.append({
        "id": citation_id,
        "url": url,
        "title": title,
        "key_fact": key_fact,
    })
    return f"Registered as citation [{citation_id}]"


def get_formatted_citations() -> str:
    """Return all registered citations as a formatted reference list."""
    if not _citation_registry:
        return "No citations registered."
    lines = ["## Sources\n"]
    for c in _citation_registry:
        lines.append(
            f"[{c['id']}] {c['title']}\n"
            f"    URL: {c['url']}\n"
            f"    Used for: {c['key_fact']}"
        )
    return "\n\n".join(lines)


def reset_citations():
    """Clear the registry for a new research session."""
    global _citation_registry
    _citation_registry = []
```
## Step 5: Assemble the Research Agent
Now combine the tools into a ReAct agent. The system prompt is critical — it tells the agent how to use the tools and what quality standards to hold itself to:
```python
from langchain.agents import AgentExecutor, create_react_agent
from langchain.prompts import PromptTemplate
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_openai import ChatOpenAI

RESEARCH_SYSTEM_PROMPT = """You are a rigorous research assistant. Your job is to research
a topic thoroughly and produce a well-structured report with citations.

You have access to these tools:
{tools}

Tool names: {tool_names}

RESEARCH PROCESS:
1. Start with 2-3 broad searches to map the topic landscape
2. Identify the most credible and relevant sources
3. Extract full content from the 3-4 most important sources
4. Register citations for every source you use with add_citation
5. Synthesize information into a structured report

QUALITY RULES:
- Only cite sources you have actually retrieved (via web_search or extract_page_content)
- Never invent URLs or statistics
- If sources contradict each other, note the disagreement
- Prefer recent sources (published within 2 years)
- Acknowledge when information is limited or uncertain

OUTPUT FORMAT:
Your final response must be a markdown report with:
- An executive summary (3-5 sentences)
- Key findings organized by theme (use ## headings)
- Inline citations using [1], [2], etc. notation
- A limitations section (what you couldn't find or verify)

Use this format for your reasoning:

Thought: [what you're thinking]
Action: [tool name]
Action Input: [tool input]
Observation: [tool result]
... (repeat as needed)
Thought: I now have enough information to write the report.
Final Answer: [your complete research report]

Begin!

Research question: {input}
{agent_scratchpad}"""


def create_research_agent():
    llm = ChatOpenAI(model="gpt-4o", temperature=0.1)

    search_tool = TavilySearchResults(
        max_results=6,
        search_depth="advanced",
        name="web_search",
        description="Search the web. Input: search query string.",
    )

    # extract_page_content and add_citation are defined in Steps 3 and 4
    tools = [search_tool, extract_page_content, add_citation]

    prompt = PromptTemplate.from_template(RESEARCH_SYSTEM_PROMPT)
    agent = create_react_agent(llm=llm, tools=tools, prompt=prompt)

    executor = AgentExecutor(
        agent=agent,
        tools=tools,
        verbose=True,
        max_iterations=12,
        handle_parsing_errors=True,
        return_intermediate_steps=False,
    )
    return executor
```
## Step 6: Validate the Research Output
Before you trust the agent's report, add a validation pass that checks citation coverage: enough distinct sources, and agreement between the citations registered in Step 4 and the citations actually referenced in the report text:
```python
import re


def validate_research_output(report_text: str, citations: list) -> dict:
    """
    Post-process research output to validate quality.
    Returns a dict with validation results and the report.
    """
    issues = []

    # Check citation count
    if len(citations) < 3:
        issues.append(f"Low source count: only {len(citations)} sources cited")

    # Cross-check citation references in the text against the registry
    cited_in_text = set(re.findall(r"\[(\d+)\]", report_text))
    registered_ids = {str(c["id"]) for c in citations}

    uncited = registered_ids - cited_in_text
    if uncited:
        issues.append(f"Citations registered but not referenced in text: {uncited}")

    phantom = cited_in_text - registered_ids
    if phantom:
        issues.append(f"Citations referenced in text but not registered: {phantom}")

    return {
        "valid": len(issues) == 0,
        "issues": issues,
        "source_count": len(citations),
        "report": report_text,
    }
```
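To see what the cross-check catches, here is its core in isolation, run on a made-up report snippet and registry:

```python
import re

# Hypothetical report text and registered citation IDs
report = "Agents are growing fast [1]. Adoption doubled last year [3]."
registered = {"1", "2"}  # IDs the agent registered via add_citation

cited = set(re.findall(r"\[(\d+)\]", report))
print(sorted(registered - cited))  # registered but never referenced: ['2']
print(sorted(cited - registered))  # referenced but never registered: ['3']
```

The second case is the dangerous one: a `[3]` in the text with no registered source behind it is a likely hallucinated citation.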
## Step 7: Output a Structured Research Report
Wrap everything in a clean runner function:
```python
def run_research(question: str, output_file: str = None) -> str:
    """
    Run the full research pipeline on a question.

    Args:
        question: The research question to investigate
        output_file: Optional path to save the markdown report

    Returns:
        Formatted markdown research report with citations
    """
    print(f"Starting research: {question}\n{'=' * 60}")

    # Reset citations for this session
    reset_citations()

    # Run the agent
    agent = create_research_agent()
    result = agent.invoke({"input": question})
    report_text = result["output"]

    # Append formatted citations
    citations_section = get_formatted_citations()
    full_report = f"{report_text}\n\n{citations_section}"

    # Validate output
    validation = validate_research_output(report_text, _citation_registry)
    if not validation["valid"]:
        print("WARNING: Research quality issues detected:")
        for issue in validation["issues"]:
            print(f"  - {issue}")

    # Save to file if requested
    if output_file:
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(f"# Research Report\n**Question:** {question}\n\n")
            f.write(full_report)
        print(f"Report saved to {output_file}")

    return full_report


# Example usage
if __name__ == "__main__":
    report = run_research(
        question="What are the key enterprise use cases for AI agents in 2025-2026?",
        output_file="research_report.md",
    )
    print(report)
```
## Use Case Examples
**Competitor research:** "What are the key differentiators of [Competitor X]'s product offering compared to alternatives?" — The agent searches product pages, review sites, and analyst commentary to produce a competitive brief.

**Market analysis:** "What is the current state of the AI agent infrastructure market — key players, growth trends, and investor activity in 2025-2026?" — The agent synthesizes news, funding announcements, and analyst reports.

**Content research:** "What are the most commonly cited challenges when deploying LLMs in regulated industries?" — The agent gathers perspectives from industry publications, conference proceedings, and practitioner blogs.
These are exactly the kinds of tasks covered in the business use cases guide.
## Limitations and How to Handle Them
**Paywalled content.** The agent will retrieve only what is publicly accessible. For academic papers, use the arXiv API or a university library API as an additional tool.
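As a sketch of that direction: the public arXiv API is a plain HTTP endpoint that returns Atom XML, so an additional tool mostly needs to build the query URL. The helper below only constructs the URL (fetching and parsing the response are left out):

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def arxiv_query_url(terms: str, max_results: int = 5) -> str:
    """Build an arXiv API query URL for a full-text search."""
    params = {
        "search_query": f"all:{terms}",   # search across all fields
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"
```

Wrapping this with a `requests.get` call and an Atom parser, then decorating it with `@tool` as in Step 3, would give the agent access to paper abstracts.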
**Stale cached results.** Web searches may return content that has since been updated. For time-sensitive research, add date filters to your Tavily queries using the `days` parameter.
**Hallucinated citations.** The primary defense is the citation registry pattern from Step 4, which makes it much harder for the agent to cite URLs that were never actually retrieved during the session. Combine it with the validation function from Step 6, which flags citation IDs that appear in the text without a registered source.
**Rate limits.** If running many research queries, implement rate limiting between runs. Tavily's free tier allows 1,000 searches per month.
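A minimal way to space out runs is a sleep-based throttle like the sketch below (illustrative; for production, prefer a proper rate-limiting library that handles bursts and retries):

```python
import time


def throttled(func, min_interval: float):
    """Wrap func so successive calls are at least min_interval seconds apart."""
    last_call = [0.0]  # mutable cell so the wrapper can update it

    def wrapper(*args, **kwargs):
        wait = min_interval - (time.monotonic() - last_call[0])
        if wait > 0:
            time.sleep(wait)
        last_call[0] = time.monotonic()
        return func(*args, **kwargs)

    return wrapper


# e.g. throttled(run_research, 5.0) to space out full research runs
```

Note this only paces your own calls; it does not handle HTTP 429 responses from the API, which still need retry logic.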
## What to Read Next
- See how AutoGen handles multi-agent research workflows where different agents specialize in search vs. synthesis
- Read the guide to deploying AI agents in your company to take this research agent from prototype to production
- Explore AI agent platform comparisons to evaluate hosted research agent solutions
- Learn about human-in-the-loop design for cases where the research agent's output needs expert review before use