Build a Browser Automation Agent with browser-use
Traditional web automation with Playwright or Selenium requires writing explicit selectors, handling dynamic content, and maintaining brittle scripts that break whenever a website updates its design. browser-use is a Python library that solves this by combining a real browser (Playwright) with an AI model that can see the page, understand its content, and take the right action — even on sites it has never seen before.
In this tutorial you will build a browser agent that can autonomously research job listings, extract structured data, and submit forms — all guided by natural language instructions rather than hardcoded selectors.
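Under the hood, an agent like this runs an observe-decide-act loop: the model is shown the current page state, picks one action, and repeats until it declares the task done. A minimal, library-agnostic sketch of the idea (all names here are illustrative stand-ins, not browser-use's actual API):

```python
# Illustrative sketch of an observe-decide-act browser agent loop.
# `page` and `decide` are hypothetical stand-ins, not browser-use APIs.
def run_agent(task: str, page, decide, max_steps: int = 10):
    """Drive the browser until `decide` reports the task is done."""
    for _ in range(max_steps):
        state = page.snapshot()        # observe: DOM text + screenshot
        action = decide(task, state)   # the LLM picks the next action
        if action["type"] == "done":
            return action["result"]    # task finished, return the answer
        page.execute(action)           # act: click, type, scroll, navigate
    return None                        # step budget exhausted
```

Because every action is chosen from the live page state rather than a pre-written script, the loop keeps working when the layout shifts.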
What You'll Learn
- How to install and configure browser-use with different AI model backends
- How to run basic and multi-step browser automation tasks
- How to extract structured data from web pages
- How to handle authentication, forms, and multi-page workflows
- How to add custom browser actions to extend the agent's capabilities
- How to run agents headlessly in production environments
Prerequisites
- Python 3.11 or higher installed (browser-use requires 3.11+)
- An OpenAI or Anthropic API key
- Basic understanding of AI agents and web automation concepts
- No Playwright knowledge required — browser-use handles the browser layer
Step 1: Project Setup
```bash
mkdir browser-agent-demo && cd browser-agent-demo
python -m venv .venv && source .venv/bin/activate

# Install browser-use and a model provider
pip install browser-use langchain-openai python-dotenv

# Install the Playwright browser binaries
playwright install chromium
```
Create a `.env` file:

```bash
OPENAI_API_KEY=sk-...your-key...

# Or for Anthropic:
# ANTHROPIC_API_KEY=sk-ant-...
```
Step 2: Your First Browser Agent
A minimal browser-use agent needs only a task description and a model:
```python
# simple_agent.py
import asyncio

from dotenv import load_dotenv
from browser_use import Agent
from langchain_openai import ChatOpenAI

load_dotenv()


async def main():
    agent = Agent(
        task="Go to reddit.com/r/MachineLearning and find the top post from today. "
        "Return the title, author, and score.",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    result = await agent.run()
    print(result)


if __name__ == "__main__":
    asyncio.run(main())
```
Run it:

```bash
python simple_agent.py
```
A Chrome window will open, the agent will navigate to Reddit, identify the top post, and return the structured result — all without a single CSS selector.
Step 3: Structured Data Extraction
For production use cases you often need reliably structured output. Use Pydantic models to define the exact shape of data you want extracted:
```python
# structured_extraction.py
import asyncio

from dotenv import load_dotenv
from pydantic import BaseModel
from browser_use import Agent
from browser_use.browser.browser import Browser, BrowserConfig
from langchain_openai import ChatOpenAI

load_dotenv()


class JobListing(BaseModel):
    """Structured representation of a job listing."""

    title: str
    company: str
    location: str
    salary_range: str | None
    job_type: str  # full-time, part-time, contract
    posted_date: str
    key_requirements: list[str]
    apply_url: str


class JobSearchResults(BaseModel):
    """Container for multiple job listings."""

    search_query: str
    total_found: int
    listings: list[JobListing]


async def search_jobs():
    # Configure the browser for production use
    browser = Browser(
        config=BrowserConfig(
            headless=True,  # No visible browser window
            disable_security=False,  # Keep security features on
            extra_chromium_args=["--no-sandbox"],
        )
    )
    agent = Agent(
        task="""Go to linkedin.com/jobs and search for 'AI Engineer' jobs in 'San Francisco, CA'.
        Extract details for the first 5 job listings including: title, company, location,
        salary range (if shown), job type, and posting date.
        Also save the URL for each job's apply button.""",
        llm=ChatOpenAI(model="gpt-4o"),
        browser=browser,
    )
    result = await agent.run()
    # The result is a string — you can parse it or use output_model for typed results
    print("Extracted Job Listings:")
    print(result)
    await browser.close()


if __name__ == "__main__":
    asyncio.run(search_jobs())
```
Step 4: Multi-Step Web Workflows
browser-use excels at multi-step tasks that involve navigating across multiple pages, filling forms, and maintaining context throughout:
```python
# multi_step_workflow.py
import asyncio

from dotenv import load_dotenv
from browser_use import Agent, Controller
from browser_use.browser.browser import Browser, BrowserConfig
from langchain_openai import ChatOpenAI

load_dotenv()


async def research_and_compare():
    """Multi-step research workflow across multiple websites."""
    browser = Browser(config=BrowserConfig(headless=False))

    # A Controller lets you add custom actions and capture intermediate state
    controller = Controller()

    @controller.action("Save research note")
    def save_note(note: str) -> str:
        """Save a research note for later compilation.

        Args:
            note: The research note text to save.
        """
        # In production, write to a database or file
        print(f"[NOTE SAVED]: {note}")
        return f"Note saved: {note[:50]}..."

    agent = Agent(
        task="""Research task: Compare the Python packages 'httpx' and 'requests'.
        Steps:
        1. Go to pypi.org/project/httpx and note the download stats and key features
        2. Go to pypi.org/project/requests and note the download stats and key features
        3. Visit the GitHub repositories for both and check star counts and recent activity
        4. Use the 'Save research note' action to record key findings for each library
        5. Provide a final comparison summary
        Be thorough and accurate — only report what you actually see on the pages.""",
        llm=ChatOpenAI(model="gpt-4o"),
        browser=browser,
        controller=controller,
    )
    result = await agent.run(max_steps=25)
    print("\n=== Final Research Summary ===")
    print(result)
    await browser.close()


if __name__ == "__main__":
    asyncio.run(research_and_compare())
```
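The decorator above follows a common registry pattern: each function is recorded under a human-readable description, and the model is told which descriptions it may invoke. A stdlib-only sketch of that mechanism (ActionRegistry is illustrative, not browser-use's actual implementation):

```python
class ActionRegistry:
    """Toy controller: maps action descriptions to Python functions."""

    def __init__(self):
        self._actions = {}

    def action(self, description: str):
        """Decorator that registers a function under a description."""
        def register(fn):
            self._actions[description] = fn
            return fn  # leave the function callable directly too
        return register

    def dispatch(self, description: str, **kwargs):
        """Call the action the model selected, by its description."""
        return self._actions[description](**kwargs)

    def available(self) -> list[str]:
        """Descriptions advertised to the model in its prompt."""
        return sorted(self._actions)


registry = ActionRegistry()


@registry.action("Save research note")
def save_note(note: str) -> str:
    return f"Note saved: {note[:50]}"
```

When the model outputs an action name plus arguments, the real controller does essentially what dispatch does here: look up the function and call it with the model-supplied keyword arguments.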
Step 5: Handling Authentication and Sessions
Many real-world tasks require logging in first. browser-use supports persistent browser contexts that preserve cookies and authentication state:
```python
# authenticated_agent.py
import asyncio
from pathlib import Path

from dotenv import load_dotenv
from browser_use import Agent
from browser_use.browser.browser import Browser, BrowserConfig
from browser_use.browser.context import BrowserContextConfig
from langchain_anthropic import ChatAnthropic  # Claude this time: pip install langchain-anthropic

load_dotenv()

# Path for browser session state (cookies, localStorage)
SESSION_PATH = Path("./browser_session")
SESSION_PATH.mkdir(exist_ok=True)


async def run_with_session():
    """Run an agent that preserves login state across runs."""
    session_file = SESSION_PATH / "session.json"
    browser = Browser(
        config=BrowserConfig(
            headless=False,
            new_context_config=BrowserContextConfig(
                # Persist the session to disk so you only log in once
                storage_state=str(session_file) if session_file.exists() else None,
                save_storage_state=True,
            ),
        )
    )
    # First run: the agent navigates to the site and handles login if needed
    agent = Agent(
        task="""Go to github.com. If you're not logged in, go to github.com/login
        and log in with the credentials: username 'demo_user', password 'demo_pass'.
        After logging in, go to your notifications page and summarize the first 3 notifications.""",
        llm=ChatAnthropic(model="claude-3-5-sonnet-20241022"),
        browser=browser,
        # Sensitive — don't record actions that might contain credentials
        generate_gif=False,
    )
    result = await agent.run()
    print(result)

    # Save the session state for the next run
    context = await browser.new_context()
    await context.save_storage_state(str(session_file))
    await browser.close()


if __name__ == "__main__":
    asyncio.run(run_with_session())
```
Step 6: Custom Browser Actions
Extend the agent's default capabilities with custom actions using the @controller.action decorator:
```python
# custom_actions.py
import asyncio
import csv
from datetime import datetime
from pathlib import Path

from dotenv import load_dotenv
from browser_use import Agent, Controller
from browser_use.browser.browser import Browser, BrowserConfig
from langchain_openai import ChatOpenAI

load_dotenv()

controller = Controller()
collected_data = []


@controller.action("Save product data to CSV")
def save_product_to_csv(
    name: str,
    price: str,
    rating: str,
    review_count: str,
    url: str,
) -> str:
    """Save a product's data to the collection for CSV export.

    Args:
        name: Product name.
        price: Product price as displayed on the page.
        rating: Star rating (e.g., '4.5 out of 5').
        review_count: Number of reviews.
        url: Product page URL.
    """
    collected_data.append({
        "name": name,
        "price": price,
        "rating": rating,
        "review_count": review_count,
        "url": url,
        "scraped_at": datetime.now().isoformat(),
    })
    return f"Saved: {name} at {price}"


@controller.action("Export data to file")
def export_to_csv(filename: str = "products.csv") -> str:
    """Export all collected product data to a CSV file.

    Args:
        filename: Output filename for the CSV.
    """
    if not collected_data:
        return "No data collected yet."
    output_path = Path(filename)
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=collected_data[0].keys())
        writer.writeheader()
        writer.writerows(collected_data)
    return f"Exported {len(collected_data)} products to {output_path}"


async def scrape_products():
    browser = Browser(config=BrowserConfig(headless=True))
    agent = Agent(
        task="""Go to books.toscrape.com (a site built for scraping practice).
        Navigate through the first 2 pages of books.
        For each book, use 'Save product data to CSV' to record: name, price, rating, review count, and URL.
        After processing all books, use 'Export data to file' to save as 'books_data.csv'.""",
        llm=ChatOpenAI(model="gpt-4o"),
        browser=browser,
        controller=controller,
    )
    await agent.run(max_steps=40)
    await browser.close()
    print(f"\nFinal dataset: {len(collected_data)} products collected")


if __name__ == "__main__":
    asyncio.run(scrape_products())
```
Step 7: Production Configuration
For production deployments, configure browser-use for reliability and observability:
```python
# production_config.py
from browser_use import Agent
from browser_use.browser.browser import Browser, BrowserConfig
from browser_use.browser.context import BrowserContextConfig
from langchain_openai import ChatOpenAI


def create_production_agent(task: str) -> Agent:
    """Create a production-configured browser agent."""
    browser = Browser(
        config=BrowserConfig(
            headless=True,
            disable_security=False,
            extra_chromium_args=[
                "--no-sandbox",
                "--disable-dev-shm-usage",  # Required in Docker
                "--disable-gpu",  # Required on headless servers
                "--window-size=1920,1080",
            ],
            new_context_config=BrowserContextConfig(
                viewport={"width": 1920, "height": 1080},
                user_agent="Mozilla/5.0 (compatible; ResearchBot/1.0)",
                java_script_enabled=True,
                accept_downloads=False,  # Disable file downloads for security
            ),
        )
    )
    return Agent(
        task=task,
        llm=ChatOpenAI(
            model="gpt-4o",
            temperature=0,  # Reduce randomness for more consistent runs
            timeout=60,
        ),
        browser=browser,
        max_failures=3,  # Retry limit per action
        retry_delay=2,  # Seconds between retries
        generate_gif=False,  # Disable GIF recording in production
    )
```
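Even with max_failures set, entire runs can still fail (network drops, model timeouts), so production deployments often wrap the whole run in a retry loop. A generic asyncio sketch that assumes nothing about browser-use beyond an awaitable factory (run_with_retry and make_run are our own names):

```python
import asyncio


async def run_with_retry(make_run, attempts: int = 3, delay: float = 2.0):
    """Await make_run() up to `attempts` times, sleeping `delay` seconds between tries."""
    last_error = None
    for attempt in range(attempts):
        try:
            return await make_run()
        except Exception as exc:  # in production, catch narrower exception types
            last_error = exc
            if attempt < attempts - 1:
                await asyncio.sleep(delay)
    raise last_error
```

You would pass a factory such as `lambda: create_production_agent(task).run()` so that each attempt gets a fresh agent and browser rather than reusing a possibly broken one.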
What's Next
You have built a capable browser automation agent that works without brittle selectors. Recommended next steps:
- Computer use agent: For full desktop automation (not just browsers), see the computer use agent tutorial
- LangChain tools: Learn how LangChain's web tools compare to browser-use
- Playwright directly: If you need deterministic automation, learn pure Playwright for cases where AI decision-making adds unnecessary latency
- MCP integration: See connecting agents to MCP servers to expose browser-use capabilities as MCP tools
- Tool use patterns: Read the tool use glossary entry for context on how the agent decides which browser actions to take
Frequently Asked Questions
How does browser-use differ from traditional Playwright automation?
Traditional Playwright requires explicit CSS selectors, XPaths, or text matchers that break when website layouts change. browser-use uses a vision-language model to understand the current page state visually and semantically, then selects actions based on intent. This makes it far more robust to UI changes but introduces AI latency and non-determinism.
Which AI models work best with browser-use?
gpt-4o and claude-3-5-sonnet are the most reliable choices due to their strong visual understanding capabilities. gpt-4o-mini works for simpler tasks but struggles with complex layouts or ambiguous UI states. Gemini 2.0 Flash is a cost-effective alternative for straightforward navigation tasks.
Can browser-use handle JavaScript-heavy single-page applications?
Yes. Since browser-use uses a real Chromium browser, JavaScript rendering is handled natively. The agent waits for page loads and can interact with dynamically loaded content. For very slow SPAs, increase the action timeout in BrowserConfig.
Is browser-use appropriate for high-volume production scraping?
For high-volume (1000+ pages per day), browser-use is more expensive and slower than traditional scrapers due to AI inference costs. It is best suited for: complex interactive workflows, sites that actively block traditional scrapers, tasks requiring judgment about which elements to interact with, and low-to-medium volume data extraction.
How do I handle bot detection and CAPTCHAs?
browser-use does not bypass bot detection or CAPTCHAs by design — solving CAPTCHAs may violate terms of service. If you encounter a CAPTCHA, the agent will typically stop and report it. Provide a human operator callback or integrate a CAPTCHA service for legitimate workflows that encounter them.