Build a Computer Use Agent with Claude
Computer use is one of the most remarkable capabilities in modern AI: an agent that can see your screen, move the mouse, type text, click buttons, and navigate applications — just like a human operator would. Anthropic's Claude 3.5 Sonnet and later models support this through the computer use API, enabling truly general desktop automation.
Unlike traditional RPA (Robotic Process Automation) tools that require brittle selector-based scripts, a computer use agent understands the visual context of any screen and can adapt to UI changes, unexpected dialogs, and novel applications without reprogramming.
In this tutorial you will build a computer use agent that can automate web research tasks: open a browser, navigate to websites, extract information, fill forms, and save results to a file — all by seeing and interacting with the actual screen.
What You'll Learn#
- How computer use works under the hood (screenshot → model → action loop)
- How to set up a safe sandboxed environment for computer use
- How to implement the computer use action types: screenshot, mouse, keyboard, bash
- How to build the control loop that drives the agent
- How to add safety guardrails and human-in-the-loop approval
- Important security considerations for production deployments
Prerequisites#
- Python 3.10 or higher installed
- An Anthropic API key with computer use access
- pyautogui or xdotool for desktop control (Linux/macOS)
- Basic understanding of AI agents and agentic loops
- Important: Run this in a sandboxed VM or Docker container, never on your primary machine
Step 1: Safety First — Environment Setup#
Before writing a single line of code, set up a sandboxed environment. Computer use agents have full control over whatever machine they run on. Never run them on your primary computer with access to sensitive files, credentials, or production systems.
Recommended setups:
- A Docker container (Anthropic provides an official computer use Docker image)
- A dedicated virtual machine with a fresh OS install
- A cloud VM with no sensitive data or credentials
# Option A: Use Anthropic's official computer use demo image (recommended)
docker pull ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
docker run -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
    -p 5900:5900 -p 6080:6080 -p 8080:8080 -p 8501:8501 \
    -it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
# Option B: Install dependencies on a fresh VM
pip install anthropic pyautogui pillow python-dotenv
# On Linux, also install: xdotool scrot
Create .env:
ANTHROPIC_API_KEY=sk-ant-...your-key...
Step 2: Implement the Computer Use Tools#
The Anthropic computer use API defines a set of Anthropic-managed tool types (computer, bash, and text editor). This tutorial implements the computer and bash tools; you supply the execution logic for each:
# computer_tools.py
import subprocess
import base64
import io
import os
from PIL import Image
def take_screenshot() -> str:
"""Capture the current screen state and return as base64-encoded PNG.
Returns:
Base64-encoded PNG string of the current screen.
"""
try:
import pyautogui
screenshot = pyautogui.screenshot()
# Resize to a reasonable resolution for the model
screenshot = screenshot.resize((1280, 800), Image.LANCZOS)
buffer = io.BytesIO()
screenshot.save(buffer, format="PNG")
return base64.standard_b64encode(buffer.getvalue()).decode("utf-8")
except Exception as e:
raise RuntimeError(f"Screenshot failed: {e}")
def mouse_move(x: int, y: int) -> None:
"""Move the mouse cursor to screen coordinates (x, y)."""
import pyautogui
pyautogui.moveTo(x, y, duration=0.3)
def mouse_click(x: int, y: int, button: str = "left", click_type: str = "click") -> None:
"""Click at screen coordinates.
Args:
x: X coordinate on screen.
y: Y coordinate on screen.
button: 'left', 'right', or 'middle'.
click_type: 'click' for single, 'double_click' for double.
"""
import pyautogui
if click_type == "double_click":
pyautogui.doubleClick(x, y, button=button)
else:
pyautogui.click(x, y, button=button)
def type_text(text: str) -> None:
"""Type text using the keyboard.
Args:
text: The text string to type.
"""
import pyautogui
pyautogui.typewrite(text, interval=0.05)
def press_key(key: str) -> None:
"""Press a special keyboard key.
Args:
key: Key name (e.g., 'enter', 'tab', 'escape', 'ctrl+c').
"""
    import pyautogui
    if "+" in key:
        pyautogui.hotkey(*key.split("+"))
    else:
        pyautogui.press(key)
def run_bash_command(command: str) -> str:
"""Execute a shell command and return stdout.
Args:
command: Shell command to execute.
Returns:
Command stdout as a string.
"""
# IMPORTANT: In production, add strict command allowlist validation
result = subprocess.run(
command,
shell=True,
capture_output=True,
text=True,
timeout=30,
)
output = result.stdout
if result.returncode != 0:
output += f"\nSTDERR: {result.stderr}"
return output
Step 3: Build the Agent Control Loop#
The core of a computer use agent is the agentic loop: take screenshot → send to model → model returns action → execute action → repeat until done.
# computer_agent.py
import anthropic
import base64
import os
from dotenv import load_dotenv
from computer_tools import (
take_screenshot,
mouse_move,
mouse_click,
type_text,
press_key,
run_bash_command,
)
load_dotenv()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
# Define the computer use tools in Anthropic's format
COMPUTER_USE_TOOLS = [
{
"type": "computer_20241022",
"name": "computer",
"display_width_px": 1280,
"display_height_px": 800,
"display_number": 1,
},
{
"type": "bash_20241022",
"name": "bash",
},
]
def execute_computer_action(tool_use_block) -> dict:
    """Execute a computer use action from the model's tool call.
    Args:
        tool_use_block: The tool use block from Claude's response.
    Returns:
        A tool_result dict to send back to the model.
    """
tool_name = tool_use_block.name
tool_input = tool_use_block.input
if tool_name == "computer":
action = tool_input.get("action")
if action == "screenshot":
screenshot_b64 = take_screenshot()
# Return the screenshot as an image content block
return {
"type": "tool_result",
"tool_use_id": tool_use_block.id,
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/png",
"data": screenshot_b64,
},
}
],
}
elif action == "mouse_move":
mouse_move(tool_input["coordinate"][0], tool_input["coordinate"][1])
return {
"type": "tool_result",
"tool_use_id": tool_use_block.id,
"content": "Mouse moved successfully.",
}
elif action in ("left_click", "right_click", "double_click", "middle_click"):
button_map = {
"left_click": "left", "right_click": "right",
"middle_click": "middle", "double_click": "left"
}
click_type = "double_click" if action == "double_click" else "click"
mouse_click(
tool_input["coordinate"][0],
tool_input["coordinate"][1],
button=button_map[action],
click_type=click_type,
)
return {
"type": "tool_result",
"tool_use_id": tool_use_block.id,
"content": f"{action} performed successfully.",
}
elif action == "type":
type_text(tool_input["text"])
return {
"type": "tool_result",
"tool_use_id": tool_use_block.id,
"content": "Text typed successfully.",
}
elif action == "key":
press_key(tool_input["key"])
return {
"type": "tool_result",
"tool_use_id": tool_use_block.id,
"content": f"Key '{tool_input['key']}' pressed.",
}
elif tool_name == "bash":
output = run_bash_command(tool_input["command"])
return {
"type": "tool_result",
"tool_use_id": tool_use_block.id,
"content": output or "(no output)",
}
return {
"type": "tool_result",
"tool_use_id": tool_use_block.id,
"content": f"Unknown action: {tool_name}",
"is_error": True,
}
Step 4: The Main Agent Loop with Human Approval#
# run_agent.py
import time
from anthropic import Anthropic
from anthropic.types.beta import BetaMessage
from computer_agent import execute_computer_action, COMPUTER_USE_TOOLS
client = Anthropic()
SYSTEM_PROMPT = """You are a computer use agent with access to a desktop environment.
You can see the screen via screenshots and interact with applications using mouse, keyboard, and bash commands.
Guidelines:
- Always take a screenshot first to understand the current screen state
- Take screenshots after each action to confirm the result
- Be careful with destructive actions — ask for confirmation before deleting files
- If you encounter an error, take a screenshot and reassess
- Complete tasks efficiently and accurately
Current environment: Linux desktop with Firefox browser and terminal available."""
def run_computer_agent(task: str, max_steps: int = 30, require_approval: bool = True):
"""Run the computer use agent on a given task.
Args:
task: Natural language description of the task to complete.
max_steps: Maximum number of action steps before stopping.
require_approval: If True, prompt user before each action.
"""
messages = [{"role": "user", "content": task}]
step = 0
print(f"\nStarting computer use agent for task:\n{task}\n{'='*60}\n")
while step < max_steps:
step += 1
print(f"Step {step}/{max_steps}")
# Call Claude with computer use enabled
response = client.beta.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=4096,
tools=COMPUTER_USE_TOOLS,
messages=messages,
betas=["computer-use-2024-10-22"],
system=SYSTEM_PROMPT,
)
print(f"Stop reason: {response.stop_reason}")
# Add Claude's response to message history
messages.append({
"role": "assistant",
"content": response.content,
})
        # If Claude is done, print the final answer and exit the function
        # (a bare `break` here would only exit the inner for-loop and the
        # while-loop would call the API again with no new input)
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    print(f"\n=== Task Complete ===\n{block.text}")
            return
# Process tool use blocks
tool_results = []
for block in response.content:
if block.type == "tool_use":
print(f"\nAction requested: {block.name} → {block.input}")
# Human-in-the-loop approval gate
if require_approval:
approval = input("Approve this action? [y/n/q(quit)]: ").strip().lower()
if approval == "q":
print("Stopping agent.")
return
if approval != "y":
print("Action skipped.")
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": "Action was rejected by the user.",
"is_error": True,
})
continue
# Execute the action
result = execute_computer_action(block)
tool_results.append(result)
# Small delay to let UI settle
time.sleep(0.5)
if tool_results:
messages.append({
"role": "user",
"content": tool_results,
})
if step >= max_steps:
print(f"Reached maximum steps ({max_steps}). Stopping agent.")
if __name__ == "__main__":
run_computer_agent(
task="Open Firefox, navigate to news.ycombinator.com (Hacker News), "
"find the top 3 posts, and save their titles and URLs to a file called hacker_news_top3.txt on the desktop.",
require_approval=True, # Set to False for fully autonomous operation
)
Step 5: Security Guardrails#
Production computer use agents require strict security controls:
# security.py
import re
from typing import Optional
# Commands that are never allowed
BLOCKED_COMMANDS = [
r"rm\s+-rf\s+/", # Wipe filesystem
r"sudo\s+rm", # Sudo remove
r":\(\)\{.*\};:", # Fork bomb
r"mkfs\.", # Format disk
r"dd\s+if=.*of=/dev", # Overwrite disk
r"curl.*\|\s*sh", # Pipe to shell
r"wget.*\|\s*sh", # Pipe to shell
r"chmod\s+777", # World-writable
]
def is_safe_command(command: str) -> tuple[bool, Optional[str]]:
"""Check if a bash command is safe to execute.
Args:
command: Shell command string.
Returns:
Tuple of (is_safe, reason_if_blocked).
"""
for pattern in BLOCKED_COMMANDS:
if re.search(pattern, command, re.IGNORECASE):
return False, f"Command matches blocked pattern: {pattern}"
# Additional checks for the specific environment
if "/etc/passwd" in command or "/etc/shadow" in command:
return False, "Access to sensitive system files is blocked"
if "~/.ssh" in command or ".aws" in command:
return False, "Access to credential files is blocked"
return True, None
def sanitize_screenshot_for_logging(screenshot_b64: str) -> str:
"""Truncate screenshot data in logs to avoid filling log files."""
return f"[screenshot:{len(screenshot_b64)} bytes]"
What's Next#
You now have a foundation for building computer use agents. Key next steps:
- Browser automation alternative: For web-only automation, building a browser use agent is safer and more reliable
- LangChain agent: Learn to build with LangChain for tool-based agents that don't require screen access
- OpenAI Agents SDK: Compare with the OpenAI Agents SDK approach for structured tool calling
- Understand agent types: Read the AI agents glossary entry for context on where computer use fits in the agent taxonomy
- MCP integration: See connecting agents to MCP servers for extending computer use agents with external capabilities
Frequently Asked Questions#
Which Claude models support computer use?
Computer use is supported on Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) and newer Sonnet and Opus models, including Claude 3.7 Sonnet; Claude 3.5 Haiku does not support the computer use tool. Claude 3.7 Sonnet with extended thinking provides the most reliable computer use performance for complex multi-step tasks.
What screen resolution should I use?
Anthropic recommends keeping screenshots at approximately 1280x800 pixels or smaller. Higher resolutions increase token costs (images are priced by pixel count) and can cause the model to miss UI elements. The model was trained at typical monitor resolutions, so stay within common sizes.
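Anthropic's vision documentation gives a rough rule of thumb of tokens ≈ (width × height) / 750 for an image, which you can use to budget screenshot costs before running a session:

```python
def estimate_image_tokens(width: int, height: int) -> int:
    # Rough rule of thumb from Anthropic's vision docs: tokens ~ (w * h) / 750
    return round((width * height) / 750)

# A 1280x800 screenshot is roughly 1,365 input tokens, so a 30-step
# session with one screenshot per step spends ~41,000 tokens on
# screenshots alone, before any text in the conversation.
per_shot = estimate_image_tokens(1280, 800)
print(per_shot)       # 1365
print(per_shot * 30)  # 40950
```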
How do I handle CAPTCHAs and human verification checks?
Computer use agents will encounter CAPTCHAs. Claude will recognize them and typically stop and ask for human intervention. Build a notification system that alerts a human operator when intervention is needed, then resumes the agent after the human completes the challenge.
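One way to structure that handoff is a blocking intervention hook: notify the operator, poll until they signal the challenge is solved, then resume or time out. A minimal sketch, where `notify_operator` is a placeholder for your real alerting channel (Slack, email, pager):

```python
import time

def notify_operator(message: str) -> None:
    # Placeholder: swap in Slack, email, or pager integration
    print(f"[ALERT] {message}")

def wait_for_human(check_done, poll_seconds: float = 5.0, timeout: float = 600.0) -> bool:
    """Block until the operator signals completion, or time out.
    Args:
        check_done: Zero-arg callable returning True once the human is done.
    Returns:
        True if the agent may resume, False if the wait timed out.
    """
    notify_operator("CAPTCHA encountered - human intervention required")
    waited = 0.0
    while waited < timeout:
        if check_done():
            return True  # human solved the challenge; agent can resume
        time.sleep(poll_seconds)
        waited += poll_seconds
    return False  # timed out; abort the task

# Example: resume immediately once the check passes
assert wait_for_human(lambda: True) is True
```

In practice `check_done` might poll a shared flag set by a "resume" button in your operator dashboard.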
Is computer use safe for production automation?
Use computer use in production only with: (1) a strictly sandboxed environment with no access to sensitive systems, (2) human-in-the-loop approval for destructive or irreversible actions, (3) comprehensive audit logging of all actions, and (4) rate limiting and kill switches. Never give a computer use agent access to banking, email, or production databases.
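The kill switch in (4) can be as simple as a sentinel file the agent loop checks before every step; an operator touches the file to halt the agent immediately, with no need to reach the process itself. A sketch (the path and `run_step` shape are illustrative, not part of the tutorial's code):

```python
import os

KILL_SWITCH_PATH = "/tmp/agent_kill_switch"  # illustrative path

def kill_switch_engaged() -> bool:
    # Operator runs `touch /tmp/agent_kill_switch` to stop the agent
    return os.path.exists(KILL_SWITCH_PATH)

def run_step(step: int) -> bool:
    """Return False to signal that the agent loop should stop."""
    if kill_switch_engaged():
        print(f"Kill switch engaged at step {step}; halting.")
        return False
    # ... take screenshot, call the model, execute the action ...
    return True
```

In `run_computer_agent`, calling a check like this at the top of each `while` iteration gives operators a way to stop a misbehaving agent between actions, complementing the per-action approval gate.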
What are the costs for computer use?
Computer use costs more than standard Claude API calls because screenshots are large images. A typical computer use session with 20-30 steps uses 50,000-200,000 input tokens. To control costs, keep screenshots at or below 1280x800, cap max_steps, and only take screenshots when the screen state has actually changed.