How is Anthropic's Computer Use implemented?

With Anthropic's computer use tool, Claude requests screenshots, analyzes them to understand the current state of the screen, and generates actions (mouse clicks, keyboard input, scrolling) that the developer's code executes. The agent loops between observation and action until the task is complete.

What are the risks of AI computer use?

Key risks include the agent taking unintended actions, making irreversible changes (deleting files, submitting forms), being manipulated by malicious content on screen, and accessing unauthorized information. Always run computer-use agents in sandboxed or monitored environments.

What Is Computer Use in AI Agents?

Q: What is computer use in AI?

Computer use refers to the ability of an AI agent to control a computer's graphical user interface — moving the mouse, clicking buttons, typing text, and interpreting screenshots — to complete tasks that normally require human-computer interaction.

Quick Definition#

Computer use is the capability of an AI agent to interact with a computer's visual interface — taking screenshots, moving a cursor, clicking UI elements, typing text, and observing results — the way a human operator would. Unlike traditional tool calling, which requires API integrations, computer use lets an agent work with any application that has a screen, with no custom integration code required.

For broader context on what AI agents can do, start with What Are AI Agents? then explore Browser Use as a related but more constrained capability. For a complete index of AI agent terminology, see the AI Agent Glossary.

Why Computer Use Matters#

Much enterprise software has no usable API. Legacy ERPs, government portals, internal desktop tools, and proprietary SaaS platforms often expose their functionality only through a visual interface. For years this was a hard barrier to automation: either the application offered an API, or robust automation was out of reach.

Computer use changes this equation. An AI agent with screen access can:

Navigate software that predates APIs
Complete forms and workflows in applications that resist API integration
Observe and act on any visual output — error dialogs, modal windows, dynamic content
Combine multiple applications in a workflow, switching between desktop apps the same way a human would

This capability is particularly valuable in industries with legacy infrastructure: healthcare (EMR systems), logistics (tracking portals), government (regulatory filing interfaces), and finance (desktop trading platforms).

How Computer Use Works#

The fundamental loop of a computer use agent:

Capture: Take a screenshot of the current screen state
Perceive: Send the screenshot to a multimodal LLM that can understand visual UI elements
Reason: The model decides what action achieves the current task goal
Act: Execute the action — mouse move, click, keyboard input, or scroll
Observe: Capture another screenshot to verify the action's effect
Iterate: Continue until the task is complete or a stop condition is reached

The developer provides the infrastructure for capturing screenshots and executing mouse/keyboard actions. The LLM provides the visual understanding and decision-making.
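The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `capture`, `decide`, and `execute` are hypothetical callables standing in for the screenshot code, the model call, and the mouse/keyboard executor, and the convention that `decide` returns `None` when the task is finished is an assumption of this sketch.

```python
from typing import Callable, Optional

# Hypothetical action shape for this sketch,
# e.g. {"action": "left_click", "coordinate": [640, 400]}
Action = dict


def run_agent_loop(
    capture: Callable[[], bytes],
    decide: Callable[[bytes], Optional[Action]],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> list[Action]:
    """Run capture -> decide -> execute until decide() returns None.

    Returns the actions that were executed. Raises RuntimeError if the
    step budget runs out before the model signals completion.
    """
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = capture()        # Capture / Observe
        action = decide(screenshot)   # Perceive + Reason
        if action is None:            # assumed "task complete" signal
            return history
        execute(action)               # Act
        history.append(action)
    raise RuntimeError(f"stopped after {max_steps} steps without completing")
```

The `max_steps` budget is the stop condition: without it, a model that never declares the task finished would loop indefinitely.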

The Action Space#

A computer use agent's action space includes:

Action	Description
`screenshot`	Capture current screen state
`mouse_move`	Move cursor to coordinates (x, y)
`left_click`	Click at current or specified position
`right_click`	Open context menu
`double_click`	Double-click to open items
`type`	Type text string
`key`	Press keyboard shortcuts (e.g., `ctrl+c`)
`scroll`	Scroll in a direction at position
`drag`	Click-and-drag between coordinates
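Before executing anything, it can help to check that a proposed action is well formed. The sketch below uses the action names from the table. The `Action` class, its field names, and the rules about which actions need coordinates or text are assumptions made for illustration, not a published schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Action names come from the table; the argument rules are assumptions.
NEEDS_COORDINATE = {"mouse_move", "scroll", "drag"}
OPTIONAL_COORDINATE = {"left_click", "right_click", "double_click"}
NEEDS_TEXT = {"type", "key"}
NO_ARGS = {"screenshot"}
ALL_ACTIONS = NEEDS_COORDINATE | OPTIONAL_COORDINATE | NEEDS_TEXT | NO_ARGS


@dataclass(frozen=True)
class Action:
    name: str
    coordinate: Optional[Tuple[int, int]] = None      # target, or drag start
    end_coordinate: Optional[Tuple[int, int]] = None  # drag end only
    text: Optional[str] = None


def validate(action: Action, width: int, height: int) -> None:
    """Raise ValueError if the action is unknown, incomplete, or off-screen."""
    if action.name not in ALL_ACTIONS:
        raise ValueError(f"unknown action: {action.name}")
    if action.name in NEEDS_COORDINATE and action.coordinate is None:
        raise ValueError(f"{action.name} requires a coordinate")
    if action.name == "drag" and action.end_coordinate is None:
        raise ValueError("drag requires an end coordinate")
    if action.name in NEEDS_TEXT and not action.text:
        raise ValueError(f"{action.name} requires text")
    for point in (action.coordinate, action.end_coordinate):
        if point is not None:
            x, y = point
            if not (0 <= x < width and 0 <= y < height):
                raise ValueError(f"coordinate {point} is off-screen")
```

For example, `validate(Action("left_click", (100, 200)), 1280, 800)` passes silently, while `validate(Action("type"), 1280, 800)` raises because no text was supplied.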

Claude Computer Use: A Concrete Implementation#

Anthropic released computer use as a public beta for Claude in October 2024. Claude models with this capability treat screenshots as visual input and output structured action commands:

import anthropic
import base64

client = anthropic.Anthropic()

# Provide current screen state as an image
with open("screenshot.png", "rb") as f:
    screenshot_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

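# Computer use is a beta tool, so the request goes through the beta namespace
# with a matching beta flag. The tool type and flag are versioned and tied to
# specific models; confirm the pair for your model in Anthropic's documentation.
response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],
    tools=[
        {
            "type": "computer_20241022",
            "name": "computer",
            "display_width_px": 1280,
            "display_height_px": 800,
            "display_number": 1,
        }
    ],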
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": screenshot_b64,
                },
            },
            {
                "type": "text",
                "text": "Click on the 'Submit' button in the form."
            }
        ]
    }],
)

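Claude returns a `tool_use` content block whose input names the action and its target (for example, `left_click` with an `[x, y]` coordinate). The developer executes it with a library such as pyautogui, ideally inside a sandboxed virtual machine, then sends the resulting screenshot back to Claude as a `tool_result`.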
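Handling the response might look like the sketch below. In the Messages API, the action arrives as a content block of type `tool_use`, and the next screenshot goes back in a `tool_result` block that references its `id`. The sketch assumes the response content has already been converted to plain dicts; the helper names `extract_action` and `tool_result_message` and the sample values are invented for illustration.

```python
from typing import Optional


def extract_action(content: list[dict]) -> Optional[dict]:
    """Return the first tool_use block in a response, or None if there is none."""
    for block in content:
        if block.get("type") == "tool_use":
            return block
    return None


def tool_result_message(tool_use_id: str, screenshot_b64: str) -> dict:
    """Build the follow-up user message that reports the new screen state."""
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": [{
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": screenshot_b64,
                },
            }],
        }],
    }


# Example response content shaped like a click request (values are made up).
content = [
    {"type": "text", "text": "I'll click Submit."},
    {"type": "tool_use", "id": "toolu_example", "name": "computer",
     "input": {"action": "left_click", "coordinate": [642, 511]}},
]
block = extract_action(content)
x, y = block["input"]["coordinate"]
# A real executor would act here, e.g. pyautogui.click(x, y), then re-capture.
follow_up = tool_result_message(block["id"], "<base64 screenshot>")
```

If `extract_action` returns `None`, the model has answered in text only, which is a natural signal to end the loop.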

Computer Use vs. Browser Use#

Dimension	Computer Use	Browser Use
Scope	Any desktop application	Web browsers only
Interface	Visual GUI, any application	DOM + web rendering
Speed	Slower (screenshot-based perception)	Faster (direct DOM access)
Reliability	Lower — layout-dependent	Higher — structured page model
Setup complexity	Higher (VM, display server)	Lower (headless browser)
Best for	Legacy desktop apps, no-API systems	Web workflows, form filling

For web-focused tasks, Browser Use is generally more reliable and faster. Computer use is appropriate when desktop applications or non-browser interfaces are involved.

Real-World Examples#

Legacy ERP data entry#

A logistics company processes 200 shipment records per day into a legacy ERP system with no API. A computer use agent reads from a spreadsheet, navigates to the ERP's data entry form, fills in each field, submits the record, and captures a confirmation screenshot. What took a data entry clerk 4 hours now runs overnight unattended.
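One way to drive a workflow like this is to turn each spreadsheet row into its own narrowly scoped instruction, so a failure affects one record rather than the whole batch. The sketch below assumes the spreadsheet is exported as CSV; the column names, the sample rows, and the instruction wording are all hypothetical.

```python
import csv
import io


def build_tasks(csv_text: str) -> list[str]:
    """Turn each spreadsheet row into one self-contained agent instruction."""
    tasks = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        fields = "; ".join(f"{name} = {value}" for name, value in row.items())
        tasks.append(
            "In the shipment entry form, fill in these fields exactly, "
            f"submit the record, then take a screenshot: {fields}"
        )
    return tasks


# Made-up sample data standing in for the exported spreadsheet.
sample = (
    "shipment_id,destination,weight_kg\n"
    "S-1001,Rotterdam,420\n"
    "S-1002,Osaka,75\n"
)
tasks = build_tasks(sample)
```

Each string in `tasks` would be sent as the user instruction for one run of the agent loop, with the confirmation screenshot stored alongside the record.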

Healthcare prior authorization#

A medical billing team submits prior authorizations through an insurance portal that offers no API integration. A computer use agent navigates the multi-step form, uploads supporting documents, and tracks submission status — following the same workflow a human staff member would, but without requiring manual execution.

Competitive price monitoring#

A retail team tracks competitor pricing on websites that block conventional scrapers. A computer use agent navigates product pages in a browser the way a shopper would, reads the price information from the rendered page, and logs it to a spreadsheet for analysis.

Risks and Safety Considerations#

Computer use agents require careful security design:

Prompt injection risk: Malicious content on screen (e.g., hidden text instructing the agent to send data to an external URL) can hijack agent behavior. Mitigation: define explicit task scope and validate actions against expected workflows.
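Validating actions against the expected workflow can be as simple as an allowlist check that runs before every action. The sketch below is one possible policy for a single form-filling task. The allowed action set, the screen region, and the rule that typed text must not contain a URL are all assumptions chosen for illustration.

```python
from typing import Tuple

# Hypothetical policy for one narrow task (filling in a single form).
ALLOWED_ACTIONS = {"screenshot", "left_click", "type", "key", "scroll"}
FORM_REGION = (200, 100, 1080, 700)  # left, top, right, bottom in pixels


def is_in_scope(
    action: dict,
    region: Tuple[int, int, int, int] = FORM_REGION,
) -> bool:
    """Return True only if the action type and target fit the expected workflow."""
    if action.get("action") not in ALLOWED_ACTIONS:
        return False
    coordinate = action.get("coordinate")
    if coordinate is not None:
        x, y = coordinate
        left, top, right, bottom = region
        if not (left <= x <= right and top <= y <= bottom):
            return False
    text = action.get("text") or ""
    # This task never needs to type a URL, so treat one as a red flag.
    if action.get("action") == "type" and ("http://" in text or "https://" in text):
        return False
    return True
```

A check like this does not detect injected instructions on screen, but it limits what a hijacked agent can do: an out-of-scope action is dropped and can be logged for review.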

Irreversible actions: A mistaken click on "Delete All" has immediate consequences. Mitigation: use human-in-the-loop checkpoints for destructive or high-stakes actions, and run agents in sandboxed environments.
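A human-in-the-loop checkpoint can be sketched as a wrapper around the executor. Here `intent` is assumed to be the model's own text description of what it is about to do, `approve` is a hypothetical callback that asks a reviewer, and the keyword list is a crude heuristic that errs toward asking too often.

```python
from typing import Callable

# Assumed keyword list; a real deployment would tune this to its own UI.
HIGH_STAKES_WORDS = ("delete", "remove", "submit", "pay", "send", "confirm")


def needs_approval(intent: str) -> bool:
    """Flag an action whose stated intent mentions a high-stakes verb."""
    lowered = intent.lower()
    return any(word in lowered for word in HIGH_STAKES_WORDS)


def guarded_execute(
    action: dict,
    intent: str,
    execute: Callable[[dict], None],
    approve: Callable[[dict, str], bool],
) -> bool:
    """Execute the action unless it is high-stakes and the reviewer declines.

    Returns True if the action ran, False if it was blocked.
    """
    if needs_approval(intent) and not approve(action, intent):
        return False
    execute(action)
    return True
```

Because the check uses substring matching, harmless intents such as "resend" or "payload" also trigger review, which is usually the safer failure mode for destructive actions.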

Credential exposure: If an agent is logged into sensitive systems, it can inadvertently expose those credentials through screenshots or actions. Mitigation: use dedicated agent accounts with minimum necessary permissions, and never expose production credentials to agent environments.

Over-broad access: An agent running on a full desktop has access to everything visible. Mitigation: run agents in isolated virtual machines with access limited to specific applications and network destinations.

When to Use Computer Use#

✅ Good fit:

Legacy systems with no API
Applications where API integration is cost-prohibitive
Supervised workflows where human review occurs before final submission
Internal enterprise tools with complex visual interfaces

❌ Avoid when:

An API exists and is accessible (browser automation or direct API is faster and more reliable)
The task involves financial transactions or data deletion without a review step
The interface changes frequently (UI layout changes break coordinate-based actions)

Common Misconceptions#

Misconception: Computer use agents are production-ready for any task.
Reliability improves dramatically when the task is scoped narrowly. General-purpose autonomous computer use remains research-grade. Specific, well-defined tasks with clear success criteria and human checkpoints are where it delivers value today.

Misconception: Computer use is the same as RPA.
Traditional RPA (Robotic Process Automation) relies on deterministic scripts tied to fixed coordinates or element selectors. Computer use agents use visual understanding to decide where to click at run time, so they can often adapt to UI changes without script updates.

Misconception: Computer use requires no engineering investment.
Building reliable computer use workflows requires sandboxed execution environments, screen capture infrastructure, error handling for unexpected states, and retry logic. The implementation investment is real.
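As one example of that investment, retry logic for a single UI step might look like the sketch below. `act` and `succeeded` are hypothetical callables: the first performs the step, and the second inspects a fresh screenshot to confirm that it worked.

```python
import time
from typing import Callable


def run_step_with_retry(
    act: Callable[[], None],
    succeeded: Callable[[], bool],
    max_attempts: int = 3,
    delay_s: float = 1.0,
) -> int:
    """Perform a UI step and re-check the screen until it verifiably worked.

    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    for attempt in range(1, max_attempts + 1):
        act()
        if succeeded():
            return attempt
        if attempt < max_attempts:
            time.sleep(delay_s)  # give the UI time to settle before retrying
    raise RuntimeError(f"step failed after {max_attempts} attempts")
```

Blind retries are only safe for steps that can be repeated without side effects. A step such as submitting a form should verify the current state first so that a slow page does not cause a double submission.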

Related Terms#

Browser Use — Computer use restricted to web browsers
Action Space — The set of actions available to an agent
Tool Calling — The API-based alternative to computer use
Human-in-the-Loop — Adding human review to agent workflows
AI Agents — The broader category computer use agents belong to
AI Agent Security Best Practices — Securing agents with system-level access
AI Agents vs Chatbots — How computer use agents differ from conversational bots

Frequently Asked Questions#

What is computer use in AI agents?#

Computer use is an AI agent's ability to perceive and interact with any computer screen — clicking, typing, scrolling, and reading visual output — without requiring API integration. The agent works the way a human operator would.

How does Claude's computer use work?#

Claude receives screenshots as visual input, interprets the UI state, and outputs structured action commands (click coordinates, keyboard input). The developer's infrastructure executes those actions and provides the next screenshot, creating an action loop.

What is the difference between computer use and browser use?#

Browser use focuses exclusively on web browsers with faster, more reliable DOM-based interaction. Computer use applies to any visual interface, including desktop applications, but with lower reliability due to screenshot-based perception and coordinate-dependent clicking.

Is computer use reliable enough for production use?#

Computer use is production-appropriate for narrow, supervised workflows with human checkpoints on high-stakes actions. General autonomous computer use across arbitrary interfaces remains unreliable due to layout variation and error recovery challenges.
