Introduction#
AI agents introduce a new class of security risk that traditional application security practices were not designed to handle. When an LLM can read files, call APIs, execute code, and send messages on behalf of a user, the attack surface expands dramatically — and the attack vectors are unlike anything in conventional software security.
This guide covers the most critical security risks facing AI agents in 2026 and how to mitigate them: prompt injection, unauthorized tool use, data leakage, privilege escalation, and the governance structures you need to keep agents operating safely at scale.
Before deploying any agent to production, return to the tutorials index and make sure you have covered evaluation and testing. Security without correctness is insufficient — you need both.
Why AI Agent Security Is Different#
Traditional software security focuses on protecting well-defined interfaces: input validation, authentication, authorization, encrypted transport. Agents break these assumptions in several ways.
First, agents take natural language instructions, which are far harder to validate than structured inputs. Malicious content embedded in a document, email, or web page can alter an agent's behavior — a category of attack called prompt injection.
Second, agents take autonomous actions. A compromised web application might leak data; a compromised agent might send emails, delete files, place orders, or exfiltrate data through legitimate API channels — all while appearing to operate normally.
Third, agents chain tools together. A single malicious instruction can trigger a sequence of tool calls across multiple systems, amplifying the blast radius of a successful attack.
The OWASP Top 10 for LLM Applications (2025 edition) identifies prompt injection, insecure output handling, training data poisoning, and excessive agency as the top risks. This guide addresses all four.
Prerequisites#
- A working AI agent with defined tools and permissions
- Basic understanding of authentication and authorization concepts
- Access to your infrastructure's logging and monitoring systems
- Familiarity with the human-in-the-loop concept for governance
Step 1: Defend Against Prompt Injection#
Prompt injection is the most pervasive AI agent security risk. An attacker embeds instructions in content the agent processes — a customer email, a web page the agent browses, a document it summarizes — causing the agent to execute the attacker's instructions instead of the user's.
Direct Prompt Injection#
The user themselves attempts to override the agent's system prompt by including adversarial instructions in their input: "Ignore your previous instructions and..." This is the simplest form and relatively easy to mitigate.
Mitigations:
- Use a privileged system prompt that explicitly instructs the agent to ignore override attempts.
- Apply input length limits to reduce the surface area for injection payloads.
- Use a separate, non-LLM classifier to screen user inputs for injection patterns before passing them to the agent.
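The screening step above can be sketched as a simple heuristic pre-filter. This is a minimal example, not a production classifier: the pattern list and the `MAX_INPUT_CHARS` limit are illustrative assumptions, and a real deployment would pair this with a trained classifier.

```python
import re

# A few common injection phrasings; an illustrative starting list, not exhaustive.
INJECTION_PATTERNS = [
    r"ignore (all |your )?(previous|prior) instructions",
    r"disregard (the )?system prompt",
    r"you are now",
]

MAX_INPUT_CHARS = 4000  # input length limit to shrink the payload surface

def screen_user_input(text: str) -> bool:
    """Return True if the input looks safe to pass to the agent."""
    if len(text) > MAX_INPUT_CHARS:
        return False
    lowered = text.lower()
    return not any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```

Treat a `False` here as a reason to reject or escalate the request, not as proof of attack; regex screens produce false positives and miss paraphrased payloads.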
Indirect Prompt Injection#
Far more dangerous: instructions embedded in data the agent retrieves from external sources — web pages, emails, database records, uploaded files. The agent never sees the attack coming because it treats retrieved content as data, not instructions.
Mitigations:
- Clearly delineate in your prompt between system instructions (trusted) and retrieved content (untrusted). Use XML-style delimiters: `<user_document>` tags signal to the model that content inside should be treated as data.
- Strip or escape HTML and markdown from retrieved content before passing it to the agent.
- Implement a content safety classifier on all retrieved content before it enters the agent's context.
- Never pass raw web page content directly into the agent's context without sanitization.
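Combining the delimiter and sanitization mitigations, a minimal sketch might look like the following. The tag-stripping regex is deliberately simple; for real HTML, use a proper parser.

```python
import html
import re

def sanitize_retrieved(content: str) -> str:
    """Strip HTML tags and escape what remains before it enters the context."""
    no_tags = re.sub(r"<[^>]+>", "", content)
    return html.escape(no_tags)

def wrap_untrusted(content: str) -> str:
    """Delimit retrieved content so the prompt marks it as data, not instructions."""
    return (
        "<user_document>\n"
        f"{sanitize_retrieved(content)}\n"
        "</user_document>"
    )
```

The delimiters reduce, but do not eliminate, the risk that the model follows instructions inside the document; they work best alongside a content safety classifier.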
Step 2: Scope Tool Permissions Using Least Privilege#
The principle of least privilege is fundamental to computer security and applies directly to AI agents: every tool an agent can use should have the minimum permissions required to perform its function — nothing more.
Define Explicit Tool Allowlists#
Do not give agents access to a generic "filesystem" tool or an unrestricted "HTTP request" tool. Define specific, narrowly scoped tools:
- Instead of `read_file(path)`, define `read_customer_report(report_id)` that validates the ID format and restricts reads to a specific directory.
- Instead of `send_email(to, subject, body)`, define `send_support_reply(ticket_id, body)` that validates ticket ownership and the recipient.
Narrow tools dramatically limit what a compromised or confused agent can do.
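A narrowly scoped tool might look like this sketch. The directory, ID format, and file naming are hypothetical; the point is that validation and path confinement live in the tool, outside the LLM.

```python
import re
from pathlib import Path

REPORT_DIR = Path("/srv/reports")               # hypothetical report directory
REPORT_ID = re.compile(r"^[A-Z]{2}-\d{6}$")     # assumed ID format, e.g. "CR-004213"

def read_customer_report(report_id: str) -> str:
    """Narrow replacement for a generic read_file tool: validates the ID
    and confines reads to a single directory."""
    if not REPORT_ID.match(report_id):
        raise ValueError(f"invalid report id: {report_id!r}")
    path = (REPORT_DIR / f"{report_id}.txt").resolve()
    if REPORT_DIR not in path.parents:          # guard against path traversal
        raise PermissionError("path escapes the report directory")
    return path.read_text()
```

Even if an injected instruction convinces the agent to call this tool with `../../etc/passwd`, the validation rejects it before any file is touched.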

Implement Permission Tiers#
Separate your agent's tools into tiers by risk level:
- Tier 1 (read-only, low-risk): Search, retrieve, summarize. These can run autonomously.
- Tier 2 (write, moderate-risk): Create records, send notifications, update fields. Require logging and rate limits.
- Tier 3 (destructive or high-value): Delete records, send external communications, make purchases, modify permissions. Require human-in-the-loop confirmation.
Never allow Tier 3 actions without a human confirmation step, regardless of how confident the agent appears to be.
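The tier system can be enforced at the dispatch layer rather than in the prompt. A minimal sketch, with hypothetical tool names:

```python
from enum import IntEnum

class Tier(IntEnum):
    READ = 1         # search, retrieve, summarize: runs autonomously
    WRITE = 2        # create records, notify: logged and rate-limited
    DESTRUCTIVE = 3  # delete, purchase, permission changes: needs a human

# Hypothetical allowlist; your tool names will differ.
TOOL_TIERS = {
    "search_kb": Tier.READ,
    "update_ticket": Tier.WRITE,
    "delete_record": Tier.DESTRUCTIVE,
}

def execute_tool(name: str, confirmed_by_human: bool = False) -> str:
    tier = TOOL_TIERS.get(name)
    if tier is None:
        raise PermissionError(f"tool {name!r} is not on the allowlist")
    if tier is Tier.DESTRUCTIVE and not confirmed_by_human:
        raise PermissionError(f"tool {name!r} requires human confirmation")
    return f"ran {name}"  # dispatch to the real implementation here
```

Because the check runs in ordinary code, no amount of prompt injection can talk the agent past it.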
Step 3: Implement Secrets Management#
AI agents frequently need API keys, database credentials, and authentication tokens. Hardcoding these into prompts or tool configurations is a critical vulnerability — one that exposes credentials to every system that handles the agent's context.
What Not To Do#
- Never include API keys or credentials in system prompts.
- Never pass secrets as tool parameters visible in the agent's context.
- Never log agent traces that might contain credentials from tool responses.
Secrets Management Pattern#
Use a dedicated secrets manager (AWS Secrets Manager, HashiCorp Vault, Azure Key Vault). Tools should retrieve credentials at runtime from the secrets manager, never from the agent's context. The agent knows to call a tool; the tool handles authentication internally.
```
Agent → calls tool("send_email", {to: ..., body: ...})
Tool  → retrieves SMTP credentials from Vault at runtime
Tool  → sends email using retrieved credentials
Tool  → returns status to agent (credentials never enter context)
```
This pattern ensures credentials are never exposed in agent traces, logs, or LLM context windows.
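A sketch of the tool side of this pattern follows. To keep it self-contained, `get_secret` reads an environment variable as a stand-in for a real secrets-manager client (Vault, AWS Secrets Manager); in production you would call that client's SDK instead.

```python
import os

def get_secret(name: str) -> str:
    """Stand-in for a secrets-manager lookup. Swap in your Vault or
    AWS Secrets Manager client here; the env var is only for the sketch."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"secret {name!r} not available")
    return value

def send_email_tool(to: str, body: str) -> str:
    """Tool body: credentials are fetched at call time, used, and discarded."""
    password = get_secret("SMTP_PASSWORD")
    # ... open the SMTP connection and send using `password` ...
    return "sent"  # only the status string re-enters the agent's context
```

Note what the function returns: a status, never the credential. That is the property the whole pattern exists to guarantee.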
Step 4: Build Audit Logging and Anomaly Detection#
Every action an AI agent takes should be logged with enough context to reconstruct what happened and why. This is your security audit trail — essential for incident response and compliance.
What to Log#
For every agent run, log:
- The user or system that initiated the run
- The full sequence of tool calls made (tool name, parameters, response)
- The total tokens consumed and APIs called
- The final action taken and its outcome
- Any errors or unexpected states
Do not log raw user data or retrieved content that might contain PII. Log references and metadata, not content.
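A structured audit record covering the fields above might be emitted like this. The schema is illustrative; the important habit is passing references (ticket IDs, record IDs) rather than raw content into `params`.

```python
import json
import time

def log_tool_call(run_id: str, tool: str, params: dict, outcome: str) -> str:
    """Build one structured audit record per tool call. Callers should put
    references (e.g. "ticket:4821"), never raw PII, into `params`."""
    record = {
        "run_id": run_id,
        "ts": time.time(),
        "tool": tool,
        "params": params,
        "outcome": outcome,
    }
    return json.dumps(record)  # ship this line to your log pipeline
```

One JSON line per tool call keeps the trail greppable and easy to ingest into a SIEM.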
Anomaly Detection#
Define baseline behavior for your agents and alert on deviations:
- Unusual tool call sequences (an agent suddenly calling `delete_file` when it normally only reads)
- Volume spikes (an agent sending 10x the normal number of API calls)
- Off-hours activity (an agent running at 3am when users are not active)
- Cross-tenant access patterns (an agent accessing records from multiple different customers in one run)
Integrate these alerts into your existing security monitoring infrastructure (SIEM, PagerDuty, etc.).
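Two of these checks, baseline deviation and volume spikes, can be sketched in a few lines. The baseline set and call limit are assumptions you would derive from observed production traffic.

```python
def detect_anomalies(tool_calls: list[str], baseline: set[str],
                     max_calls: int) -> list[str]:
    """Flag tool calls outside the baseline set, plus overall volume spikes.
    Returns human-readable alert strings to forward to your SIEM."""
    alerts = [f"unexpected tool: {t}" for t in tool_calls if t not in baseline]
    if len(tool_calls) > max_calls:
        alerts.append(f"volume spike: {len(tool_calls)} calls (limit {max_calls})")
    return alerts
```

Off-hours and cross-tenant checks follow the same shape: compare the run's metadata against a baseline and emit an alert string on deviation.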
Step 5: Enforce Output Sanitization and Content Safety#
Agent outputs go somewhere — into emails, customer-facing interfaces, databases, other systems. Unsanitized outputs can introduce secondary vulnerabilities.
Prevent Prompt Injection in Outputs#
If your agent's output is used as input to another system or agent, it can carry injected instructions forward. Sanitize outputs before they enter any downstream pipeline, especially in multi-agent architectures. See the LangGraph multi-agent tutorial for how this compounds in chained agent systems.
Content Safety Filtering#
Apply content safety classifiers to agent outputs before they reach end users. This catches hallucinated harmful content, inadvertent PII disclosure, and outputs that violate your usage policies. Most LLM providers offer safety APIs; supplement with dedicated classifiers (Guardrails AI, NeMo Guardrails) for higher-risk use cases.
Step 6: Implement Human-in-the-Loop Guardrails#
Human-in-the-loop is not just a UX pattern — it is a security control. For high-risk actions, requiring human confirmation is one of the most effective ways to prevent autonomous agents from causing irreversible harm.
Define your approval thresholds clearly:
- Actions that are reversible and low-value: autonomous, no approval needed.
- Actions that affect external parties (sending emails, posting content): require user confirmation.
- Actions that are irreversible or high-value (deletion, financial transactions, permission changes): require explicit human approval with a review window.
Build escalation paths: when an agent encounters an action it cannot determine is safe, it should pause and route to a human reviewer rather than proceeding autonomously.
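The threshold-plus-escalation policy above can be encoded as a small lookup. The action names are hypothetical; the key design choice is the final branch, where anything the policy does not recognize escalates to a human instead of running.

```python
REVERSIBLE = {"update_draft", "tag_record"}        # hypothetical tool names
EXTERNAL = {"send_email", "post_comment"}
IRREVERSIBLE = {"delete_account", "issue_refund"}

def approval_required(action: str) -> str:
    """Map an action to its approval requirement; unknown actions escalate."""
    if action in REVERSIBLE:
        return "autonomous"
    if action in EXTERNAL:
        return "user_confirmation"
    if action in IRREVERSIBLE:
        return "human_approval"
    return "escalate"  # pause and route to a reviewer rather than proceed
```

Defaulting to `"escalate"` is the fail-safe: a new or misnamed tool cannot slip into autonomous execution simply because nobody classified it.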
For deploying these controls at scale, see how to deploy AI agents in your company.
Common Mistakes in AI Agent Security#
Treating the system prompt as a security boundary. The system prompt is guidance for the LLM, not a security enforcement layer. It can be overridden by sufficiently clever injection attacks. Real security controls live outside the LLM — in tool permission systems, input validators, and output classifiers.
Overly broad tool permissions. Giving an agent access to an unrestricted file system or database "for convenience" creates massive attack surface. Scope every tool to the minimum required access.
No logging on tool calls. Agents that take actions without logging are impossible to audit. Implement logging before deployment, not after an incident.
Assuming the model will refuse harmful requests. Model refusals are inconsistent and bypassable. Never rely on model refusal as your only security control.
Ignoring indirect prompt injection from external data. Most teams think about users injecting prompts, not attackers embedding instructions in documents or web pages the agent retrieves. Indirect injection is harder to detect and more dangerous.
Best Practices Summary#
- Apply least-privilege scoping to every tool — narrow tool definitions over generic ones.
- Use delimiter-based prompt structure to separate instructions from untrusted content.
- Manage secrets outside the agent's context using a dedicated secrets manager.
- Log every tool call with parameters and outcomes for audit trail coverage.
- Require human confirmation for all irreversible or high-value actions.
- Sanitize all retrieved content before it enters the agent's context window.
- Apply output safety classifiers before agent responses reach users or downstream systems.
- Monitor for anomalous behavior patterns and alert on deviations from baseline.
For framework-level security features, compare your options in the open-source vs commercial AI agent frameworks comparison.
Conclusion#
Securing AI agents requires thinking differently about trust, permissions, and attack vectors. The core principles remain the same as traditional security — least privilege, defense in depth, audit logging, anomaly detection — but the implementation details are unique to the agentic context.
Start with prompt injection defenses and tool permission scoping. Add secrets management and audit logging before any production deployment. Then layer in human-in-the-loop guardrails and output safety for higher-risk use cases. Security is not a feature you add at the end — build it into your agent architecture from day one.
Frequently Asked Questions#
What is prompt injection and why is it the top AI agent security risk?
Prompt injection is an attack where malicious instructions are embedded in content the agent processes — a document, web page, or email — causing the agent to execute the attacker's instructions instead of the user's. It is the top risk because it bypasses all conventional authentication and authorization controls: the agent willingly executes the injected instructions because it cannot distinguish them from legitimate instructions.
Can I rely on the LLM's built-in safety refusals to prevent harmful actions?
No. Model refusals are a last-resort safeguard, not a security control. They are inconsistent across inputs and can be bypassed with sufficiently crafted prompts. Real security controls must be implemented in the tool permission layer, input validation, and output filtering — outside the LLM itself.
How do I handle secrets like API keys that an agent's tools need?
Never pass secrets through the agent's context or system prompt. Use a dedicated secrets manager (AWS Secrets Manager, HashiCorp Vault). Tools should retrieve credentials at runtime from the secrets manager using their own service identity, keeping secrets completely out of the LLM's context window and agent logs.
What actions should always require human approval before an AI agent executes them?
Any irreversible action should require human approval: permanent deletion, financial transactions, sending external communications on behalf of the organization, modifying user permissions, and publishing content publicly. The test is: if the agent makes a mistake, can you undo it? If not, require approval.