Datadog aggregates metrics, logs, traces, and events from hundreds of integrations into a single observability platform. The challenge isn't collecting the data — it's making sense of it fast enough to prevent small anomalies from becoming major outages. AI agents connected to Datadog close this gap: they can query multiple signals simultaneously, correlate metric spikes with deployment events, and surface actionable findings in plain language before a human engineer has opened a single dashboard.
For SREs, platform engineers, and infrastructure teams drowning in alert volume, Datadog AI integration transforms monitoring from passive dashboards into proactive intelligence.
What AI Agents Can Do With Datadog Access#
Infrastructure Intelligence
- Query CPU, memory, and network metrics across all hosts to identify outliers
- Detect services approaching capacity limits before they become incidents
- Compare infrastructure performance before and after a deployment
- Generate a daily infrastructure health summary in plain language
Monitor and Alert Management
- List all currently triggered monitors ranked by alert count and priority
- Mute monitors automatically during planned maintenance windows
- Identify monitors with consistently high false-positive rates for tuning
- Create new monitors based on patterns observed in metric queries
Log Analysis
- Search for specific error patterns across service logs without opening the Datadog UI
- Count error occurrences by service over a configurable time window
- Extract the most common error messages from a log search for root cause analysis
- Correlate log error spikes with metric anomalies during incident investigation
Setting Up Datadog API Access#
pip install datadog-api-client langchain langchain-openai python-dotenv
Get your keys from Datadog → Organization Settings:
- API Keys: Create a key labeled "ai-agent"
- Application Keys: Create a key labeled "ai-agent" with your user's scopes
export DATADOG_API_KEY="your-api-key"
export DATADOG_APP_KEY="your-application-key"
export DATADOG_SITE="datadoghq.com" # or datadoghq.eu for EU region
Test your connection:
import os

from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.authentication_api import AuthenticationApi

configuration = Configuration()
configuration.api_key["apiKeyAuth"] = os.getenv("DATADOG_API_KEY")
configuration.api_key["appKeyAuth"] = os.getenv("DATADOG_APP_KEY")

with ApiClient(configuration) as api_client:
    api_instance = AuthenticationApi(api_client)
    response = api_instance.validate()
    print(f"Connected — API key valid: {response.valid}")
Option 1: No-Code with n8n#
Infrastructure Daily Health Report Workflow#
- Schedule Trigger: Every morning at 8am
- HTTP Request: Datadog Metrics API — query `avg:system.cpu.user{*}` and `avg:system.mem.used{*}` for the past 24 hours
- HTTP Request: Datadog Monitors API — fetch all monitors in Alert or Warn state
- Code node: Identify hosts with CPU > 80% or memory > 90%, count triggered monitors by service
- OpenAI: "Write a 5-bullet infrastructure health summary. Flag any critical capacity issues. Recommend one preventive action."
- Slack: Post to the `#infra-ops` channel
For Datadog in n8n, use HTTP Request nodes with DD-API-KEY and DD-APPLICATION-KEY headers — no native node required for most Datadog operations.
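The Code node's threshold check in step 4 can be sketched in plain Python (n8n's JavaScript equivalent follows the same shape); the host names, metric layout, and threshold defaults here are illustrative:

```python
def flag_hot_hosts(host_metrics: dict, cpu_limit: float = 80.0, mem_limit: float = 90.0) -> list:
    """Return (host, reason) pairs for hosts breaching CPU or memory thresholds."""
    flagged = []
    for host, m in host_metrics.items():
        reasons = []
        if m.get("cpu", 0) > cpu_limit:
            reasons.append(f"CPU {m['cpu']:.0f}%")
        if m.get("mem", 0) > mem_limit:
            reasons.append(f"memory {m['mem']:.0f}%")
        if reasons:
            flagged.append((host, ", ".join(reasons)))
    return flagged

# Made-up sample data standing in for the metrics fetched upstream
sample = {
    "web-1": {"cpu": 92.5, "mem": 71.0},
    "web-2": {"cpu": 45.0, "mem": 95.2},
    "db-1": {"cpu": 30.0, "mem": 60.0},
}
print(flag_hot_hosts(sample))  # web-1 and web-2 flagged, db-1 clean
```

The flagged list then feeds the OpenAI summary step as structured input.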
Option 2: LangChain with Python#
Build Datadog Tools#
import os
import time
import requests
from datetime import datetime, timedelta, timezone
from langchain.tools import tool
from dotenv import load_dotenv
load_dotenv()
API_KEY = os.getenv("DATADOG_API_KEY")
APP_KEY = os.getenv("DATADOG_APP_KEY")
SITE = os.getenv("DATADOG_SITE", "datadoghq.com")
DD_BASE = f"https://api.{SITE}"
def dd_headers() -> dict:
    """Return Datadog authentication headers."""
    return {
        "DD-API-KEY": API_KEY,
        "DD-APPLICATION-KEY": APP_KEY,
        "Content-Type": "application/json"
    }

def dd_get(path: str, params: dict = None) -> dict:
    """Execute a Datadog API GET request."""
    resp = requests.get(f"{DD_BASE}{path}", headers=dd_headers(), params=params or {})
    resp.raise_for_status()
    return resp.json()

def dd_post(path: str, json_data: dict = None) -> dict:
    """Execute a Datadog API POST request."""
    resp = requests.post(f"{DD_BASE}{path}", headers=dd_headers(), json=json_data or {})
    resp.raise_for_status()
    return resp.json()
@tool
def query_metrics(metric_query: str, hours: int = 1) -> str:
    """
    Query Datadog metrics using Datadog query language.
    metric_query: e.g., 'avg:system.cpu.user{*}' or 'sum:nginx.net.request_per_s{env:prod}'
    hours: lookback window in hours (default 1).
    """
    now = int(time.time())
    start = now - (hours * 3600)
    data = dd_get("/api/v1/query", {
        "from": start,
        "to": now,
        "query": metric_query
    })
    series = data.get("series", [])
    if not series:
        return f"No data returned for query: {metric_query}"
    lines = [f"Metric query: {metric_query} (last {hours}h)"]
    for s in series[:5]:  # Show up to 5 series
        scope = s.get("scope", "global")
        points = s.get("pointlist", [])
        if points:
            values = [p[1] for p in points if p[1] is not None]
            if values:
                avg = sum(values) / len(values)
                max_val = max(values)
                min_val = min(values)
                lines.append(f"  {scope}: avg={avg:.2f}, max={max_val:.2f}, min={min_val:.2f}")
    return "\n".join(lines)
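The pointlist parsing above reduces each series to avg/max/min while skipping null gaps. The same reduction on a hypothetical pointlist (Datadog returns `[[timestamp_ms, value], ...]`, where value may be `None`):

```python
# Hypothetical series data with one null gap
pointlist = [[1700000000000, 10.0], [1700000060000, None], [1700000120000, 30.0]]

# Drop None values before aggregating, exactly as the tool does
values = [p[1] for p in pointlist if p[1] is not None]
summary = {
    "avg": sum(values) / len(values),
    "max": max(values),
    "min": min(values),
}
print(summary)  # {'avg': 20.0, 'max': 30.0, 'min': 10.0}
```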
@tool
def get_triggered_monitors(priority: str = None) -> str:
    """
    List monitors currently in Alert or Warn state.
    priority: filter by 'alert' or 'warn' (optional, returns both if not specified).
    """
    params = {"group_states": "alert,warn", "page_size": 50}
    if priority:
        params["group_states"] = priority.lower()
    data = dd_get("/api/v1/monitor", params=params)
    monitors = data if isinstance(data, list) else data.get("monitors", [])
    # Keep only monitors whose overall state is actually triggered
    triggered = [m for m in monitors if m.get("overall_state") in ("Alert", "Warn", "No Data")]
    if not triggered:
        return "No monitors currently in Alert or Warn state"
    lines = [f"Triggered monitors ({len(triggered)}):"]
    for mon in triggered[:20]:
        name = mon.get("name", "Unnamed")[:70]
        state = mon.get("overall_state", "Unknown")
        mon_id = mon.get("id")
        query = mon.get("query", "")[:80]
        lines.append(f"  [{state}] {name}\n    ID: {mon_id} | Query: {query}")
    return "\n".join(lines)
@tool
def mute_monitor(monitor_id: int, duration_hours: int = 1, message: str = "") -> str:
    """
    Mute a Datadog monitor for a specified duration (e.g., during planned maintenance).
    monitor_id: the numeric monitor ID.
    duration_hours: how many hours to mute (default 1).
    """
    end_time = int(time.time()) + (duration_hours * 3600)
    payload = {"end": end_time}
    if message:
        payload["message"] = message
    dd_post(f"/api/v1/monitor/{monitor_id}/mute", payload)
    # Format the end time in UTC explicitly to match the label
    unmute_at = datetime.fromtimestamp(end_time, tz=timezone.utc)
    return (f"Monitor {monitor_id} muted for {duration_hours} hour(s). "
            f"Auto-unmutes at {unmute_at.strftime('%Y-%m-%d %H:%M UTC')}")
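A detail worth flagging: `datetime.fromtimestamp` without a `tz` argument converts to the machine's local zone, so a message labeled UTC should pass `tz=timezone.utc` explicitly. A quick sketch of the epoch-to-stamp conversion:

```python
import time
from datetime import datetime, timezone

# A Datadog mute "end" is a plain Unix epoch (seconds)
end_time = int(time.time()) + 2 * 3600

# tz=timezone.utc makes the rendered stamp match its "UTC" label
stamp = datetime.fromtimestamp(end_time, tz=timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
print(stamp)
```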
@tool
def search_logs(query: str, hours: int = 1, limit: int = 50) -> str:
    """
    Search Datadog logs for a query pattern.
    query: Datadog log search query (e.g., 'service:api status:error').
    hours: lookback window in hours.
    limit: maximum number of log events to fetch (default 50).
    """
    now = datetime.now(timezone.utc)
    start = now - timedelta(hours=hours)
    payload = {
        "filter": {
            "query": query,
            "from": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "to": now.strftime("%Y-%m-%dT%H:%M:%SZ")
        },
        "page": {"limit": limit},
        "sort": "timestamp"
    }
    data = dd_post("/api/v2/logs/events/search", payload)
    logs = data.get("data", [])
    if not logs:
        return f"No logs found for query: '{query}' in the last {hours}h"
    lines = [f"Log search: '{query}' — {len(logs)} results (last {hours}h):"]
    # Count distinct messages by frequency to surface repeating patterns
    message_counts = {}
    for log in logs:
        attrs = log.get("attributes", {})
        msg = attrs.get("message", "")[:100]
        message_counts[msg] = message_counts.get(msg, 0) + 1
    lines.append("\nMost frequent messages:")
    for msg, count in sorted(message_counts.items(), key=lambda x: x[1], reverse=True)[:5]:
        lines.append(f"  ({count}x) {msg}")
    return "\n".join(lines)
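The manual dict counting above works fine; `collections.Counter` expresses the same top-N pattern more compactly. The sample messages here are made up:

```python
from collections import Counter

# Hypothetical messages pulled from a log search result
messages = [
    "timeout connecting to db",
    "timeout connecting to db",
    "null pointer in checkout",
    "timeout connecting to db",
]

# Truncate to 100 chars (matching the tool) so near-identical long
# messages collapse into one bucket, then rank by frequency
counts = Counter(msg[:100] for msg in messages)
for msg, n in counts.most_common(2):
    print(f"({n}x) {msg}")
```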
@tool
def get_service_summary(service_name: str, hours: int = 24) -> str:
    """
    Get APM performance summary for a service including request rate, error rate, and latency.
    service_name: the Datadog service name (e.g., 'web-api', 'checkout-service').
    hours: lookback window in hours (default 24).
    """
    now = int(time.time())
    start = now - (hours * 3600)
    scope = f"service:{service_name}"
    # .as_rate() converts counters to per-second values
    metrics = {
        "Request rate (req/s)": f"sum:trace.web.request.hits{{{scope}}}.as_rate()",
        "Error rate (errors/s)": f"sum:trace.web.request.errors{{{scope}}}.as_rate()",
        "P95 latency": f"p95:trace.web.request.duration{{{scope}}}"
    }
    lines = [f"Service summary: {service_name} (last {hours}h):"]
    for label, query in metrics.items():
        try:
            data = dd_get("/api/v1/query", {"from": start, "to": now, "query": query})
            series = data.get("series", [])
            if series:
                points = [p[1] for p in series[0].get("pointlist", []) if p[1] is not None]
                if points:
                    avg = sum(points) / len(points)
                    lines.append(f"  {label}: {avg:.2f}")
                else:
                    lines.append(f"  {label}: No data")
        except Exception as e:
            lines.append(f"  {label}: Error querying ({str(e)[:50]})")
    return "\n".join(lines)
@tool
def get_infrastructure_overview(hours: int = 1) -> str:
    """Get a high-level overview of infrastructure health: CPU, memory, and host count across all hosts."""
    now = int(time.time())
    start = now - (hours * 3600)
    infra_queries = {
        "Avg CPU %": "avg:system.cpu.user{*}",
        # system.mem.pct_usable is a 0-1 fraction, so scale it to a percentage
        "Avg memory usable %": "avg:system.mem.pct_usable{*}*100",
        "Hosts reporting": "avg:system.cpu.user{*} by {host}"
    }
    lines = [f"Infrastructure overview (last {hours}h):"]
    for label, query in infra_queries.items():
        try:
            data = dd_get("/api/v1/query", {"from": start, "to": now, "query": query})
            series = data.get("series", [])
            if label == "Hosts reporting":
                lines.append(f"  Active hosts: {len(series)}")
            elif series:
                points = [p[1] for p in series[0].get("pointlist", []) if p[1] is not None]
                if points:
                    avg = sum(points) / len(points)
                    lines.append(f"  {label}: {avg:.1f}%")
        except Exception:
            # Surface the failure instead of silently dropping the row
            lines.append(f"  {label}: query failed")
    return "\n".join(lines)
Datadog Observability Agent#
from langchain_openai import ChatOpenAI
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [query_metrics, get_triggered_monitors, mute_monitor,
         search_logs, get_service_summary, get_infrastructure_overview]

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an infrastructure observability assistant with access to Datadog.

When investigating issues:
1. Start with triggered monitors to understand the current alert state
2. Query specific metrics to quantify the problem scope (affected hosts, error rates, latency)
3. Search logs to find error messages and patterns that explain the root cause
4. Correlate metric and log signals to build a complete picture
5. Muting monitors requires explicit confirmation of monitor ID — never mute without verifying

For metric queries, use Datadog query syntax:
- avg:metric_name{scope} — average across matching hosts
- sum:metric_name{*}.as_rate() — per-second rate for counter metrics
- p95:metric_name{env:prod} — 95th percentile latency"""),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True, max_iterations=8)

# Infrastructure investigation
result = executor.invoke({
    "input": "Check the current infrastructure health: any triggered monitors, overall CPU and memory usage, and search for any ERROR logs in the api service from the last hour."
})
print(result["output"])
Rate Limits and Best Practices#
| Datadog API limit | Value |
|---|---|
| Metrics query rate | 400 req/hour |
| Log search rate | 300 req/hour |
| Monitor list rate | 600 req/hour |
| Max metrics per query | 300 series |
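When a limit is hit, Datadog returns HTTP 429 along with rate-limit response headers. A minimal retry sketch, assuming the documented `X-RateLimit-Reset` header (seconds until the limit window resets) and falling back to exponential backoff without it:

```python
import time
import requests

def retry_wait_seconds(headers, attempt: int) -> int:
    """Seconds to wait after a 429: prefer Datadog's X-RateLimit-Reset
    header, else fall back to exponential backoff (1s, 2s, 4s, ...)."""
    return int(headers.get("X-RateLimit-Reset", 2 ** attempt))

def dd_get_with_backoff(url: str, headers: dict, params: dict = None, max_retries: int = 3) -> dict:
    """GET that retries on HTTP 429 instead of failing outright."""
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, params=params or {})
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        time.sleep(retry_wait_seconds(resp.headers, attempt))
    resp.raise_for_status()
```

Swapping this in for the plain `dd_get` helper keeps long agent sessions from dying on a single rate-limit hit.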
Best practices:
- Scope metric queries: Always filter with `{host:name}` or `{service:name}` tags rather than `{*}` on large infrastructure — wildcard queries can return hundreds of series and slow responses
- Use `.as_rate()` for counters: Counter metrics like request counts need `.as_rate()` in the query to get per-second rates instead of cumulative totals
- Cache metric results for dashboards: For recurring summary queries (every 5 minutes), cache results client-side rather than re-querying Datadog on each agent invocation
- Tag your mutes with messages: Always include a message when muting monitors explaining why — the `message` field appears in the audit trail and helps teammates understand muted alerts
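Client-side caching for recurring summaries can be as simple as a TTL-keyed dict; a minimal sketch, where the fetch lambda is a stand-in for a real Datadog call such as `query_metrics`:

```python
import time

_metric_cache: dict = {}

def cached(key: str, fetch, ttl: float = 300.0):
    """Return the cached value for key if it is younger than ttl seconds,
    otherwise call fetch() and store the fresh result."""
    now = time.time()
    if key in _metric_cache:
        value, stored_at = _metric_cache[key]
        if now - stored_at < ttl:
            return value
    value = fetch()
    _metric_cache[key] = (value, now)
    return value

# Usage: repeated calls within the TTL reuse the stored result
overview = cached("cpu_overview_1h", lambda: {"avg_cpu": 42.0}, ttl=300)
```

For multi-process deployments, the same pattern maps onto a shared store like Redis with `SETEX`, but an in-process dict covers a single agent loop.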
Next Steps#
- AI Agents PagerDuty Integration — Correlate Datadog alerts with PagerDuty incident management
- AI Agents Slack Integration — Route Datadog alerts and AI summaries to Slack channels
- AI Agents GitHub Integration — Correlate metric regressions with recent code deployments
- Build an AI Agent with LangChain — Complete agent framework tutorial