Integration: Datadog • Intermediate • 11 min read • Setup: 15-25 minutes

How to Integrate AI Agents with Datadog

Step-by-step guide to connecting AI agents with Datadog. Learn how to automate infrastructure monitoring, alert triage, log analysis, and SLA reporting using LangChain, n8n, and the Datadog REST API.

By AI Agents Guide Team • February 28, 2026

Table of Contents

  1. What AI Agents Can Do With Datadog Access
  2. Setting Up Datadog API Access
  3. Option 1: No-Code with n8n
  4. Infrastructure Daily Health Report Workflow
  5. Option 2: LangChain with Python
  6. Build Datadog Tools
  7. Datadog Observability Agent
  8. Rate Limits and Best Practices
  9. Next Steps

Datadog aggregates metrics, logs, traces, and events from hundreds of integrations into a single observability platform. The challenge isn't collecting the data — it's making sense of it fast enough to prevent small anomalies from becoming major outages. AI agents connected to Datadog close this gap: they can query multiple signals simultaneously, correlate metric spikes with deployment events, and surface actionable findings in plain language before a human engineer has opened a single dashboard.

For SREs, platform engineers, and infrastructure teams drowning in alert volume, Datadog AI integration transforms monitoring from passive dashboards into proactive intelligence.

What AI Agents Can Do With Datadog Access

Infrastructure Intelligence

  • Query CPU, memory, and network metrics across all hosts to identify outliers
  • Detect services approaching capacity limits before they become incidents
  • Compare infrastructure performance before and after a deployment
  • Generate a daily infrastructure health summary in plain language

Monitor and Alert Management

  • List all currently triggered monitors ranked by alert count and priority
  • Mute monitors automatically during planned maintenance windows
  • Identify monitors with consistently high false-positive rates for tuning
  • Create new monitors based on patterns observed in metric queries

Log Analysis

  • Search for specific error patterns across service logs without opening the Datadog UI
  • Count error occurrences by service over a configurable time window
  • Extract the most common error messages from a log search for root cause analysis
  • Correlate log error spikes with metric anomalies during incident investigation

Setting Up Datadog API Access

pip install datadog-api-client langchain langchain-openai python-dotenv requests

Get your keys from Datadog → Organization Settings:

  • API Keys: Create a key labeled "ai-agent"
  • Application Keys: Create a key labeled "ai-agent" with your user's scopes
export DATADOG_API_KEY="your-api-key"
export DATADOG_APP_KEY="your-application-key"
export DATADOG_SITE="datadoghq.com"  # or datadoghq.eu for EU region

Test your connection:

import os

from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.authentication_api import AuthenticationApi

configuration = Configuration()
configuration.api_key["apiKeyAuth"] = os.getenv("DATADOG_API_KEY")
configuration.api_key["appKeyAuth"] = os.getenv("DATADOG_APP_KEY")

with ApiClient(configuration) as api_client:
    api_instance = AuthenticationApi(api_client)
    response = api_instance.validate()
    print(f"Connected — API key valid: {response.valid}")

Option 1: No-Code with n8n

Infrastructure Daily Health Report Workflow

  1. Schedule Trigger: Every morning at 8am
  2. HTTP Request: Datadog Metrics API — query avg:system.cpu.user{*} and avg:system.mem.used{*} for the past 24 hours
  3. HTTP Request: Datadog Monitors API — fetch all monitors in Alert or Warn state
  4. Code node: Identify hosts with CPU > 80% or memory > 90%, count triggered monitors by service
  5. OpenAI: "Write a 5-bullet infrastructure health summary. Flag any critical capacity issues. Recommend one preventive action."
  6. Slack: Post to #infra-ops channel

For Datadog in n8n, use HTTP Request nodes with DD-API-KEY and DD-APPLICATION-KEY headers — no native node required for most Datadog operations.
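The HTTP Request node in step 2 boils down to a single GET against the Datadog v1 Metrics API. The sketch below builds that call as a prepared Python request so you can inspect exactly the URL and headers to configure on the node (use api.datadoghq.eu instead for EU-region accounts):

```python
import os
import time

import requests

# Build the same request the n8n HTTP Request node sends, without executing it.
now = int(time.time())
req = requests.Request(
    "GET",
    "https://api.datadoghq.com/api/v1/query",
    headers={
        "DD-API-KEY": os.getenv("DATADOG_API_KEY", ""),
        "DD-APPLICATION-KEY": os.getenv("DATADOG_APP_KEY", ""),
    },
    params={
        "from": now - 24 * 3600,  # past 24 hours, as in the workflow
        "to": now,
        "query": "avg:system.cpu.user{*}",
    },
).prepare()
print(req.url)  # the fully encoded query URL the node will call
```

Sending `req` with a `requests.Session` (or letting n8n execute the equivalent node) returns the JSON the Code node in step 4 parses.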


Option 2: LangChain with Python

Build Datadog Tools

import os
import time
import requests
from datetime import datetime, timedelta, timezone
from langchain.tools import tool
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.getenv("DATADOG_API_KEY")
APP_KEY = os.getenv("DATADOG_APP_KEY")
SITE = os.getenv("DATADOG_SITE", "datadoghq.com")
DD_BASE = f"https://api.{SITE}"


def dd_headers() -> dict:
    """Return Datadog authentication headers."""
    return {
        "DD-API-KEY": API_KEY,
        "DD-APPLICATION-KEY": APP_KEY,
        "Content-Type": "application/json"
    }


def dd_get(path: str, params: dict = None) -> dict:
    """Execute a Datadog API GET request."""
    resp = requests.get(f"{DD_BASE}{path}", headers=dd_headers(), params=params or {})
    resp.raise_for_status()
    return resp.json()


def dd_post(path: str, json_data: dict = None) -> dict:
    """Execute a Datadog API POST request."""
    resp = requests.post(f"{DD_BASE}{path}", headers=dd_headers(), json=json_data or {})
    resp.raise_for_status()
    return resp.json()


@tool
def query_metrics(metric_query: str, hours: int = 1) -> str:
    """
    Query Datadog metrics using Datadog query language.
    metric_query: e.g., 'avg:system.cpu.user{*}' or 'sum:nginx.net.request_per_s{env:prod}'
    hours: lookback window in hours (default 1).
    """
    now = int(time.time())
    start = now - (hours * 3600)

    data = dd_get("/api/v1/query", {
        "from": start,
        "to": now,
        "query": metric_query
    })

    series = data.get("series", [])
    if not series:
        return f"No data returned for query: {metric_query}"

    lines = [f"Metric query: {metric_query} (last {hours}h)"]
    for s in series[:5]:  # Show up to 5 series
        scope = s.get("scope", "global")
        points = s.get("pointlist", [])
        if points:
            values = [p[1] for p in points if p[1] is not None]
            if values:
                avg = sum(values) / len(values)
                max_val = max(values)
                min_val = min(values)
                lines.append(f"  {scope}: avg={avg:.2f}, max={max_val:.2f}, min={min_val:.2f}")
    return "\n".join(lines)


@tool
def get_triggered_monitors(priority: str = None) -> str:
    """
    List monitors currently in Alert or Warn state.
    priority: filter by 'alert' or 'warn' (optional, returns both if not specified).
    """
    params = {"group_states": "alert,warn", "page_size": 50}
    if priority:
        params["group_states"] = priority.lower()

    data = dd_get("/api/v1/monitor", params=params)
    monitors = data if isinstance(data, list) else data.get("monitors", [])

    # Filter to only those with triggered state
    triggered = [m for m in monitors if m.get("overall_state") in ("Alert", "Warn", "No Data")]

    if not triggered:
        return "No monitors currently in Alert or Warn state"

    lines = [f"Triggered monitors ({len(triggered)}):"]
    for mon in triggered[:20]:
        name = mon.get("name", "Unnamed")[:70]
        state = mon.get("overall_state", "Unknown")
        mon_id = mon.get("id")
        query = mon.get("query", "")[:80]
        lines.append(f"  [{state}] {name}\n    ID: {mon_id} | Query: {query}")
    return "\n".join(lines)


@tool
def mute_monitor(monitor_id: int, duration_hours: int = 1, message: str = "") -> str:
    """
    Mute a Datadog monitor for a specified duration (e.g., during planned maintenance).
    monitor_id: the numeric monitor ID.
    duration_hours: how many hours to mute (default 1).
    """
    end_time = int(time.time()) + (duration_hours * 3600)
    payload = {"end": end_time}
    if message:
        payload["message"] = message

    dd_post(f"/api/v1/monitor/{monitor_id}/mute", payload)
    return (f"Monitor {monitor_id} muted for {duration_hours} hour(s). "
            f"Auto-unmutes at {datetime.fromtimestamp(end_time, tz=timezone.utc).strftime('%Y-%m-%d %H:%M UTC')}")


@tool
def search_logs(query: str, hours: int = 1, limit: int = 50) -> str:
    """
    Search Datadog logs for a query pattern.
    query: Datadog log search query (e.g., 'service:api status:error').
    hours: lookback window in hours.
    """
    now = datetime.now(timezone.utc)
    start = now - timedelta(hours=hours)

    payload = {
        "filter": {
            "query": query,
            "from": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "to": now.strftime("%Y-%m-%dT%H:%M:%SZ")
        },
        "page": {"limit": limit},
        "sort": "timestamp"
    }

    data = dd_post("/api/v2/logs/events/search", payload)
    logs = data.get("data", [])

    if not logs:
        return f"No logs found for query: '{query}' in the last {hours}h"

    lines = [f"Log search: '{query}' — {len(logs)} results (last {hours}h):"]

    # Count messages by frequency
    message_counts = {}
    for log in logs:
        attrs = log.get("attributes", {})
        msg = attrs.get("message", "")[:100]
        message_counts[msg] = message_counts.get(msg, 0) + 1

    lines.append("\nTop error patterns:")
    for msg, count in sorted(message_counts.items(), key=lambda x: x[1], reverse=True)[:5]:
        lines.append(f"  ({count}x) {msg}")

    return "\n".join(lines)


@tool
def get_service_summary(service_name: str, hours: int = 24) -> str:
    """
    Get APM performance summary for a service including request rate, error rate, and latency.
    service_name: the Datadog service name (e.g., 'web-api', 'checkout-service').
    """
    now = int(time.time())
    start = now - (hours * 3600)
    scope = f"service:{service_name}"

    metrics = {
        "Request rate (req/s)": f"sum:trace.web.request.hits{{{scope}}}.as_rate()",
        "Error rate (errors/s)": f"sum:trace.web.request.errors{{{scope}}}.as_rate()",
        "P95 latency (ms)": f"p95:trace.web.request.duration{{{scope}}}"
    }

    lines = [f"Service summary: {service_name} (last {hours}h):"]
    for label, query in metrics.items():
        try:
            data = dd_get("/api/v1/query", {"from": start, "to": now, "query": query})
            series = data.get("series", [])
            if series:
                points = [p[1] for p in series[0].get("pointlist", []) if p[1] is not None]
                if points:
                    avg = sum(points) / len(points)
                    lines.append(f"  {label}: {avg:.2f}")
                else:
                    lines.append(f"  {label}: No data")
        except Exception as e:
            lines.append(f"  {label}: Error querying ({str(e)[:50]})")

    return "\n".join(lines)


@tool
def get_infrastructure_overview(hours: int = 1) -> str:
    """Get a high-level overview of infrastructure health: CPU, memory, and disk across all hosts."""
    now = int(time.time())
    start = now - (hours * 3600)

    infra_queries = {
        "Avg CPU %": "avg:system.cpu.user{*}",
        # system.mem.pct_usable is the usable fraction (0-1), so scale to percent
        "Avg memory usable %": "avg:system.mem.pct_usable{*}*100",
        "Hosts reporting": "avg:system.cpu.user{*} by {host}"
    }

    lines = [f"Infrastructure overview (last {hours}h):"]
    for label, query in infra_queries.items():
        try:
            data = dd_get("/api/v1/query", {"from": start, "to": now, "query": query})
            series = data.get("series", [])
            if label == "Hosts reporting":
                lines.append(f"  Active hosts: {len(series)}")
            elif series:
                points = [p[1] for p in series[0].get("pointlist", []) if p[1] is not None]
                if points:
                    avg = sum(points) / len(points)
                    lines.append(f"  {label}: {avg:.1f}%")
        except Exception:
            pass

    return "\n".join(lines)

Datadog Observability Agent

from langchain_openai import ChatOpenAI
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [query_metrics, get_triggered_monitors, mute_monitor,
         search_logs, get_service_summary, get_infrastructure_overview]

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an infrastructure observability assistant with access to Datadog.

When investigating issues:
1. Start with triggered monitors to understand the current alert state
2. Query specific metrics to quantify the problem scope (affected hosts, error rates, latency)
3. Search logs to find error messages and patterns that explain the root cause
4. Correlate metric and log signals to build a complete picture
5. Muting monitors requires explicit confirmation of monitor ID — never mute without verifying

For metric queries, use Datadog query syntax:
- avg:metric_name{scope} — average across matching hosts
- sum:metric_name{*}.as_rate() — rate of change
- p95:metric_name{env:prod} — 95th percentile latency"""),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True, max_iterations=8)

# Infrastructure investigation
result = executor.invoke({
    "input": "Check the current infrastructure health: any triggered monitors, overall CPU and memory usage, and search for any ERROR logs in the api service from the last hour."
})
print(result["output"])

Rate Limits and Best Practices

Datadog API limit        Value
Metrics query rate       400 req/hour
Log search rate          300 req/hour
Monitor list rate        600 req/hour
Max metrics per query    300 series
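When a burst of agent tool calls exhausts one of these budgets, Datadog responds with HTTP 429 and X-RateLimit-* response headers. A small helper can turn the reset header into a retry delay; this is an illustrative sketch, so verify the header semantics against your own 429 responses:

```python
def backoff_seconds(headers: dict, default: float = 30.0) -> float:
    """Seconds to wait before retrying, derived from a 429 response's headers.

    X-RateLimit-Reset reports the time until the current rate-limit window
    resets; fall back to a fixed default when the header is absent or malformed.
    """
    reset = headers.get("X-RateLimit-Reset")
    try:
        return max(float(reset), 1.0)
    except (TypeError, ValueError):
        return default

print(backoff_seconds({"X-RateLimit-Reset": "12"}))  # → 12.0
print(backoff_seconds({}))                           # → 30.0
```

Calling this from the `dd_get`/`dd_post` wrappers before a retry keeps the agent inside its hourly budget instead of failing mid-investigation.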

Best practices:

  • Scope metric queries: Always filter with {host:name} or {service:name} tags rather than {*} on large infrastructure — wildcard queries can return hundreds of series and slow responses
  • Use as_rate() for counters: Counter metrics like request counts need .as_rate() in the query to get per-second rates instead of cumulative totals
  • Cache metric results for dashboards: For recurring summary queries (every 5 minutes), cache results client-side rather than re-querying Datadog on each agent invocation
  • Tag your mutes with messages: Always include a message when muting monitors explaining why — the message field appears in the audit trail and helps teammates understand muted alerts
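The caching advice above can be as small as a TTL dictionary wrapped around any of the query tools. In this sketch, `fetch` stands in for any Datadog query call; the names are illustrative, not part of the Datadog client:

```python
import time

_cache: dict = {}  # key -> (fetched_at, result)

def cached_query(key: str, fetch, ttl_seconds: int = 300):
    """Return a cached result if younger than ttl_seconds, else re-fetch."""
    now = time.time()
    hit = _cache.get(key)
    if hit and now - hit[0] < ttl_seconds:
        return hit[1]
    result = fetch()
    _cache[key] = (now, result)
    return result

# With a fake fetcher, the second call within the TTL never hits the API.
calls = []
def fake_fetch():
    calls.append(1)
    return {"series": []}

cached_query("cpu_summary", fake_fetch)
cached_query("cpu_summary", fake_fetch)
print(len(calls))  # → 1
```

For a summary the agent regenerates every five minutes, this trades at most five minutes of staleness for a large cut in requests against the 400 req/hour metrics budget.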

Next Steps

  • AI Agents PagerDuty Integration — Correlate Datadog alerts with PagerDuty incident management
  • AI Agents Slack Integration — Route Datadog alerts and AI summaries to Slack channels
  • AI Agents GitHub Integration — Correlate metric regressions with recent code deployments
  • Build an AI Agent with LangChain — Complete agent framework tutorial

Related Integrations

How to Integrate AI Agents with Airtable

Step-by-step guide to connecting AI agents with Airtable. Learn how to automate record creation, data enrichment, workflow triggers, and database management using LangChain, n8n, and the Airtable REST API.

How to Integrate AI Agents with Asana

Step-by-step guide to connecting AI agents with Asana. Learn how to automate task creation, project updates, workload analysis, and deadline tracking using LangChain, n8n, and the Asana REST API.

AI Agents + Google BigQuery: Setup Guide

Step-by-step guide to connecting AI agents with Google BigQuery. Learn how to automate SQL queries, build analytics pipelines, detect anomalies, and generate business reports using LangChain, n8n, and the BigQuery Python SDK.
