Datadog aggregates metrics, logs, traces, and events from hundreds of integrations into a single observability platform. The challenge isn't collecting the data — it's making sense of it fast enough to prevent small anomalies from becoming major outages. AI agents connected to Datadog close this gap: they can query multiple signals simultaneously, correlate metric spikes with deployment events, and surface actionable findings in plain language before a human engineer has opened a single dashboard.
For SREs, platform engineers, and infrastructure teams drowning in alert volume, Datadog AI integration transforms monitoring from passive dashboards into proactive intelligence.
What AI Agents Can Do With Datadog Access#
Infrastructure Intelligence
- Query CPU, memory, and network metrics across all hosts to identify outliers
- Detect services approaching capacity limits before they become incidents
- Compare infrastructure performance before and after a deployment
- Generate a daily infrastructure health summary in plain language
Monitor and Alert Management
- List all currently triggered monitors ranked by alert count and priority
- Mute monitors automatically during planned maintenance windows
- Identify monitors with consistently high false-positive rates for tuning
- Create new monitors based on patterns observed in metric queries
Log Analysis
- Search for specific error patterns across service logs without opening the Datadog UI
- Count error occurrences by service over a configurable time window
- Extract the most common error messages from a log search for root cause analysis
- Correlate log error spikes with metric anomalies during incident investigation
Setting Up Datadog API Access#
pip install datadog-api-client langchain langchain-openai python-dotenv
Get your keys from Datadog → Organization Settings:
- API Keys: Create a key labeled "ai-agent"
- Application Keys: Create a key labeled "ai-agent" with your user's scopes
export DATADOG_API_KEY="your-api-key"
export DATADOG_APP_KEY="your-application-key"
export DATADOG_SITE="datadoghq.com" # or datadoghq.eu for EU region
Test your connection:
import os

from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.authentication_api import AuthenticationApi

configuration = Configuration()
configuration.api_key["apiKeyAuth"] = os.getenv("DATADOG_API_KEY")
configuration.api_key["appKeyAuth"] = os.getenv("DATADOG_APP_KEY")

with ApiClient(configuration) as api_client:
    api_instance = AuthenticationApi(api_client)
    response = api_instance.validate()
    print(f"Connected — API key valid: {response.valid}")
Option 1: No-Code with n8n#
Infrastructure Daily Health Report Workflow#
- Schedule Trigger: Every morning at 8am
- HTTP Request: Datadog Metrics API — query `avg:system.cpu.user{*}` and `avg:system.mem.used{*}` for the past 24 hours
- HTTP Request: Datadog Monitors API — fetch all monitors in Alert or Warn state
- Code node: Identify hosts with CPU > 80% or memory > 90%, count triggered monitors by service
- OpenAI: "Write a 5-bullet infrastructure health summary. Flag any critical capacity issues. Recommend one preventive action."
- Slack: Post to the `#infra-ops` channel
For Datadog in n8n, use HTTP Request nodes with DD-API-KEY and DD-APPLICATION-KEY headers — no native node required for most Datadog operations.
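The Code node's threshold check in step 4 can be sketched in plain Python (n8n's JavaScript equivalent follows the same shape); the host names, metric layout, and threshold defaults here are illustrative:

```python
def flag_hot_hosts(host_metrics: dict, cpu_limit: float = 80.0, mem_limit: float = 90.0) -> list:
    """Return (host, reason) pairs for hosts breaching CPU or memory thresholds."""
    flagged = []
    for host, m in host_metrics.items():
        reasons = []
        if m.get("cpu", 0) > cpu_limit:
            reasons.append(f"CPU {m['cpu']:.0f}%")
        if m.get("mem", 0) > mem_limit:
            reasons.append(f"memory {m['mem']:.0f}%")
        if reasons:
            flagged.append((host, ", ".join(reasons)))
    return flagged

# Made-up sample data standing in for the metrics fetched upstream
sample = {
    "web-1": {"cpu": 92.5, "mem": 71.0},
    "web-2": {"cpu": 45.0, "mem": 95.2},
    "db-1": {"cpu": 30.0, "mem": 60.0},
}
print(flag_hot_hosts(sample))  # web-1 and web-2 flagged, db-1 clean
```

The flagged list then feeds the OpenAI summary step as structured input.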
Option 2: LangChain with Python#
Build Datadog Tools#
import os
import time
import requests
from datetime import datetime, timedelta, timezone
from langchain.tools import tool
from dotenv import load_dotenv
load_dotenv()
API_KEY = os.getenv("DATADOG_API_KEY")
APP_KEY = os.getenv("DATADOG_APP_KEY")
SITE = os.getenv("DATADOG_SITE", "datadoghq.com")
DD_BASE = f"https://api.{SITE}"
def dd_headers() -> dict:
    """Return Datadog authentication headers."""
    return {
        "DD-API-KEY": API_KEY,
        "DD-APPLICATION-KEY": APP_KEY,
        "Content-Type": "application/json"
    }

def dd_get(path: str, params: dict = None) -> dict:
    """Execute a Datadog API GET request."""
    resp = requests.get(f"{DD_BASE}{path}", headers=dd_headers(), params=params or {})
    resp.raise_for_status()
    return resp.json()

def dd_post(path: str, json_data: dict = None) -> dict:
    """Execute a Datadog API POST request."""
    resp = requests.post(f"{DD_BASE}{path}", headers=dd_headers(), json=json_data or {})
    resp.raise_for_status()
    return resp.json()
@tool
def query_metrics(metric_query: str, hours: int = 1) -> str:
    """
    Query Datadog metrics using Datadog query language.
    metric_query: e.g., 'avg:system.cpu.user{*}' or 'sum:nginx.net.request_per_s{env:prod}'
    hours: lookback window in hours (default 1).
    """
    now = int(time.time())
    start = now - (hours * 3600)
    data = dd_get("/api/v1/query", {
        "from": start,
        "to": now,
        "query": metric_query
    })
    series = data.get("series", [])
    if not series:
        return f"No data returned for query: {metric_query}"
    lines = [f"Metric query: {metric_query} (last {hours}h)"]
    for s in series[:5]:  # Show up to 5 series
        scope = s.get("scope", "global")
        points = s.get("pointlist", [])
        if points:
            values = [p[1] for p in points if p[1] is not None]
            if values:
                avg = sum(values) / len(values)
                max_val = max(values)
                min_val = min(values)
                lines.append(f"  {scope}: avg={avg:.2f}, max={max_val:.2f}, min={min_val:.2f}")
    return "\n".join(lines)
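The pointlist parsing above reduces each series to avg/max/min while skipping null gaps. The same reduction on a hypothetical pointlist (Datadog returns `[[timestamp_ms, value], ...]`, where value may be `None`):

```python
# Hypothetical series data with one null gap
pointlist = [[1700000000000, 10.0], [1700000060000, None], [1700000120000, 30.0]]

# Drop None values before aggregating, exactly as the tool does
values = [p[1] for p in pointlist if p[1] is not None]
summary = {
    "avg": sum(values) / len(values),
    "max": max(values),
    "min": min(values),
}
print(summary)  # {'avg': 20.0, 'max': 30.0, 'min': 10.0}
```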
@tool
def get_triggered_monitors(priority: str = None) -> str:
    """
    List monitors currently in Alert or Warn state.
    priority: filter by 'alert' or 'warn' (optional, returns both if not specified).
    """
    params = {"group_states": "alert,warn", "page_size": 50}
    if priority:
        params["group_states"] = priority.lower()
    data = dd_get("/api/v1/monitor", params=params)
    monitors = data if isinstance(data, list) else data.get("monitors", [])
    # Keep only monitors whose overall state is actually triggered
    triggered = [m for m in monitors if m.get("overall_state") in ("Alert", "Warn", "No Data")]
    if not triggered:
        return "No monitors currently in Alert or Warn state"
    lines = [f"Triggered monitors ({len(triggered)}):"]
    for mon in triggered[:20]:
        name = mon.get("name", "Unnamed")[:70]
        state = mon.get("overall_state", "Unknown")
        mon_id = mon.get("id")
        query = mon.get("query", "")[:80]
        lines.append(f"  [{state}] {name}\n    ID: {mon_id} | Query: {query}")
    return "\n".join(lines)
@tool
def mute_monitor(monitor_id: int, duration_hours: int = 1, message: str = "") -> str:
    """
    Mute a Datadog monitor for a specified duration (e.g., during planned maintenance).
    monitor_id: the numeric monitor ID.
    duration_hours: how many hours to mute (default 1).
    """
    end_time = int(time.time()) + (duration_hours * 3600)
    payload = {"end": end_time}
    if message:
        payload["message"] = message
    dd_post(f"/api/v1/monitor/{monitor_id}/mute", payload)
    # Format the end time in UTC explicitly to match the label
    unmute_at = datetime.fromtimestamp(end_time, tz=timezone.utc)
    return (f"Monitor {monitor_id} muted for {duration_hours} hour(s). "
            f"Auto-unmutes at {unmute_at.strftime('%Y-%m-%d %H:%M UTC')}")
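A detail worth flagging: `datetime.fromtimestamp` without a `tz` argument converts to the machine's local zone, so a message labeled UTC should pass `tz=timezone.utc` explicitly. A quick sketch of the epoch-to-stamp conversion:

```python
import time
from datetime import datetime, timezone

# A Datadog mute "end" is a plain Unix epoch (seconds)
end_time = int(time.time()) + 2 * 3600

# tz=timezone.utc makes the rendered stamp match its "UTC" label
stamp = datetime.fromtimestamp(end_time, tz=timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
print(stamp)
```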
@tool
def search_logs(query: str, hours: int = 1, limit: int = 50) -> str:
    """
    Search Datadog logs for a query pattern.
    query: Datadog log search query (e.g., 'service:api status:error').
    hours: lookback window in hours.
    limit: maximum number of log events to fetch (default 50).
    """
    now = datetime.now(timezone.utc)
    start = now - timedelta(hours=hours)
    payload = {
        "filter": {
            "query": query,
            "from": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "to": now.strftime("%Y-%m-%dT%H:%M:%SZ")
        },
        "page": {"limit": limit},
        "sort": "timestamp"
    }
    data = dd_post("/api/v2/logs/events/search", payload)
    logs = data.get("data", [])
    if not logs:
        return f"No logs found for query: '{query}' in the last {hours}h"
    lines = [f"Log search: '{query}' — {len(logs)} results (last {hours}h):"]
    # Count distinct messages by frequency to surface repeating patterns
    message_counts = {}
    for log in logs:
        attrs = log.get("attributes", {})
        msg = attrs.get("message", "")[:100]
        message_counts[msg] = message_counts.get(msg, 0) + 1
    lines.append("\nMost frequent messages:")
    for msg, count in sorted(message_counts.items(), key=lambda x: x[1], reverse=True)[:5]:
        lines.append(f"  ({count}x) {msg}")
    return "\n".join(lines)
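The manual dict counting above works fine; `collections.Counter` expresses the same top-N pattern more compactly. The sample messages here are made up:

```python
from collections import Counter

# Hypothetical messages pulled from a log search result
messages = [
    "timeout connecting to db",
    "timeout connecting to db",
    "null pointer in checkout",
    "timeout connecting to db",
]

# Truncate to 100 chars (matching the tool) so near-identical long
# messages collapse into one bucket, then rank by frequency
counts = Counter(msg[:100] for msg in messages)
for msg, n in counts.most_common(2):
    print(f"({n}x) {msg}")
```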
@tool
def get_service_summary(service_name: str, hours: int = 24) -> str:
    """
    Get APM performance summary for a service including request rate, error rate, and latency.
    service_name: the Datadog service name (e.g., 'web-api', 'checkout-service').
    hours: lookback window in hours (default 24).
    """
    now = int(time.time())
    start = now - (hours * 3600)
    scope = f"service:{service_name}"
    # .as_rate() converts counters to per-second values
    metrics = {
        "Request rate (req/s)": f"sum:trace.web.request.hits{{{scope}}}.as_rate()",
        "Error rate (errors/s)": f"sum:trace.web.request.errors{{{scope}}}.as_rate()",
        "P95 latency": f"p95:trace.web.request.duration{{{scope}}}"
    }
    lines = [f"Service summary: {service_name} (last {hours}h):"]
    for label, query in metrics.items():
        try:
            data = dd_get("/api/v1/query", {"from": start, "to": now, "query": query})
            series = data.get("series", [])
            if series:
                points = [p[1] for p in series[0].get("pointlist", []) if p[1] is not None]
                if points:
                    avg = sum(points) / len(points)
                    lines.append(f"  {label}: {avg:.2f}")
                else:
                    lines.append(f"  {label}: No data")
        except Exception as e:
            lines.append(f"  {label}: Error querying ({str(e)[:50]})")
    return "\n".join(lines)
@tool
def get_infrastructure_overview(hours: int = 1) -> str:
    """Get a high-level overview of infrastructure health: CPU, memory, and host count across all hosts."""
    now = int(time.time())
    start = now - (hours * 3600)
    infra_queries = {
        "Avg CPU %": "avg:system.cpu.user{*}",
        # system.mem.pct_usable is a 0-1 fraction, so scale it to a percentage
        "Avg memory usable %": "avg:system.mem.pct_usable{*}*100",
        "Hosts reporting": "avg:system.cpu.user{*} by {host}"
    }
    lines = [f"Infrastructure overview (last {hours}h):"]
    for label, query in infra_queries.items():
        try:
            data = dd_get("/api/v1/query", {"from": start, "to": now, "query": query})
            series = data.get("series", [])
            if label == "Hosts reporting":
                lines.append(f"  Active hosts: {len(series)}")
            elif series:
                points = [p[1] for p in series[0].get("pointlist", []) if p[1] is not None]
                if points:
                    avg = sum(points) / len(points)
                    lines.append(f"  {label}: {avg:.1f}%")
        except Exception:
            # Surface the failure instead of silently dropping the row
            lines.append(f"  {label}: query failed")
    return "\n".join(lines)
Datadog Observability Agent#
from langchain_openai import ChatOpenAI
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = [query_metrics, get_triggered_monitors, mute_monitor,
         search_logs, get_service_summary, get_infrastructure_overview]

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an infrastructure observability assistant with access to Datadog.

When investigating issues:
1. Start with triggered monitors to understand the current alert state
2. Query specific metrics to quantify the problem scope (affected hosts, error rates, latency)
3. Search logs to find error messages and patterns that explain the root cause
4. Correlate metric and log signals to build a complete picture
5. Muting monitors requires explicit confirmation of monitor ID — never mute without verifying

For metric queries, use Datadog query syntax:
- avg:metric_name{scope} — average across matching hosts
- sum:metric_name{*}.as_rate() — per-second rate for counter metrics
- p95:metric_name{env:prod} — 95th percentile latency"""),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True, max_iterations=8)

# Infrastructure investigation
result = executor.invoke({
    "input": "Check the current infrastructure health: any triggered monitors, overall CPU and memory usage, and search for any ERROR logs in the api service from the last hour."
})
print(result["output"])
Rate Limits and Best Practices#
| Datadog API limit | Value |
|---|---|
| Metrics query rate | 400 req/hour |
| Log search rate | 300 req/hour |
| Monitor list rate | 600 req/hour |
| Max metrics per query | 300 series |
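When a limit is hit, Datadog returns HTTP 429 along with rate-limit response headers. A minimal retry sketch, assuming the documented `X-RateLimit-Reset` header (seconds until the limit window resets) and falling back to exponential backoff without it:

```python
import time
import requests

def retry_wait_seconds(headers, attempt: int) -> int:
    """Seconds to wait after a 429: prefer Datadog's X-RateLimit-Reset
    header, else fall back to exponential backoff (1s, 2s, 4s, ...)."""
    return int(headers.get("X-RateLimit-Reset", 2 ** attempt))

def dd_get_with_backoff(url: str, headers: dict, params: dict = None, max_retries: int = 3) -> dict:
    """GET that retries on HTTP 429 instead of failing outright."""
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, params=params or {})
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        time.sleep(retry_wait_seconds(resp.headers, attempt))
    resp.raise_for_status()
```

Swapping this in for the plain `dd_get` helper keeps long agent sessions from dying on a single rate-limit hit.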
Best practices:
- Scope metric queries: Always filter with `{host:name}` or `{service:name}` tags rather than `{*}` on large infrastructure — wildcard queries can return hundreds of series and slow responses
- Use `.as_rate()` for counters: Counter metrics like request counts need `.as_rate()` in the query to get per-second rates instead of cumulative totals
- Cache metric results for dashboards: For recurring summary queries (every 5 minutes), cache results client-side rather than re-querying Datadog on each agent invocation
- Tag your mutes with messages: Always include a message when muting monitors explaining why — the `message` field appears in the audit trail and helps teammates understand muted alerts
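Client-side caching for recurring summaries can be as simple as a TTL-keyed dict; a minimal sketch, where the fetch lambda is a stand-in for a real Datadog call such as `query_metrics`:

```python
import time

_metric_cache: dict = {}

def cached(key: str, fetch, ttl: float = 300.0):
    """Return the cached value for key if it is younger than ttl seconds,
    otherwise call fetch() and store the fresh result."""
    now = time.time()
    if key in _metric_cache:
        value, stored_at = _metric_cache[key]
        if now - stored_at < ttl:
            return value
    value = fetch()
    _metric_cache[key] = (value, now)
    return value

# Usage: repeated calls within the TTL reuse the stored result
overview = cached("cpu_overview_1h", lambda: {"avg_cpu": 42.0}, ttl=300)
```

For multi-process deployments, the same pattern maps onto a shared store like Redis with `SETEX`, but an in-process dict covers a single agent loop.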
Next Steps#
- AI Agents PagerDuty Integration — Correlate Datadog alerts with PagerDuty incident management
- AI Agents Slack Integration — Route Datadog alerts and AI summaries to Slack channels
- AI Agents GitHub Integration — Correlate metric regressions with recent code deployments
- Build an AI Agent with LangChain — Complete agent framework tutorial