
AWS DevOps Agent: Autonomous Incident Response Walkthrough

If you are on call and tired of waking up to a paging storm with no context, the AWS DevOps Agent is built for exactly that pain. It is an autonomous SRE agent that triages alerts, correlates telemetry across your stack, and publishes a root cause before a human even opens a laptop. This walkthrough is for backend and platform engineers who already run production on AWS and want to wire the agent into a real incident pipeline rather than read marketing copy. By the end, you will know how to create an Agent Space, connect your observability and code sources, route alerts in, and review the mitigation plan the agent generates.

The AWS DevOps Agent reached general availability in 2026 and is built on Amazon Bedrock AgentCore. Importantly, it investigates and recommends, but it does not push changes to your infrastructure on its own. Human approval gates every mitigation. That distinction shapes how you should adopt it, so we will return to it throughout.

What Is AWS DevOps Agent?

AWS DevOps Agent is an autonomous operations agent that investigates production incidents end to end. When an alert fires, it triages the event, correlates metrics, logs, traces, and recent deployments, identifies a probable root cause, and drafts a phased mitigation plan. It runs investigations without human prompting, yet it stops short of executing any change until a person approves it.

The agent is not a chatbot bolted onto CloudWatch. Instead, it orchestrates a multi-step investigation across the tools your team already uses. Built-in integrations cover CloudWatch, Datadog, Dynatrace, New Relic, Splunk, Grafana, GitHub, GitLab, and Azure DevOps. For anything not natively supported, it extends through the Model Context Protocol, so it can reach on-premises systems or another cloud. If you are new to that protocol, our guide to the Model Context Protocol explains how these connections work under the hood.

How AWS DevOps Agent Handles Incident Response

The agent moves through four distinct stages. Understanding them helps you decide where to plug in your own data and where to keep a human in the loop.

  1. Triage — The agent ingests the incoming event and decides within seconds whether to start a new investigation, link it to an active one, or skip it.
  2. Investigation — It builds an investigation plan, then queries your connected sources to gather findings.
  3. Root cause analysis — It correlates the findings, weighs deployment timing against the symptom window, and publishes a probable cause.
  4. Mitigation planning — On request, it generates a structured, multi-phase remediation plan that a human reviews and approves.

The triage stage matters more than people expect. When a new incident arrives, the agent compares it against active investigations inside a look-back window of roughly 20 minutes. It weighs component overlap, region, and timing, then makes one of three calls: Linked, Skipped, or Proceed. As a result, a cascading outage that fires fifteen alarms becomes one investigation instead of fifteen, which is the difference between signal and noise during a real incident.
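To make the three outcomes concrete, here is a minimal sketch of how such a decision could be modeled. It is an illustration only, not the agent's actual algorithm: the function name `triage`, the event fields (`firedAt`, `region`, `components`), and the skip-rule callbacks are all assumptions, and timestamps are epoch milliseconds.

```javascript
// Hypothetical triage sketch: none of these names come from the AWS DevOps Agent API.
const LOOK_BACK_MS = 20 * 60 * 1000; // the roughly 20-minute window described above

function triage(incoming, activeInvestigations, skipRules = []) {
  // Skip criteria run first so routine alarms never open an investigation.
  if (skipRules.some((rule) => rule(incoming))) {
    return { decision: "Skipped" };
  }

  // Link when an active investigation shares timing, region, and a component.
  const match = activeInvestigations.find((inv) => {
    const withinWindow =
      Math.abs(incoming.firedAt - inv.startedAt) <= LOOK_BACK_MS;
    const sameRegion = incoming.region === inv.region;
    const sharesComponent = incoming.components.some((c) =>
      inv.components.includes(c)
    );
    return withinWindow && sameRegion && sharesComponent;
  });

  return match
    ? { decision: "Linked", investigationId: match.id }
    : { decision: "Proceed" };
}
```

The real agent weighs these signals rather than requiring all three to match exactly, so treat the strict `&&` here as a simplification.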

Prerequisites

Before you start, make sure you have the following in place.

  • An AWS account with permissions to create an Agent Space and IAM roles
  • At least one observability source already collecting data (CloudWatch is the simplest starting point)
  • A code source such as GitHub or GitLab so the agent can correlate deployments with incidents
  • An alerting source you want to trigger investigations (CloudWatch alarms, PagerDuty, Grafana, or a ServiceNow queue)
  • Familiarity with IAM roles and least-privilege policies, covered in our practical breakdown of AWS IAM roles and policies

You do not need to instrument anything new to get value. If you already ship logs and metrics to CloudWatch, you can run a first investigation in under an hour.

Step 1: Create an Agent Space

An Agent Space is the boundary that defines which tools, data sources, and credentials your agent can reach. Think of it as the scope of one application or one team. Give it a descriptive name that reflects that scope, because you will likely run several, one per service domain.

You can create an Agent Space through the AWS console or the CLI. The console flow walks you through naming the space, choosing a response language, and selecting the IAM role the agent assumes during investigations. Keep that role tight. The agent needs read access to your telemetry and deployment history, not write access to production.
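One way to keep the role tight is to lint its policy before you attach it. The sketch below is an assumption-heavy illustration: the action list is a guess at what a CloudWatch-based investigation might read, not the agent's documented permission set, and `findWriteActions` is a hypothetical helper that relies on a simple verb-prefix heuristic.

```javascript
// Illustrative read-only policy. The action list is an assumption about what an
// investigation needs, not the agent's documented permission set.
const investigationPolicy = {
  Version: "2012-10-17",
  Statement: [
    {
      Effect: "Allow",
      Action: [
        "cloudwatch:GetMetricData",
        "cloudwatch:DescribeAlarms",
        "logs:FilterLogEvents",
        "logs:GetLogEvents",
        "xray:GetTraceSummaries",
      ],
      Resource: "*",
    },
  ],
};

const READ_VERBS = /^(Get|List|Describe|Filter|BatchGet)/;

// Returns every allowed action whose verb does not look read-only.
function findWriteActions(policy) {
  return policy.Statement
    .filter((stmt) => stmt.Effect === "Allow")
    .flatMap((stmt) => [].concat(stmt.Action))
    .filter((action) => {
      const verb = action.split(":")[1] ?? "";
      // Wildcards and non-read verbs are both flagged.
      return verb === "*" || !READ_VERBS.test(verb);
    });
}
```

A prefix check is deliberately crude: some `Get` actions return sensitive data, so use it as a tripwire in review, not as a substitute for reading the policy.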

A common early mistake is creating one giant Agent Space for the whole organization. That dilutes the agent’s context and makes its correlation noisier. Instead, scope each space to a bounded set of services that share alarms and deployment pipelines.

Step 2: Connect Your Observability and Code Sources

With the space created, register the data sources the agent will query. Native integrations register through the console under the Agent Space capabilities. For each one, you complete an OAuth flow or supply an API token, then grant read-only access.

GitHub registration happens in two parts. First, you install the AWS DevOps Agent GitHub app at the account level through an OAuth flow. Next, you connect specific repositories to an individual Agent Space under the Pipeline section. The app requests read-only access and receives deployment events, which is what lets the agent line up a bad deploy against a spike in errors.

For sources that are not natively supported, you supply an MCP server configuration. The agent reads from that endpoint the same way it reads from a built-in integration. A typical MCP entry looks like this:

{
  "mcpServers": {
    "splunk": {
      "url": "https://splunk.internal.example.com/mcp",
      "auth": {
        "type": "bearer",
        "token": "${SPLUNK_MCP_TOKEN}"
      }
    }
  }
}

Store the token in a secrets manager rather than inline. The agent only needs a role with read capabilities on the target system, so when you mint the Splunk token, assign a role that exposes search but grants no administrative permissions.
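If you render the configuration yourself before registering it, a small helper can substitute `${NAME}` placeholders from the environment and fail loudly when a value is missing. This is a sketch for your own tooling; whether the agent interpolates placeholders on its side is not something this walkthrough confirms, and `resolvePlaceholders` is a hypothetical name.

```javascript
// Recursively replaces ${NAME} placeholders in strings with values from `env`.
// Throws when a variable is unset or empty so a blank token never ships.
function resolvePlaceholders(value, env = process.env) {
  if (typeof value === "string") {
    return value.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_, name) => {
      if (env[name] === undefined || env[name] === "") {
        throw new Error(`Missing environment variable: ${name}`);
      }
      return env[name];
    });
  }
  if (Array.isArray(value)) {
    return value.map((item) => resolvePlaceholders(item, env));
  }
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([key, item]) => [
        key,
        resolvePlaceholders(item, env),
      ])
    );
  }
  return value; // numbers, booleans, and null pass through unchanged
}
```

Populate the environment from your secrets manager at deploy time, so the token itself never lands in the repository.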

Step 3: Wire Up Event Sources

Investigations start in three ways: built-in integrations, webhooks, or manual triggers. For automation, you want events to flow in without a human clicking anything.

Built-in integrations are the cleanest path. Connect a ticketing system such as ServiceNow, and the agent automatically opens an investigation when a ticket is created. It then writes its findings, root cause, and mitigation plan back into the originating ticket, so your responders never leave their existing tool.

Webhooks cover everything else. PagerDuty, Grafana, Datadog, and Dynatrace can all POST to the agent’s webhook endpoint. The webhook expects a defined JSON schema and authenticates with HMAC-SHA256, so you sign each payload. Here is a compact signer in Node.js; add timeouts and retries to suit your own reliability requirements:

import { createHmac } from "node:crypto";

// Signs and posts an incident event to the AWS DevOps Agent webhook.
// HMAC-SHA256 over "timestamp:payload" prevents replay and tampering.
export async function sendIncident(webhookUrl, secret, incident) {
  const payload = {
    eventType: "incident",
    incidentId: incident.id,
    action: "created",
    priority: incident.priority, // CRITICAL | HIGH | MEDIUM | LOW | MINIMAL
    title: incident.title,
    description: incident.description,
    service: incident.service,
  };

  const timestamp = new Date().toISOString();
  const body = JSON.stringify(payload);
  const signature = createHmac("sha256", secret)
    .update(`${timestamp}:${body}`, "utf8")
    .digest("base64");

  const response = await fetch(webhookUrl, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-amzn-event-timestamp": timestamp,
      "x-amzn-event-signature": signature,
    },
    body,
  });

  if (!response.ok) {
    throw new Error(`Webhook rejected: ${response.status} ${response.statusText}`);
  }
}

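Notice the signature covers both the timestamp and the body. Because the timestamp is signed, a receiver can reject requests that are too old, and an attacker cannot attach a fresh timestamp to a captured payload without invalidating the signature. Do not drop the timestamp to save a line.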
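To see what that signature buys you, it helps to look at the receiving side. The sketch below is a hypothetical verifier you could use in a local mock endpoint to test the signer. It is not the agent's own implementation, and the five-minute tolerance in `MAX_SKEW_MS` is an assumed value rather than a documented one.

```javascript
import { createHmac, timingSafeEqual } from "node:crypto";

const MAX_SKEW_MS = 5 * 60 * 1000; // assumed tolerance, not a documented value

// Returns true only when the timestamp is fresh and the signature matches.
function verifyIncident(secret, timestamp, body, signature, now = Date.now()) {
  // Reject stale, future-dated, or unparseable timestamps before any crypto.
  const sentAt = Date.parse(timestamp);
  if (Number.isNaN(sentAt) || Math.abs(now - sentAt) > MAX_SKEW_MS) {
    return false;
  }

  const expected = createHmac("sha256", secret)
    .update(`${timestamp}:${body}`, "utf8")
    .digest();
  const received = Buffer.from(signature, "base64");

  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return (
    received.length === expected.length && timingSafeEqual(received, expected)
  );
}
```

The freshness check is what turns the signed timestamp into replay protection: a captured request still carries a valid signature, but it stops being accepted once it falls outside the window.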

To route CloudWatch alarms in, bridge them through EventBridge and a small Lambda. The flow is CloudWatch Alarm → EventBridge → Lambda → DevOps Agent webhook. The Lambda translates the alarm into the incident schema and calls the signer above:

import { sendIncident } from "./sendIncident.mjs";

const WEBHOOK_URL = process.env.DEVOPS_AGENT_WEBHOOK_URL;
const WEBHOOK_SECRET = process.env.DEVOPS_AGENT_WEBHOOK_SECRET;

// EventBridge delivers CloudWatch alarm state changes here.
export const handler = async (event) => {
  const alarm = event.detail;

  // Only escalate alarms that actually entered the ALARM state.
  if (alarm.state?.value !== "ALARM") {
    return { skipped: true };
  }

  await sendIncident(WEBHOOK_URL, WEBHOOK_SECRET, {
    id: event.id,
    priority: "HIGH",
    title: `${alarm.alarmName} entered ALARM`,
    description: alarm.state.reason,
    service: alarm.configuration?.metrics?.[0]?.metricStat?.metric?.namespace,
  });

  return { forwarded: true };
};

Pull the webhook URL and secret from environment variables backed by Secrets Manager; never hardcode them. If you are still standing up your alarms, our walkthrough of CloudWatch logging, metrics, and alarms covers the alarm setup this Lambda depends on.
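The EventBridge rule in front of the Lambda can filter to ALARM transitions so the function is not invoked for every state change. The pattern below assumes the standard CloudWatch alarm state change event shape. The names `alarmRulePattern` and `matchesPattern` are illustrative, and the matcher is a deliberately small stand-in for local tests rather than a reimplementation of EventBridge.

```javascript
// Event pattern for the rule: only CloudWatch alarms that entered ALARM.
const alarmRulePattern = {
  source: ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  detail: { state: { value: ["ALARM"] } },
};

// Simplified matcher for local tests: supports only exact-value arrays and
// nested objects, a small subset of real EventBridge pattern matching.
function matchesPattern(pattern, event) {
  return Object.entries(pattern).every(([key, expected]) => {
    const actual = event?.[key];
    return Array.isArray(expected)
      ? expected.includes(actual)
      : actual !== null &&
          typeof actual === "object" &&
          matchesPattern(expected, actual);
  });
}
```

Serialize `alarmRulePattern` with `JSON.stringify` when you create the rule. Keeping the state check in the Lambda as well is cheap insurance if the rule is ever loosened.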

Step 4: Encode Runbooks as Skills

The agent gets sharper when you teach it your environment. Skills are self-contained directories of Markdown instructions, and optionally PDFs, images, or data files, that give the agent specialized knowledge. You author them in the Operator console under the Skills section.

A skill is where you encode the runbook your senior engineers carry in their heads. For example, a skill might tell the agent that error spikes on the checkout service almost always trace back to the payment provider’s rate limits, and that it should check those headers first. Skills also drive triage: you can write skip criteria so routine, self-healing alarms never spin up an investigation.

Keep skills focused. One skill per failure domain reads better to the agent than a single sprawling document, much the same way small, single-purpose functions are easier to reason about than a monolith.

Step 5: Run Your First Investigation

You do not need a live incident to test the setup. Open the Incident Response tab of your Agent Space web app and start a manual investigation. You can type free-form text such as “Investigate elevated 5xx errors on the orders API over the last hour,” or pick a preconfigured starting point like Latest alarm, High CPU usage, or Error rate spike.

When you start, the agent asks for a few details to focus its work:

  • Investigation details — the description, which you can refine
  • Investigation starting point — a specific alarm, metric, or log snippet to anchor on
  • Date and time of incident — defaults to now in UTC
  • Priority — defaults to Medium

You can also kick off investigations programmatically. The CLI exposes aws devopsagent create-backlog-task, which lets your incident management system open investigations without anyone touching the console. That is the hook you use when you want your existing tooling to stay the front door.

Step 6: Review the Root Cause and Mitigation Plan

Once the investigation runs, the agent populates a Root Cause tab with its findings: the metrics and logs it analyzed, the deployments it reviewed, and the temporal relationships it found. This is the artifact your on-call engineer reads first, and it is the whole point of the system. Instead of starting from a blank dashboard at 3 a.m., they start from a documented hypothesis.

When you are ready, choose “Generate mitigation plan.” The plan arrives in four phases:

  1. Prepare — setup steps before any change
  2. Pre-Validate — checks that confirm the system is in the expected state
  3. Apply — the actual remediation actions
  4. Post-Validate — success criteria that confirm the fix worked

Each phase lists concrete actions, often including commands to update infrastructure-as-code or configuration. For code-level fixes, the agent produces agent-ready specs designed to hand off to a coding agent such as Kiro, so the diagnosis-to-fix path stays structured rather than copy-pasted from memory.

Step 7: Keep a Human in the Loop

This is the rule that should anchor your rollout: the AWS DevOps Agent never executes changes on its own. It investigates and recommends, and a person must approve before any mitigation runs. You approve through the web app, or programmatically with aws devopsagent update-backlog-task if you are integrating with an external approval workflow.
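If you execute approved steps through your own automation, mirror the same gate there. The following sketch is hypothetical and does not call any AWS DevOps Agent API: `runApprovedPlan`, the `approval` object, and the `runStep` callback are assumed names, and the plan is modeled as a plain object keyed by the four phase names.

```javascript
const PHASES = ["Prepare", "Pre-Validate", "Apply", "Post-Validate"];

// Hypothetical gate for your own automation; it is not an AWS DevOps Agent API.
// `runStep(phase, step)` returns true on success and false on failure.
function runApprovedPlan(plan, approval, runStep) {
  if (!approval?.approvedBy) {
    throw new Error("Mitigation plan has not been approved by a person");
  }

  const log = [];
  for (const phase of PHASES) {
    for (const step of plan[phase] ?? []) {
      // Stop at the first failing step so later phases never run on a bad state.
      const ok = runStep(phase, step);
      log.push({ phase, step, ok });
      if (!ok) {
        return { completed: false, log };
      }
    }
  }
  return { completed: true, log, approvedBy: approval.approvedBy };
}
```

Recording who approved alongside the step log gives you an audit trail for the post-incident review.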

If the agent gets stuck or you want expert eyes, you can escalate to AWS Support directly from the Agent Space. The agent passes its full investigation log to the support engineer, so you skip the usual back-and-forth of explaining what you have already tried. That escalation requires support:CreateCase and support:DescribeCases permissions on the agent’s role, plus an eligible support plan.

A Realistic Incident Scenario

Consider a mid-sized SaaS platform running a dozen services on ECS, with a small platform team covering on call. During a routine afternoon deploy, the checkout service starts returning intermittent 503s. CloudWatch fires three alarms in quick succession: elevated 5xx, rising latency, and a target-group health check failure.

Without the agent, the on-call engineer would correlate those three alarms by hand, check the deploy timeline, and grep logs across services. With the agent, the triage stage links all three alarms into a single investigation within seconds, because they share the service, region, and timing window. The investigation then pulls the recent GitHub deployment, notices it landed two minutes before the first alarm, and inspects connection-pool metrics. It publishes a probable root cause: the new release lowered the database connection-pool ceiling, and traffic exhausted it under load.
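The deployment-timing check in this scenario is easy to picture in code. The sketch below is an illustration of the idea, not the agent's correlation logic: `suspectDeployments`, the `deployedAt` field, and the 15-minute default window are assumptions, with timestamps in epoch milliseconds.

```javascript
// Hypothetical helper: flags deployments that landed shortly before the first alarm.
function suspectDeployments(
  deployments,
  firstAlarmAt,
  windowMs = 15 * 60 * 1000
) {
  return deployments
    .map((d) => ({ ...d, leadMs: firstAlarmAt - d.deployedAt }))
    // Keep deploys that preceded the alarm and fall inside the window.
    .filter((d) => d.leadMs >= 0 && d.leadMs <= windowMs)
    .sort((a, b) => a.leadMs - b.leadMs); // closest to the alarm first
}
```

A deploy two minutes before the first alarm would come back with a `leadMs` of 120000, while one from hours earlier or one that landed after the alarm would be filtered out. Timing alone is only a hint, which is why the scenario also has the agent inspect connection-pool metrics before naming a cause.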

The engineer wakes up to a documented hypothesis and a four-phase rollback plan rather than a wall of red. AWS reports root cause accuracy around 94% for the agent, and early adopters have cited mean-time-to-resolution improvements of up to 75%. Treat those as vendor and early-adopter figures rather than guarantees, but even a fraction of that reduction changes what an incident feels like. The trade-off is real, too: the agent’s hypothesis is only as good as the telemetry and skills you feed it, and a thinly instrumented service yields a thin investigation.

When to Use AWS DevOps Agent

  • You run production workloads on AWS and already collect metrics, logs, and deployment history
  • Your team experiences alert fatigue and spends real time on manual triage and correlation
  • You want a documented root cause and a structured mitigation plan before a human engages
  • You operate across multiple clouds or on-premises systems and can connect them through MCP
  • You want to encode senior-engineer runbooks as reusable skills rather than tribal knowledge

When NOT to Use AWS DevOps Agent

  • Your services are thinly instrumented, so there is little telemetry for the agent to correlate
  • You expect fully automated remediation; the agent recommends but never executes changes itself
  • Your incident volume is low enough that manual triage is not a meaningful cost
  • Strict data-governance rules prevent connecting your observability and code sources to the agent
  • You are looking for a real-time monitoring dashboard rather than an investigation engine

Common Mistakes with AWS DevOps Agent

  • Creating one oversized Agent Space for the whole org, which dilutes context and weakens correlation
  • Granting the agent’s IAM role write access to production when it only needs read access for investigations
  • Hardcoding the webhook URL and signing secret instead of pulling them from Secrets Manager
  • Skipping skills entirely, which leaves the agent guessing about environment-specific failure patterns
  • Treating the mitigation plan as auto-approved and removing the human review gate the agent depends on

Conclusion

The AWS DevOps Agent shifts incident response from reactive scrambling toward autonomous investigation, so your engineers start from a documented root cause instead of a blank dashboard. Set it up by scoping an Agent Space, connecting your observability and code sources, routing alerts through webhooks or built-in integrations, and encoding your runbooks as skills, while keeping a human approval gate on every mitigation. Start small: wire one service’s CloudWatch alarms to a single Agent Space and run a manual investigation this week to see the quality of its root cause analysis firsthand. To go deeper on the building blocks, explore our guides on how AI agents plan and execute with tools, monitoring and logging microservices with Prometheus and Grafana, and the Amazon Q Developer CLI for AWS-native AI assistance in your terminal.
