Verifying AWS DevOps Agent — How Skills Change Investigation Quality

Introduction

In the previous article, I set up AWS DevOps Agent in a minimal configuration (single account + CloudWatch only) and investigated a stress-ng CPU spike via Chat. The agent identified the root cause (stress-ng command execution) in 4 minutes 25 seconds.

In practice, however, generic investigation isn't always enough. Teams have specific metrics to check first, and organizations have reporting format requirements. Skills, added in the GA release, let you teach the agent your organization-specific investigation procedures.

Skills are based on a subset of the agentskills.io open specification, with instructions written in Markdown in a SKILL.md file. Key features:

Agent Type targeting: Choose from Generic (all types) / On-demand / Incident Triage / Incident RCA / Incident Mitigation / Evaluation to control where Skills apply
Two creation methods: UI creation in Operator Access, or zip upload with references/ and assets/ directories
Active/Inactive toggle: Temporarily disable without deleting
No CLI support: Skill management is only available through Operator Access (no skill-related APIs exist in the CLI or SDK)
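
For the zip upload route, a package could be assembled roughly as follows. This is a minimal sketch under assumptions: the `name`/`description` frontmatter follows the agentskills.io convention, while the reference file name, the layout inside the zip, and whether DevOps Agent accepts exactly this structure are not confirmed by this article.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical working directory, named after the Skill created later in this article.
SKILL_DIR=ec2-cpu-investigation
mkdir -p "$SKILL_DIR/references" "$SKILL_DIR/assets"

# SKILL.md: YAML frontmatter (name, description) followed by Markdown instructions.
cat > "$SKILL_DIR/SKILL.md" <<'EOF'
---
name: ec2-cpu-investigation
description: Investigation procedures for EC2 CPU spike incidents.
---
# EC2 CPU Spike Investigation Procedures
(instructions go here)
EOF

# Placeholder reference file; the file name is an assumption for illustration.
printf 'Supplementary reference material goes here\n' > "$SKILL_DIR/references/notes.md"

# Create the archive only if the zip command is installed.
# Assumption: files sit at the root of the archive; verify against the upload dialog.
if command -v zip >/dev/null 2>&1; then
  (cd "$SKILL_DIR" && zip -qr "../$SKILL_DIR.zip" .)
fi
```

Keeping the package in a directory like this also makes it possible to version the Skill in Git, even though the upload itself still has to go through Operator Access.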

This article creates a Skill for EC2 CPU spike incidents and compares investigation results with and without the Skill on the same stress-ng scenario, quantifying how Skills affect the investigation process and output.

Prerequisites:

An active Agent Space from the previous article
Access to Operator Access
AWS CLI v2, EC2/CloudWatch permissions

If you only want the results, skip to Summary.

Setup

EC2 Environment

I rebuilt the same configuration from the previous article (EC2 t3.micro + stress-ng + CloudWatch alarm).

EC2 + CloudWatch alarm + stress-ng setup steps

Terminal

REGION=ap-northeast-1
 
AMI_ID=$(aws ssm get-parameters \
  --names /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64 \
  --query "Parameters[0].Value" --output text --region $REGION)
 
SG_ID=$(aws ec2 create-security-group \
  --group-name devops-agent-skills-verification \
  --description "SG for DevOps Agent Skills verification" \
  --region $REGION --query "GroupId" --output text)
 
INSTANCE_ID=$(aws ec2 run-instances \
  --image-id $AMI_ID --instance-type t3.micro \
  --security-group-ids $SG_ID \
  --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=devops-agent-skills-test}]" \
  --region $REGION --query "Instances[0].InstanceId" --output text)
 
aws ec2 wait instance-running --instance-ids $INSTANCE_ID --region $REGION
 
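# Wait for SSM Agent
# Note: the instance must be able to register with Systems Manager (for example via an
# instance profile with AmazonSSMManagedInstanceCore, or Default Host Management
# Configuration). No instance profile is attached above, so without one of these the loop never exits.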
while true; do
  STATUS=$(aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=${INSTANCE_ID}" \
    --query "InstanceInformationList[0].PingStatus" \
    --output text --region $REGION 2>/dev/null)
  [ "$STATUS" = "Online" ] && break
  echo "Waiting for SSM Agent..." && sleep 10
done
 
# CloudWatch alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "devops-agent-skills-high-cpu" \
  --metric-name CPUUtilization --namespace AWS/EC2 \
  --statistic Average --period 60 --threshold 80 \
  --comparison-operator GreaterThanThreshold --evaluation-periods 1 \
  --dimensions "Name=InstanceId,Value=${INSTANCE_ID}" \
  --region $REGION
 
# Run stress-ng (30 minutes)
aws ssm send-command \
  --instance-ids $INSTANCE_ID \
  --document-name "AWS-RunShellScript" \
  --parameters '{"commands":["sudo dnf install -y stress-ng","nohup stress-ng --cpu 2 --timeout 1800 &"]}' \
  --region $REGION

Terminal (confirm alarm fires)

while true; do
  STATE=$(aws cloudwatch describe-alarms \
    --alarm-names "devops-agent-skills-high-cpu" \
    --query "MetricAlarms[0].StateValue" --output text \
    --region $REGION)
  echo "Alarm state: $STATE"
  [ "$STATE" = "ALARM" ] && break
  sleep 60
done

Creating the Skill

From the Skills page in Operator Access, I created a Skill specialized for EC2 CPU spike incidents. To open Operator Access, go to the DevOps Agent page in the AWS Management Console, select your Agent Space, and click "Operator access." Then click "Skills" in the left menu.

The Skill contains three elements:

Investigation priority: Procedures to prioritize CPU credit analysis for T-series instances
Process-level root cause identification: Steps to check SSM command history for load-generating tools
Report format: A 3-part structure — Impact Scope → Root Cause (including intentional/unintentional determination) → Actions (immediate/short-term/long-term)
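
To see why CPU credit analysis comes first for this scenario, the credit arithmetic for a t3.micro (2 vCPUs, 10% baseline per vCPU) can be sketched as below. One CPU credit equals one vCPU at 100% for one minute. The 100% load figure is an assumption that matches `stress-ng --cpu 2` keeping both vCPUs busy.

```shell
#!/usr/bin/env bash
# CPU credit arithmetic for a burstable instance.
# One credit = one vCPU running at 100% for one minute.
VCPUS=2
BASELINE_PCT=10   # t3.micro baseline utilization per vCPU
LOAD_PCT=100      # assumed sustained load on every vCPU

# Credits earned per hour at the baseline rate
EARNED_PER_HOUR=$(( VCPUS * BASELINE_PCT * 60 / 100 ))
# Credits spent per hour at the assumed load
SPENT_PER_HOUR=$(( VCPUS * LOAD_PCT * 60 / 100 ))
# Net drain on the credit balance per hour
NET_BURN_PER_HOUR=$(( SPENT_PER_HOUR - EARNED_PER_HOUR ))

echo "earned per hour:   $EARNED_PER_HOUR"
echo "spent per hour:    $SPENT_PER_HOUR"
echo "net burn per hour: $NET_BURN_PER_HOUR"
```

At full load the instance spends credits ten times faster than it earns them, so the credit metrics change quickly and are a useful early signal.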

Full SKILL.md content (created in Operator Access)

In the Operator Access Skills page, select "Add skill" → "Create skill" and enter:

Name: ec2-cpu-investigation
Description: Investigation procedures for EC2 CPU spike incidents. Use when CloudWatch CPU alarms fire. Covers T-series burstable instance CPU credit exhaustion, process-level root cause identification, and SSM command history analysis. Use when CPU utilization is high, CPUUtilization alarms fire, or burst performance issues are suspected.
Agent Type: Generic (all agent types)
Instructions:

SKILL.md (Instructions)

# EC2 CPU Spike Investigation Procedures
 
## Investigation Priority
 
Follow this order:
 
### Step 1: Basic Instance Information
- Check instance type; if T-series (t3, t3a, t4g), always proceed to Step 2
- Check instance launch time; if launched within 5 minutes, consider initialization processes
 
### Step 2: T-Series CPU Credit Analysis (T-series only)
Always check these CloudWatch metrics:
- `CPUCreditBalance`: If 0, burst performance is throttled
- `CPUCreditUsage`: Credit consumption rate
- `CPUSurplusCreditBalance`: Surplus credit usage (additional charges)
- `CPUSurplusCreditsCharged`: Charged surplus credits
 
Baseline performance by instance type:
| Type | vCPU | Baseline |
|------|------|----------|
| t3.micro | 2 | 10% |
| t3.small | 2 | 20% |
| t3.medium | 2 | 20% |
| t3.large | 2 | 30% |
 
### Step 3: Process-Level Root Cause Identification
- Check SSM RunCommand history for recently executed commands
- Look for load-generating tools: `stress-ng`, `stress`, `dd`, `yes`
- If CloudWatch Agent is installed, check `procstat` metrics
 
### Step 4: Infrastructure Change Review
- Check CloudTrail for recent instance-related events
- Review security group changes, IAM role changes
 
## Report Format
 
Report findings in this format:
 
### 1. Impact Scope
- Affected resources (instance ID, related services)
- Impact start time and duration
 
### 2. Root Cause
- Cause details (process name, command, executor)
- Determination: intentional or unintentional
 
### 3. Actions
- **Immediate**: Actions to take right now
- **Short-term**: Improvements within 1 week
- **Long-term**: Architecture-level improvements

Verification: How Skills Change Investigation Quality

Using the same stress-ng scenario (EC2 t3.micro, CPU 100%, CloudWatch alarm in ALARM state), I ran investigations via Chat first without and then with the Skill, and compared the results. Open Chat from the "Chat" option in the Operator Access left menu.

Note: stress-ng auto-terminates after --timeout 1800 (30 minutes). If Skill creation takes a while, re-run stress-ng before starting the investigation.

Baseline: Investigation Without Skill

Before creating the Skill, I ran the investigation without it.

Chat prompt: "CloudWatch alarm devops-agent-skills-high-cpu is firing for instance i-0123456789abcdef0 in ap-northeast-1. CPU utilization remains high. Investigate and identify the root cause."

Results:

Investigation time: 164 seconds (2m 44s)
Tool calls: 38
skill_read: 2 (built-in skills; Skill had not been created yet)
Root cause: Correctly identified stress-ng
Report format: Free-form (conclusion → root cause → timeline → findings → recommendations)
Action categories: Short-term / medium-to-long-term (2 tiers)

DevOps Agent internally invokes multiple "tools" to gather information when responding to a Chat query. Tools include AWS API calls (use_aws), CloudTrail event lookups (lookup_cloudtrail_events), and Skill loading (skill_read). The number and types of tool calls can be observed from the EventStream API response.
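
The EventStream response schema is not shown in this article, so the following is only a sketch of the tallying step. It assumes the tool names have already been extracted, one per line, into a hypothetical file `tool_calls.txt`. The sample contents are illustrative, not captured agent output.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical input: one tool name per line, extracted beforehand from the event stream.
# These lines are illustrative sample data, not real agent output.
cat > tool_calls.txt <<'EOF'
skill_read
use_aws
use_aws
lookup_cloudtrail_events
use_aws
EOF

# Total number of tool calls
TOTAL=$(wc -l < tool_calls.txt | tr -d ' ')
echo "total: $TOTAL"

# Count per tool name, most frequent first
sort tool_calls.txt | uniq -c | sort -rn
```

Running the same tally on both sessions gives the "Tool calls" and "skill_read" figures in a form that is easy to diff.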

The investigation was accurate, but the report format was determined by the agent, and actions used a coarse 2-tier classification.

Baseline investigation response (excerpt)

Investigation Report: High CPU Utilization on EC2 Instance i-0123456789abcdef0

Conclusion This high CPU utilization is caused by an intentional stress test, not a production incident. User "(user-name)" is running the stress-ng utility for CPU stress testing.

Root Cause: CPU Stress Test Running Currently, stress-ng --cpu 2 --timeout 1800 executed via AWS Systems Manager (SSM) is active.

Key Findings

Intentional test: Planned stress test for verification purposes

T3.micro CPU credit exhaustion: CPU credit balance: 0, surplus credits in use

Recommended Actions

Short-term:

Install CloudWatch Agent for process-level monitoring

Medium-to-long-term:

Consider upgrading to t3.small or larger

The report used a free-form structure (conclusion → root cause → timeline → findings → recommendations) with a 2-tier action classification.

Investigation With Skill

With ec2-cpu-investigation Skill set to Active, I sent the same prompt from a new Chat session.

Results:

Investigation time: 64 seconds (1m 4s)
Tool calls: 11
skill_read: 1 (ec2-cpu-investigation — the Skill)
Root cause: Correctly identified stress-ng
Report format: Skill-guided structured format (Impact Scope → Root Cause → Determination → Actions)
Action categories: Immediate / short-term / long-term (3 tiers, as specified in Skill)

Skill-enabled investigation response (excerpt)

🔍 Investigation Report

1. Impact Scope

Affected Resource: EC2 instance i-0123456789abcdef0 (devops-agent-skills-test)

Instance Type: t3.micro (vCPU: 2, baseline: 10%)

Impact Start: 2026-04-01 03:05:00 UTC approx.

2. Root Cause Intentional load generation test using stress-ng is running.

CPU Credit Status:

CPUCreditBalance: 0 (fully exhausted)

CPUSurplusCreditBalance: 26.5 (surplus credits in use)

3. Determination Determined to be an intentional load test.

4. Actions

Immediate

Stop the stress-ng process

Short-term (within 1 week)

Set up a separate environment for load testing

Long-term

Build load test automation

The Skill's "Impact Scope → Root Cause → Actions" structure was reflected, with "Determination" added as a separate section. Actions were classified into 3 tiers.
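
To cross-check the credit values in the report against CloudWatch directly, a helper along these lines could be used. It is a sketch, not part of the original verification: the function name `get_credit_metric` is hypothetical, and the time window in the usage comment is an example. EC2 publishes CPU credit metrics at 5-minute granularity, hence `--period 300`.

```shell
#!/usr/bin/env bash
# Fetch 5-minute averages of one EC2 metric for a time window.
# Arguments: metric name, instance ID, region, start time, end time (ISO 8601, UTC).
get_credit_metric() {
  local metric=$1 instance_id=$2 region=$3 start=$4 end=$5
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name "$metric" \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --start-time "$start" --end-time "$end" \
    --period 300 --statistics Average \
    --query "sort_by(Datapoints,&Timestamp)[].[Timestamp,Average]" \
    --output text --region "$region"
}

# Example usage (not executed here; requires credentials and a live instance):
# get_credit_metric CPUCreditBalance "$INSTANCE_ID" "$REGION" 2026-04-01T03:00:00Z 2026-04-01T03:30:00Z
# get_credit_metric CPUSurplusCreditBalance "$INSTANCE_ID" "$REGION" 2026-04-01T03:00:00Z 2026-04-01T03:30:00Z
```

The output is one line per datapoint (timestamp and average), sorted by time, which makes it straightforward to compare with the values the agent quotes.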

Comparison

Item	Without Skill	With Skill
Investigation time	164s	64s
Tool calls	38	11
Skills loaded	2 built-in	1 Skill
Root cause identified	✅ stress-ng	✅ stress-ng
CPUCreditBalance mentioned	Yes	Yes (with value: 0)
CPUSurplusCreditBalance mentioned	Yes	Yes (with value: 26.5)
Baseline performance explained	10%	10% (referencing Skill's table)
Report format	Free-form	Skill-guided structured format
Action categories	Short-term / medium-to-long-term	Immediate / short-term / long-term
Intentional/unintentional determination	Yes	Yes (explicitly per Skill instructions)

The most notable differences were in investigation time (164s → 64s, 61% shorter) and tool calls (38 → 11, 71% fewer). The baseline investigation included broad exploration, such as describe_instances calls across multiple regions, that was absent from the Skill-enabled investigation. The Skill's priority specification may have helped narrow the search scope, but because the Skill-enabled run was the second execution, changes in the agent's internal state could also be a factor. The causal relationship cannot be definitively established.
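
The percentages above follow directly from the measured values. A small sketch of the arithmetic, where the helper name `pct_reduction` is introduced only for illustration:

```shell
#!/usr/bin/env bash
# Percentage reduction from a baseline value to a new value, rounded to the nearest integer.
pct_reduction() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.0f\n", (before - after) * 100 / before }'
}

TIME_REDUCTION=$(pct_reduction 164 64)   # investigation time in seconds
CALL_REDUCTION=$(pct_reduction 38 11)    # number of tool calls

echo "investigation time: ${TIME_REDUCTION}% shorter"
echo "tool calls: ${CALL_REDUCTION}% fewer"
```

Note that a 61% reduction in elapsed time corresponds to roughly a 2.6x speedup (164 / 64), which is a different figure from "61%".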

The report format reflected the Skill's structure (Impact Scope, Root Cause, Actions). Rather than using the Skill's 3-part structure as-is, the agent output a 4-part structure with "Determination" added as a separate section. However, all elements specified in the Skill (impact scope, root cause, intentional/unintentional determination, immediate/short-term/long-term actions) were present in the output.

Note that this comparison is based on a single execution of each scenario. The time difference also includes variance in network latency and API response times, so the result should not be generalized into a claim such as "Skills always reduce investigation time by 61%."

Agent Type Targeting

I changed the Skill's Agent Type to "Incident RCA" only and sent the same prompt from an On-demand Chat. To change it, open the Skill in the Operator Access Skills page, click "Edit," change Agent Type from "Generic" to "Incident RCA," and click "Save."

Results:

skill_read: 0 (Skill was not loaded)
Tool calls: 2 only
Response: Did not perform a deep investigation; asked "Shall I start the investigation?"

After reverting Agent Type to "Generic," the Skill becomes available again. Agent Type targeting controls Skill applicability as intended.

Summary

Adding a single Skill produced clear changes in DevOps Agent's investigation process.

Investigation efficiency — With the Skill, tool calls dropped from 38 to 11 and investigation time from 164s to 64s. As noted in the comparison analysis, whether the Skill's priority specification or changes in the agent's internal state was the primary factor cannot be determined, but the elimination of broad region scanning was confirmed
Report format control — All elements specified in the Skill (impact scope, root cause, intentional/unintentional determination, immediate/short-term/long-term actions) were present in the output. However, the agent used a 4-part structure with "Determination" as a separate section rather than the Skill's 3-part structure
Agent Type targeting — Confirmed that targeting a Skill to "Incident RCA" only prevents it from loading in On-demand Chat. According to the documentation, Agent Types can be used to apply different Skills to different investigation phases, but this operational pattern was not tested in this verification
No CLI support — Skill management is Operator Access only; no CLI/SDK APIs exist. IaC management and CI/CD pipeline integration are not possible at this time

Observations on Skill design from this verification (based on a single Skill and scenario; additional testing is needed to generalize):

Specify priority data sources — Write conditional priorities like "if T-series instance, always check CPU credit metrics"
Specify report format — Explicitly define output structure like "Impact Scope → Root Cause → Actions"
Provide judgment criteria — Describe the decision axes you want the agent to apply, like "determine whether the cause is intentional or unintentional"

Cleanup

Resource deletion steps

Terminal

REGION=ap-northeast-1
# Assumes INSTANCE_ID and SG_ID are still set from the setup session
 
# Terminate EC2 instance
aws ec2 terminate-instances --instance-ids $INSTANCE_ID --region $REGION
 
# Delete CloudWatch alarm
aws cloudwatch delete-alarms \
  --alarm-names "devops-agent-skills-high-cpu" --region $REGION
 
# Delete security group after instance terminates
aws ec2 wait instance-terminated --instance-ids $INSTANCE_ID --region $REGION
aws ec2 delete-security-group --group-id $SG_ID --region $REGION

The Skill created in Operator Access can be deleted or set to Inactive from the Skills page. The Agent Space and IAM roles are kept for future verification.
