.agents/skills/autonomy--sre-agent/SKILL.md
Diagnose production issues by querying logs, metrics, and traces from observability, cloud, and CI/CD CLIs. Use when investigating errors, debugging production issues, checking service health, reviewing CI/CD status, checking recent deployments, reproducing a reported bug with evidence, or when the user mentions logs, metrics, monitoring dashboards, container services, CI pipelines, or production incidents. Do not use when the issue is a known feature request or cosmetic change (use clarity--ticket-writer) or when auditing whether tools are accessible (use autonomy--sre-auditor).
npx skillsauth add patterninc/code-mint autonomy--sre-agent
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Investigate production issues using observability, cloud, and CI/CD CLIs. Follow the investigation workflow in order. Do not skip to a fix before confirming the root cause.
This skill ships with Datadog Pup, AWS CLI, and GitHub CLI as a worked example. These are the tools the skill was validated against, and they serve as a complete reference implementation. If your project uses different tooling (e.g., Grafana/Loki, GCP Cloud Logging, Azure Monitor, GitLab CI, Jenkins), replace the tool-specific command sections while preserving the investigation workflow structure, hypothesis discipline, and [CUSTOMIZE] pattern.
This skill also serves as an onboarding proof tool. If the team wants to prove Bug Reproduction or SRE Investigation, use this workflow to gather evidence and record the outcome in docs/onboarding-checklist.md.
Ensure these are available before running queries. If any fail, stop and tell the user what's missing. If multiple tools are unavailable, advise running autonomy--sre-auditor to diagnose access issues.
Also check whether the project-local [CUSTOMIZE] sections in this skill and references/recipes.md have been filled in for the current repository. If key identifiers such as service names, log groups, AWS profiles, repo owner/name, or deploy workflow are still placeholders, stop and ask the user to complete the onboarding customization step first rather than guessing.
Pup is a Datadog CLI with 200+ commands across 33+ products. For installation, updates, and full API coverage, fetch the latest README: https://github.com/datadog-labs/pup
brew tap datadog-labs/pack && brew install datadog-labs/pack/pup
# Update: brew update && brew upgrade datadog-labs/pack/pup
Auth — Preferred: pup auth login (OAuth2, browser-based). Fallback: DD_API_KEY + DD_APP_KEY env vars (some commands like logs search require API key auth and don't support OAuth2). Check status: pup auth status.
App Key Setup — Some commands require an Application Key (DD_APP_KEY). Create one at Organization Settings -> Application Keys in Datadog with these recommended read-only scopes:
| Category | Scopes |
|---|---|
| Logs | logs_read_data, logs_read_config, logs_live_tail |
| Metrics | metrics_read, timeseries_query |
| APM | apm_read, apm_api_catalog_read, apm_service_catalog_read |
| Monitors | monitors_read |
| Dashboards | dashboards_read, notebooks_read |
| Events | events_read |
| SLOs | slos_read |
| Databases | dbm_read |
| Debugging | debugger_read, continuous_profiler_read, error_tracking_read |
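When relying on the key-based fallback, it can help to confirm that both environment variables are set before running a query. The sketch below is illustrative only: `require_dd_keys` is a hypothetical helper name, and it checks only that the variables are non-empty, not that the keys are valid or carry the scopes listed above.

```shell
# Hypothetical helper: fail early when DD_API_KEY or DD_APP_KEY is unset or empty.
require_dd_keys() {
  missing=""
  [ -n "${DD_API_KEY:-}" ] || missing="$missing DD_API_KEY"
  [ -n "${DD_APP_KEY:-}" ] || missing="$missing DD_APP_KEY"
  if [ -n "$missing" ]; then
    # Report which variables are absent without printing any secret values
    echo "Missing Datadog credentials:$missing" >&2
    return 1
  fi
}

# Example: require_dd_keys && pup logs search --query "service:[SERVICE] status:error" --from "1h"
```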
aws sts get-caller-identity --profile [CUSTOMIZE: profile name]
Should succeed and return account ID, user ARN, and user ID. [CUSTOMIZE: Document your AWS profile names and which environments they map to.]
gh auth status
Should succeed. [CUSTOMIZE: Document your repository and whether to use --repo owner/repo or run from the repo directory.]
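The three prerequisite checks above (`pup auth status`, `aws sts get-caller-identity`, `gh auth status`) can be run together so that every missing tool is reported at once. This is a rough sketch under stated assumptions: `check_tool`, `preflight`, and the `AWS_PROFILE_NAME` variable are hypothetical names introduced for illustration, and the variable stands in for your [CUSTOMIZE] profile name.

```shell
# Run one prerequisite check quietly and report whether it passed.
# Usage: check_tool <label> <command> [args...]
check_tool() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "ok: $label"
  else
    echo "MISSING: $label"
    return 1
  fi
}

# Run all three checks even if an early one fails, then summarize.
preflight() {
  failures=0
  check_tool "pup (Datadog)" pup auth status || failures=$((failures + 1))
  check_tool "aws" aws sts get-caller-identity --profile "${AWS_PROFILE_NAME:-default}" \
    || failures=$((failures + 1))
  check_tool "gh" gh auth status || failures=$((failures + 1))
  if [ "$failures" -gt 1 ]; then
    echo "Multiple tools unavailable: consider running autonomy--sre-auditor."
  fi
  # Succeed only when every check passed
  [ "$failures" -eq 0 ]
}
```

Calling `preflight` before any query gives a single pass/fail gate; a failure is the point to stop and tell the user what is missing.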
When this skill is used during onboarding rather than a live incident, the goal is to produce one of two proofs: Bug Reproduction or SRE Investigation.
Record the evidence in docs/onboarding-checklist.md.
Do not mark either outcome as complete until that evidence exists in-repo.
Use the CLIs themselves — don't memorize commands. Drill down progressively:
pup --help # Top-level command groups
pup logs --help # Subcommands for a group
pup logs search --help # Flags for a specific command
aws ecs help # AWS uses `help` without --
gh run --help # Same pattern as pup
[CUSTOMIZE: Replace this section with your project's actual infrastructure. These values are intended to be filled collaboratively during onboarding and preserved as project-local configuration.]
Check the root AGENTS.md "Related Repositories" section for the IaC companion repo (often {project}-tf). If you need to understand resource configuration, networking, IAM policies, or debug infrastructure-level issues, clone it:
git clone [CLONE_URL from AGENTS.md Related Repositories] /tmp/{repo-name}
[CUSTOMIZE: e.g., IaC repo: myapp-tf — contains Terraform modules for ECS, RDS, ElastiCache, ALB, IAM, and VPC configuration.]
| Component | Identifier | Scaling |
|---|---|---|
| [CUSTOMIZE: e.g., API (prod)] | [CUSTOMIZE: e.g., prod-myapp-api] | [CUSTOMIZE: e.g., 2-5 tasks] |
| [CUSTOMIZE: e.g., Workers (prod)] | [CUSTOMIZE: e.g., prod-myapp-workers] | [CUSTOMIZE: e.g., 1-16 tasks] |
[CUSTOMIZE: e.g., Aurora PostgreSQL 16 (Serverless v2). Cluster: myapp-prod.]
[CUSTOMIZE: e.g., ElastiCache Redis cluster: myapp-redis-cluster.]
| Tag | Value |
|---|---|
| service | [CUSTOMIZE: e.g., myapp] |
| env | [CUSTOMIZE: e.g., dev / prod] |
[CUSTOMIZE: e.g., /ecs/prod-myapp-api, /ecs/prod-myapp-workers.]
[CUSTOMIZE: e.g., Dev: dev-myapp.example.com. Prod: myapp.example.com.]
pup logs search --query "service:[SERVICE] status:error" --from "1h"
pup logs search --query "service:[SERVICE] @http.status_code:500" --from "30m"
pup logs search --query "service:[SERVICE] @dd.trace_id:<TRACE_ID>"
Time ranges: 15m, 1h, 4h, 1d, 1w, or ISO timestamps. Use --limit N and --json as needed.
pup metrics query --query "avg:ecs.fargate.cpu.percent{service:[SERVICE],env:[ENV]}" --from "1h"
pup metrics query --query "sum:trace.[FRAMEWORK].request.errors{service:[SERVICE]}.as_count()" --from "1h"
pup metrics query --query "avg:trace.[FRAMEWORK].request.duration.by.resource_service.95p{service:[SERVICE]}" --from "1h"
Dashboards and SLOs don't support server-side filtering — pipe through jq:
pup monitors list --tags="service:[SERVICE]"
pup dashboards list | jq '[.data.dashboards[] | select(.title | test("[SERVICE]";"i")) | {id, title}]'
pup slos list | jq '[.data.data[] | select(.name | test("[SERVICE]";"i")) | {id, name}]'
Use pup dashboards get <id> to extract metric queries powering a dashboard, then run them with pup metrics query.
pup traces search --query "service:[SERVICE] status:error" --from "1h"
pup traces search --query "service:[SERVICE] resource_name:POST_/api/v1/endpoint" --from "1h"
Primarily used for ECS task health and CloudWatch log fallback.
# Service status
aws ecs describe-services --cluster [CLUSTER] --services [SERVICE] \
--profile [PROFILE] --region [REGION] \
--query 'services[0].{status:status,running:runningCount,desired:desiredCount}'
# Recently stopped tasks
aws ecs list-tasks --cluster [CLUSTER] --desired-status STOPPED \
--profile [PROFILE] --region [REGION]
# Why a task stopped
aws ecs describe-tasks --cluster [CLUSTER] --tasks <TASK_ARN> \
--profile [PROFILE] --region [REGION] \
--query 'tasks[0].{stoppedReason:stoppedReason,stopCode:stopCode,containers:containers[*].{name:name,exitCode:exitCode,reason:reason}}'
# CloudWatch log search (fallback when Datadog is unavailable)
# --start-time is in epoch milliseconds. The date syntax below is for macOS;
# on GNU/Linux use: $(date -d '1 hour ago' +%s000)
aws logs filter-log-events --log-group-name [LOG_GROUP] \
--filter-pattern "ERROR" --start-time "$(date -v-1H +%s000)" \
--profile [PROFILE] --region [REGION]
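Because the macOS and GNU/Linux `date` flags differ, a small wrapper can produce the epoch-millisecond value for `--start-time` on either platform. This is a sketch, not part of the skill: `epoch_ms_hours_ago` is a hypothetical helper, and it assumes that GNU-style `date -d` fails cleanly on systems that do not support it so the BSD/macOS form can be tried next.

```shell
# Print the epoch time in milliseconds for N hours ago (default: 1).
# Tries GNU date syntax first, then falls back to BSD/macOS syntax.
epoch_ms_hours_ago() {
  hours=${1:-1}
  secs=$(date -d "$hours hours ago" +%s 2>/dev/null) \
    || secs=$(date -v-"${hours}"H +%s)
  # Append three zeros to convert whole seconds to milliseconds
  echo "${secs}000"
}

# Example:
# aws logs filter-log-events --log-group-name [LOG_GROUP] \
#   --filter-pattern "ERROR" --start-time "$(epoch_ms_hours_ago 1)" \
#   --profile [PROFILE] --region [REGION]
```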
Primarily used for CI status, deployment correlation, and commit history.
gh run list --limit 10 # Recent CI runs
gh run list --workflow [DEPLOY_WORKFLOW] --limit 5 # Recent deployments
gh run view <RUN_ID> --log-failed # Failed run logs
gh run view --job <JOB_ID> --log-failed # Failed job logs
# Recent commits (correlate with errors)
gh api repos/[OWNER]/[REPO]/commits \
--jq '.[:10] | .[] | {sha: .sha[:8], message: .commit.message, date: .commit.author.date}'
# Compare two deployments
gh api repos/[OWNER]/[REPO]/compare/COMMIT1...COMMIT2 \
--jq '.files[] | {filename, status, changes}'
Follow these steps in order:
1. Use pup metrics query for error counts and trends. Determine whether the problem is intermittent or persistent before diving into logs.
2. Use pup logs search for the error message, status code, or endpoint. Get stack traces.
3. Use dd.trace_id from logs to find full APM traces.
4. Use aws ecs describe-services for task counts and deployment state.
5. Use gh run list --workflow [DEPLOY_WORKFLOW] to find recent deploys around the time errors started.
6. Use gh api commit comparison to identify the culprit change.

If the workflow above doesn't immediately surface the cause, form ranked hypotheses:
Common root cause categories:
Write a failing test case that reproduces the bug BEFORE proposing any fix. If you cannot reproduce the bug in a test, you do not understand it well enough to fix it.
For onboarding proof, the failing test or deterministic repro recipe is itself the evidence artifact. Capture it in docs/onboarding-checklist.md and update .agents/code-mint-status.json with the current bug_reproduction and sre_investigation outcome statuses and dates before moving on.
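As a recap of the investigation order described earlier (metrics, logs, traces, infrastructure, deploys, commit comparison), the sketch below prints the commands in sequence without running them. It is only an illustration: `print_investigation_plan` is a hypothetical helper, the commands are copied from the examples in this skill, and the bracketed and angle-bracketed values remain placeholders to fill in by hand.

```shell
# Print the investigation commands in workflow order without executing them.
# Usage: print_investigation_plan <service> <cluster> <deploy-workflow>
print_investigation_plan() {
  service=$1
  cluster=$2
  workflow=$3
  # Unquoted heredoc delimiter: the three variables expand, everything else is literal
  cat <<EOF
1. pup metrics query --query "sum:trace.[FRAMEWORK].request.errors{service:$service}.as_count()" --from "1h"
2. pup logs search --query "service:$service status:error" --from "1h"
3. pup logs search --query "service:$service @dd.trace_id:<TRACE_ID>"
4. aws ecs describe-services --cluster $cluster --services $service --profile [PROFILE] --region [REGION]
5. gh run list --workflow $workflow --limit 5
6. gh api repos/[OWNER]/[REPO]/compare/COMMIT1...COMMIT2
EOF
}

# Example: print_investigation_plan myapp prod-cluster deploy.yml
```

Printing rather than executing keeps each step under review, which fits the rule of confirming the root cause before proposing a fix.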
gh run list.
[CUSTOMIZE] pattern intact.