plugins/claude-ops/skills/ops-fires/SKILL.md
Production incidents dashboard. Reads ECS health, Sentry errors, CI failures. Offers to dispatch fix agents for active fires.
Before executing, load available context:
- Daemon health: Read ${CLAUDE_PLUGIN_DATA_DIR:-$HOME/.claude/plugins/data/ops-ops-marketplace}/daemon-health.json
  - infra-monitor service status: if not running, pre-gathered infra data may be stale
  - action_needed is not null → surface it immediately as a potential fire
- Secrets: AWS credentials are required for ECS/CloudWatch queries.
  - $AWS_ACCESS_KEY_ID / $AWS_PROFILE env vars
  - doppler secrets get AWS_ACCESS_KEY_ID --plain (if doppler configured in prefs)
  - password_manager_config.query_cmd from preferences
  - $SENTRY_AUTH_TOKEN → Doppler SENTRY_AUTH_TOKEN → vault
- Preferences: Read ${CLAUDE_PLUGIN_DATA_DIR}/preferences.json for secrets_manager config to know which vault to query.
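The Sentry token fallback chain above (environment variable, then Doppler, then vault) could be sketched roughly as follows. The function name `resolve_sentry_token` is hypothetical, and the vault step is left out because its command comes from `password_manager_config.query_cmd`, which this page does not show.

```shell
# Hypothetical helper: resolve the Sentry token using the fallback chain above.
resolve_sentry_token() {
  # 1. The environment variable wins if it is set and non-empty.
  if [ -n "${SENTRY_AUTH_TOKEN:-}" ]; then
    printf '%s\n' "$SENTRY_AUTH_TOKEN"
    return 0
  fi
  # 2. Fall back to Doppler, but only if the CLI is installed.
  if command -v doppler >/dev/null 2>&1; then
    local token
    if token=$(doppler secrets get SENTRY_AUTH_TOKEN --plain 2>/dev/null) && [ -n "$token" ]; then
      printf '%s\n' "$token"
      return 0
    fi
  fi
  # 3. The vault lookup is not sketched here; signal "not found" to the caller.
  return 1
}
```

A caller would typically treat a non-zero return as "skip the Sentry checks" rather than as a fire in its own right.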
| Command | Usage | Output |
|---------|-------|--------|
| aws ecs list-services --cluster <name> --query 'serviceArns' | ECS services | ARN list |
| aws ecs describe-services --cluster <name> --services <arn> --query 'services[0].{status:status,running:runningCount,desired:desiredCount}' | Service health | JSON |
| aws logs tail /ecs/<service> --since 1h --format short | ECS logs | Log lines (use with Monitor for live) |
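The describe-services query above returns `runningCount` and `desiredCount`. A minimal sketch of turning that pair into a health label is shown below; the function name and the label strings are assumptions for illustration, and extracting the two integers from the JSON is left out.

```shell
# Hypothetical helper: label a service from its running and desired task counts.
# Usage: ecs_health_label <running> <desired>
ecs_health_label() {
  local running=$1 desired=$2
  if [ "$desired" -eq 0 ]; then
    echo "scaled-to-zero"   # assumption: nothing is expected to run
  elif [ "$running" -eq 0 ]; then
    echo "down"             # tasks are wanted but none are running
  elif [ "$running" -lt "$desired" ]; then
    echo "degraded"         # some, but not all, desired tasks are running
  else
    echo "healthy"
  fi
}
```

For example, `ecs_health_label 1 3` prints `degraded`.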
| Command | Usage | Output |
|---------|-------|--------|
| gh run list --limit 20 --json status,conclusion,name,headBranch,createdAt | Recent CI runs | JSON array |
| gh run view <id> --repo <repo> --log-failed | Failed CI logs | Log output |
| Command | Usage | Output |
|---------|-------|--------|
| sentry-cli issues list --project <slug> --status unresolved | Unresolved issues | Issue list |
| curl -H "Authorization: Bearer $SENTRY_AUTH_TOKEN" "https://sentry.io/api/0/projects/<org>/<proj>/issues/?query=is:unresolved" | API fallback when MCP unavailable | JSON array |
If CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 is set, use Agent Teams when dispatching multiple fix agents simultaneously.
Team setup (only when flag is enabled, dispatch phase):
TeamCreate("fire-fixers")
Agent(team_name="fire-fixers", name="fix-[service]", ...)
If the flag is NOT set, use standard parallel subagents.
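The flag check itself is a plain environment-variable test. As a sketch, with the function name and the two mode labels being hypothetical:

```shell
# Hypothetical helper: pick the dispatch mode from the feature flag.
dispatch_mode() {
  # Only the exact value "1" enables Agent Teams; unset or any other value does not.
  if [ "${CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS:-}" = "1" ]; then
    echo "agent-teams"
  else
    echo "parallel-subagents"
  fi
}
```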
${CLAUDE_PLUGIN_ROOT}/bin/ops-infra 2>/dev/null || echo '{"clusters":[],"error":"infra check failed"}'
${CLAUDE_PLUGIN_ROOT}/bin/ops-ci 2>/dev/null || echo '[]'
${CLAUDE_PLUGIN_ROOT}/bin/ops-external 2>/dev/null || echo '[]'
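Each command above uses the same pattern: discard stderr and, if the script fails, emit an empty JSON placeholder so later parsing still has valid input. A small wrapper illustrating the pattern might look like this; `run_or_default` is a hypothetical name, not part of the plugin.

```shell
# Hypothetical wrapper around the `cmd 2>/dev/null || echo <default>` pattern.
# Usage: run_or_default <default-json> <command> [args...]
run_or_default() {
  local default=$1
  shift
  # Discard stderr; if the command exits non-zero, print the default instead.
  # Note: output the command wrote to stdout before failing is still emitted,
  # which is also true of the inline pattern.
  "$@" 2>/dev/null || printf '%s\n' "$default"
}
```

With it, the CI line would read `run_or_default '[]' "${CLAUDE_PLUGIN_ROOT}/bin/ops-ci"`.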
Analyze the pre-gathered data — including external projects. Then run parallel checks:
- gh run list --limit 20 --json status,conclusion,name,headBranch,createdAt 2>/dev/null
- External projects: treat auth_expired as HIGH (credential rotation needed), unreachable/degraded as MEDIUM, not_configured as LOW.

Classify each issue by severity:
| Severity | Criteria |
| -------- | ------------------------------------------------- |
| CRITICAL | Service down, DB unreachable, auth broken |
| HIGH | Elevated error rate, deploy stuck, CI main broken |
| MEDIUM | Non-critical service degraded, flaky tests |
| LOW | Warning-level, non-urgent |
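The external-project statuses named above (auth_expired, unreachable, degraded, not_configured) map to severities in a fixed way, which a `case` statement captures directly. The function name is hypothetical, and the `UNKNOWN` fallback for unmapped statuses is an assumption rather than something the skill specifies.

```shell
# Hypothetical helper: map an external project status to a severity level.
external_status_severity() {
  case "$1" in
    auth_expired)          echo "HIGH" ;;     # credential rotation needed
    unreachable|degraded)  echo "MEDIUM" ;;
    not_configured)        echo "LOW" ;;
    *)                     echo "UNKNOWN" ;;  # assumption: flag unmapped statuses instead of guessing
  esac
}
```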
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OPS ► FIRES DASHBOARD — [timestamp]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
CRITICAL
[service] — [issue] — [since]
HIGH
[service] — [issue] — [since]
MEDIUM
[service] — [issue] — [since]
ECS HEALTH
[cluster] [service] [desired/running] [status]
CI STATUS
[repo] [branch] [workflow] [status] [last run]
SENTRY (top errors, 24h)
[error] [count] [first seen] [project]
EXTERNAL PROJECTS
[alias] [source] [status] [details — e.g. auth_expired, unreachable]
──────────────────────────────────────────────────────
Use batched AskUserQuestion calls (max 4 options each). Only show relevant actions (e.g., skip dispatch options if no issues found):
AskUserQuestion call 1:
[Dispatch fix agent for [top critical issue]]
[Dispatch fix agent for [second issue]]
[View logs for [service]]
[More...]
AskUserQuestion call 2 (only if "More..."):
[Open Sentry dashboard]
[Open GitHub Actions]
[All clear — nothing to do]
If no fires: show "ALL SYSTEMS OPERATIONAL" with last-checked timestamps.
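The batching rule above (at most 4 options per call, with "More..." leading to the next call) can be sketched as a paging function. This is only an illustration of the rule: `batch_options` is a hypothetical name, and printing each call's options as one `|`-separated line is an assumption made to keep the output easy to inspect.

```shell
# Hypothetical helper: split options into pages of at most 4.
# When options remain beyond a page, its last slot becomes "More...".
batch_options() {
  local max=4 n more line i
  while [ "$#" -gt 0 ]; do
    if [ "$#" -le "$max" ]; then
      n=$#; more=""                      # everything left fits on this page
    else
      n=$((max - 1)); more="|More..."    # reserve one slot for "More..."
    fi
    line=""
    i=0
    while [ "$i" -lt "$n" ]; do
      line="${line:+$line|}$1"           # join options with "|"
      shift
      i=$((i + 1))
    done
    printf '%s%s\n' "$line" "$more"
  done
}
```

Called with five options, the first line holds three of them plus "More..." and the second line holds the remaining two.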
When user selects to fix an issue, use AskUserQuestion to confirm the scope before dispatching:
Dispatch fix agent for: [issue title]
Severity: [CRITICAL/HIGH/MEDIUM]
Repo: [repo]
Error: [brief description]
The agent will:
- Investigate root cause in [repo]
- Create feature branch with fix
- Open PR for review
[Dispatch agent] [Show me the logs first] [Skip — I'll fix manually]
On confirmation, spawn an Agent with:
Use the agents/infra-monitor.md agent definition for infra issues.
If $ARGUMENTS contains a project alias, filter to that project's services only.
Use Monitor to stream ECS task logs or GitHub Actions runs when investigating fires:
Monitor(command: "aws logs tail /ecs/<service> --follow --since 5m")
Use TaskCreate for each active fire. Update with TaskUpdate as fires are investigated/fixed/escalated.
When diagnosing fires, use WebFetch to check AWS status page (https://health.aws.amazon.com/health/status), Vercel status, or third-party API status pages.
Use WebSearch to find if the error pattern matches a known AWS/infrastructure issue (e.g., "ECS task stopped CannotPullContainerError" → known ECR throttling).