claude/skills/monitor-deployment/SKILL.md
Monitor a PR post-merge deployment by polling Sentry, Datadog, and CI/CD checks, then send a macOS notification with a health verdict and action recommendation. Use when asked to watch a deployment, monitor after a merge, or track post-deployment health.
npx skillsauth add iainmcl/dotfiles monitor-deploymentInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Watch a merged (or soon-to-merge) PR and assess deployment health using available monitoring signals. Send a macOS notification with a clear verdict and recommended action when the picture is clear enough to act on.
Sandbox note: All
ghcommands requiredangerouslyDisableSandbox: true.
Extract from the user's message:
{OWNER}/{REPO}#{PR_NUMBER}If the PR URL is missing, ask for it before proceeding.
Before starting, look for a DEPLOYMENT_MONITOR config block in the project's
CLAUDE.md. If found, use those intervals. Otherwise use the default.
Default check schedule (seconds after deploy confirmed live):
| Check | Delay | What it catches | |---|---|---| | 1 | 0 s (immediate) | Build/startup failures, crash-on-boot | | 2 | +30 s | Fast-failing errors, broken routes | | 3 | +60 s (1 min) | Initial traffic errors | | 4 | +180 s (3 min) | Stabilising error rates | | 5 | +600 s (10 min) | Sustained degradation | | 6 | +1200 s (20 min) | Long-tail issues, slow memory leaks |
Total window: ~34 minutes. Stop early if verdict is HEALTHY for two
consecutive checks after check 3, or immediately if verdict is INCIDENT.
Project CLAUDE.md override format (place in the project's CLAUDE.md):
## DEPLOYMENT_MONITOR
check_schedule_seconds: [0, 30, 90, 270, 870, 2070]
# ^ override the delay offsets from deploy-confirmed time
# Tune upward for low-traffic services, downward for high-traffic ones
When a custom schedule is found, log it at the start: Using project check schedule: [...]
gh pr view {PR_NUMBER} --repo {OWNER}/{REPO} \
--json state,mergedAt,mergeCommit,headRefName,title,baseRefName
state is not MERGED, poll every 2 minutes (use /loop 2m or tell the
user to re-run when merged). Do not proceed with monitoring until merged.mergedAt — use this as the deployment start time for log/metric queries.gh run list --repo {OWNER}/{REPO} --branch {BASE_REF} --limit 5 \
--json databaseId,name,status,conclusion,createdAt,url
Look for a deploy workflow that started after mergedAt. If one is in progress,
poll until it completes:
gh run watch {RUN_ID} --repo {OWNER}/{REPO}
If the deploy workflow failed, skip to Phase 5: Notify with verdict DEPLOY_FAILED.
gh api repos/{OWNER}/{REPO}/deployments \
--jq '[.[] | select(.created_at > "{MERGED_AT}") | {id,environment,created_at,ref}]'
Check the latest deployment status:
gh api repos/{OWNER}/{REPO}/deployments/{DEPLOYMENT_ID}/statuses \
--jq '.[0] | {state,description,log_url}'
At the moment the deploy is confirmed live, calculate the absolute time for each check by adding the schedule offsets to now. Cron has 1-minute resolution, so checks 1 and 2 (T+0 and T+30s) run synchronously in this session; checks 3–6 are scheduled as one-shot cron jobs.
now = <current wall-clock time>
check_times = [now, now+30s, now+90s, now+270s, now+870s, now+2070s]
(or project override offsets if set in CLAUDE.md)
Run check 1 now. Then sleep 30 and run check 2. These happen synchronously
because 30 seconds is below cron's 1-minute resolution.
At each check, print:
[Check N/6 — T+{ELAPSED}s] Running monitoring queries...
Run all available monitoring queries (§4a–4d). Skip tools gracefully if credentials are absent; record which were skipped.
Apply early-exit rules after each check (see below).
After check 2, schedule checks 3–6 using CronCreate with recurring: false.
Compute the cron expression from the wall-clock fire time for each check:
check 3 → fire at {HH}:{MM} (T+90s rounded to nearest minute)
check 4 → fire at {HH}:{MM} (T+270s)
check 5 → fire at {HH}:{MM} (T+870s)
check 6 → fire at {HH}:{MM} (T+2070s)
Each cron prompt must be self-contained. Use this template:
Deployment monitor check {N}/6 for {OWNER}/{REPO}#{PR_NUMBER} ("{PR_TITLE}").
Merged at: {MERGED_AT}. Deploy confirmed at: {DEPLOY_CONFIRMED_AT}. Elapsed: ~T+{OFFSET}s.
Remaining cron job IDs to cancel on early exit: {JOB_ID_LIST}.
Run the monitoring queries from §4a–4d of the monitor-deployment skill.
Synthesise a verdict per §5. Send the macOS notification per §6.
Present findings and recommendation per §7.
Early-exit: if verdict is INCIDENT or DEPLOY_FAILED, call CronDelete on each
remaining job ID. If this is check 4+ and verdict is HEALTHY, and the previous
check was also HEALTHY, call CronDelete on remaining jobs and send a final
"deployment stable" notification.
Store all returned job IDs immediately after scheduling so they can be passed into each prompt and cancelled if needed.
Early-exit rules (apply after every check):
INCIDENT or DEPLOY_FAILED → cancel all remaining cron jobs, skip to §5HEALTHY results → cancel remaining jobs,
send final healthy notification and stopUse the Sentry MCP tools (already configured — no credentials needed):
mcp__sentry__search_issues(query="is:unresolved firstSeen:>{MERGED_AT}", limit=20)
If a specific issue looks related to the deploy, fetch full details:
mcp__sentry__get_issue_details(issue_id="{ISSUE_ID}")
Flag as a problem if:
fatal or error-level issues appeared after mergedAtRequires DD_API_KEY and DD_APP_KEY in the environment.
# Error rate metric for the past window
curl -s "https://api.datadoghq.com/api/v1/query" \
-H "DD-API-KEY: $DD_API_KEY" \
-H "DD-APPLICATION-KEY: $DD_APP_KEY" \
-G --data-urlencode "from=$(date -d '{MERGED_AT}' +%s)" \
--data-urlencode "to=$(date +%s)" \
--data-urlencode "query=avg:trace.servlet.request.errors{env:production}.as_rate()" \
| jq '.series[0].pointlist | map(.[1])'
Adapt the metric name to match the project's conventions. Ask the user for the correct service tag/metric if unknown.
Flag as a problem if:
curl -s "https://api.pagerduty.com/incidents" \
-H "Authorization: Token token=$PAGERDUTY_TOKEN" \
-H "Accept: application/vnd.pagerduty+json;version=2" \
-G --data-urlencode "statuses[]=triggered" \
--data-urlencode "statuses[]=acknowledged" \
--data-urlencode "since={MERGED_AT}" \
| jq '[.incidents[] | {id,title,urgency,created_at,service}]'
Any open high-urgency incident after mergedAt is an automatic red flag.
gh api repos/{OWNER}/{REPO}/commits/{MERGE_COMMIT_SHA}/check-runs \
--jq '[.check_runs[] | {name,status,conclusion,details_url}]'
Evaluate all signals and assign one of four verdicts:
| Verdict | Meaning |
|---|---|
| HEALTHY | No errors, no anomalies, all checks green |
| DEGRADED | Minor error spike or partial check failure — elevated risk |
| INCIDENT | Active errors, paging alerts, or failed checks |
| DEPLOY_FAILED | The deploy workflow itself did not succeed |
| UNKNOWN | Too many checks were skipped to make a reliable call |
osascript -e 'display notification "{BODY}" with title "Deployment: {VERDICT}" subtitle "{PR_TITLE}" sound name "{SOUND}"'
Sound mapping:
HEALTHY → "Glass"DEGRADED → "Purr"INCIDENT / DEPLOY_FAILED → "Basso"UNKNOWN → "Tink"Body format (keep under 100 chars):
HEALTHY: "All checks green at T+{ELAPSED}. No action needed."DEGRADED: "Check {N}: error spike at T+{ELAPSED}. Monitor — see Claude."INCIDENT: "URGENT: Active errors at T+{ELAPSED}. Immediate action required."DEPLOY_FAILED: "Deploy workflow failed. PR changes not in production."UNKNOWN: "Check {N}: incomplete data at T+{ELAPSED}. Manual check needed."Print a concise report in the conversation:
## Deployment Health: {VERDICT}
PR: {TITLE} (#{PR_NUMBER})
Merged: {MERGED_AT}
Monitoring window: {N} min
### Signals
- GitHub Actions: {pass/fail/skipped}
- Sentry: {clean/N new errors/skipped}
- Datadog: {normal/elevated/skipped}
- PagerDuty: {no incidents/N open/skipped}
### Recommendation
{see below}
HEALTHY
All signals are green. No action required. Consider updating your ticket/PR as successfully deployed.
DEGRADED
Error rate is elevated but not critical. Continue monitoring for the next {N} minutes. If it doesn't recover, prepare a rollback. Do not deploy further changes until stable.
INCIDENT
Active errors detected after this deploy. Recommended actions in order:
- Revert — if the change is isolatable and a revert is low-risk:
gh pr revert {PR_NUMBER} --repo {OWNER}/{REPO}- Hotfix — if reverting would cause more disruption, prepare a targeted fix on a new branch and fast-track it through review.
- Feature flag — if the feature can be toggled off without a deploy, disable it immediately and investigate.
Alert your team now. Do not wait for auto-recovery.
DEPLOY_FAILED
The deploy pipeline did not complete. Production is unchanged. Check the workflow logs: {RUN_URL} Fix the pipeline issue and re-trigger the deploy, or escalate to the platform/infrastructure team if the failure is not in your code.
UNKNOWN
Not enough monitoring data was available to make a reliable assessment. Manually verify error rates in Sentry/Datadog before declaring the deploy healthy. Checks skipped: {LIST}.
If the user resolves an INCIDENT (reverts, hotfixes, or feature-flags the
issue) and wants to re-verify, first cancel any remaining cron jobs from the
previous run (CronList → CronDelete each), then run the skill again from
scratch. The new run restarts the check schedule from T+0 against the new deploy.
Session caveat: cron jobs are session-only. If Claude exits, all scheduled checks are lost. For unattended overnight monitoring, use an external scheduler (GitHub Actions workflow dispatch, system cron) that re-invokes Claude via API.
development
Run a weekly achievement review - pulls from Jira, GitHub, and Slack to capture what you shipped in the last week, maps achievements to your 2026 goals, and appends impact-focused entries to your brag doc. Use when asked to "do a weekly review", "capture this week's wins", "update my brag doc", "what did I ship this week", "record my achievements", "what have I done this week", "add to my performance doc", or anything about tracking weekly progress, brag doc entries, or performance evidence. Trigger even if the user just says "weekly review" or "document what I did".
testing
Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
tools
Set up a project update config for the current repo, so that running project-update requires no setup questions. Use when asked to "set up project updates", "configure project update", "initialise project update", or "create a project update config". Run this once per project repo.
testing
Find the highest-frequency unresolved Sentry error for the VAT & Invoicing or Billing team, understand its root cause, create a Jira ticket in the APP project, implement a fix, and open a draft PR. Use when asked to "fix sentry issues", "triage sentry errors", "look at sentry", "what's broken in sentry", "create a fix for a sentry issue", or "sentry triage". Runs the full flow autonomously in the background.