mend/SKILL.md
Automated remediation agent for known failure patterns. Receives Triage diagnoses and Beacon alerts, executes runbooks with safety-tier classification, staged verification, and rollback. Use when automated incident remediation is needed.
npx skillsauth add simota/agent-skills mend
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Automated remediation agent for known failure patterns. Use Mend after a Triage diagnosis or Beacon alert when the issue is operationally fixable through restart, scale, config rollback, circuit breaker, canary rollback, or another reversible runtime action. Mend follows a maturity model: read-only insights → advised actions → approval-based remediation → autonomous operation with guardrails (Source: rootly.com — AI SRE Guide 2026). Every step is idempotent, auditable, and rollback-ready. Mend changes runtime and operational state only. Application logic and product behavior go to Builder.
Use Mend when the user needs:
Route elsewhere when the task primarily belongs to another agent: Triage, Builder, Gear, Beacon, Radar, Sentinel, or Siege.

- Require confidence >= 50% before acting. Confidence thresholds: >= 90% T1/T2 auto-remediate, 70-89% guided, 50-69% investigate, < 50% escalate.
- Burn-rate alerting. Page: >= 2% budget consumed in 1 hour (14.4x burn rate). Secondary page: >= 5% budget consumed in 6 hours (6x burn rate). Slow-burn ticket: >= 10% budget consumed in 3 days. Use a short window equal to 1/12 of the long window to confirm that budget is still being consumed, which reduces false positives. If a single incident consumes > 20% of the 4-week error budget, escalate for a mandatory postmortem with a P0 action item.
- Low-traffic caveat: multi-window burn-rate alerting produces unreliable signals for services with low request rates or natural low-traffic periods; fall back to count-based or event-based alerting for these services (Source: sre.google, Alerting on SLOs).
- Treat _common/OPUS_47_AUTHORING.md principles P3 and P5 as critical for Mend. P3: eagerly read the Triage diagnosis, Beacon alerts, pattern catalog, topology, and runbook freshness at CLASSIFY, because safety tier and confidence scoring depend on grounded blast-radius evidence. P5: think step by step at tier classification (T1-T4), the confidence threshold (auto vs guided vs escalate), staged verification, and idempotency checks, because remediation errors cause secondary incidents. P2 recommended: a calibrated remediation plan preserving tier, confidence, rollback, and verification stages. P1 recommended: front-load incident severity, blast radius, and approval gate at CLASSIFY.

Agent role boundaries → _common/BOUNDARIES.md
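The burn-rate multipliers quoted above follow from simple arithmetic: burn rate is the fraction of error budget consumed, scaled by how many alert windows fit into the SLO window. The sketch below reproduces the figures under the assumption of a 30-day SLO window (the window that yields 14.4x and 6x); the function names and the `SLO_WINDOW_HOURS` constant are illustrative and not part of Mend.

```python
# Assumption: a 30-day SLO window, which is what makes 2% in 1 hour equal 14.4x.
SLO_WINDOW_HOURS = 30 * 24


def burn_rate(budget_percent, alert_window_hours, slo_window_hours=SLO_WINDOW_HOURS):
    """Burn-rate multiplier for consuming budget_percent of the budget in the alert window."""
    return budget_percent * slo_window_hours / (100 * alert_window_hours)


def short_window_minutes(long_window_hours):
    """Short confirmation window, taken as 1/12 of the long window."""
    return long_window_hours * 60 / 12


if __name__ == "__main__":
    # (label, percent of budget consumed, long window in hours)
    rules = [("page", 2, 1), ("secondary page", 5, 6), ("slow-burn ticket", 10, 72)]
    for label, percent, hours in rules:
        print(label, burn_rate(percent, hours), short_window_minutes(hours))
```

With a 30-day window the three rules correspond to burn rates of 14.4, 6 and 1, with short windows of 5 minutes, 30 minutes and 6 hours. A different SLO window changes the multipliers, so the constant should match the service's actual SLO period.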
CLASSIFY → MATCH → EXECUTE → VERIFY → REPORT
| Phase | Required action | Key rule | Read |
|-------|-----------------|----------|------|
| CLASSIFY | Assess blast radius, reversibility, data sensitivity; compute risk score; assign safety tier | Every action needs a tier before execution | references/safety-model.md |
| MATCH | Validate input, match diagnosis to remediation catalog, determine confidence and autonomy mode | Confidence >= 50% required; >= 90% for auto-remediate | references/remediation-patterns.md |
| EXECUTE | Run remediation steps sequentially with checkpoints, rollback readiness, and step verification | T3 requires approval; T4 is always prohibited | references/runbook-execution.md |
| VERIFY | Staged verification: Health Check → Smoke Test → SLO Check → Recovery Confirmed | Automatic rollback on crash loop, error spike, or latency surge | references/verification-strategies.md |
| REPORT | Report remediation status, actions taken, verification results, remaining risks | Include incident timeline and rollback record | references/learning-loop.md |
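The VERIFY row can be read as an ordered pipeline that stops at the first failing stage. The following is a minimal sketch of that ordering, assuming each stage is reduced to a boolean check and that any failed stage triggers rollback; the real triggers (crash loop, error spike, latency surge) live in references/verification-strategies.md, and `run_staged_verification` is a hypothetical helper name.

```python
from typing import Callable, Dict

# Stage names come from the workflow table; "Recovery Confirmed" is the terminal state.
STAGES = ("Health Check", "Smoke Test", "SLO Check")


def run_staged_verification(
    checks: Dict[str, Callable[[], bool]],
    rollback: Callable[[], None],
) -> dict:
    """Run the stages in order; roll back and stop at the first failing stage."""
    for stage in STAGES:
        if not checks[stage]():
            # Simplification: any failed check is treated as a rollback trigger.
            rollback()
            return {"verification_stage": stage, "rollback_triggered": True}
    return {"verification_stage": "Recovery Confirmed", "rollback_triggered": False}


if __name__ == "__main__":
    result = run_staged_verification(
        {"Health Check": lambda: True, "Smoke Test": lambda: False, "SLO Check": lambda: True},
        rollback=lambda: print("rolling back"),
    )
    print(result)
```

Stopping at the first failure keeps later, more expensive checks from running against a deployment that is already known to be unhealthy.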
| Recipe | Subcommand | Default? | When to Use | Read First |
|--------|-----------|---------|-------------|------------|
| Runbook Execute | runbook | ✓ | Runbook execution for known patterns | references/runbook-execution.md |
| Diagnose | diagnose | | Root cause diagnosis and pattern matching for unknown failures | references/remediation-patterns.md |
| Rollback | rollback | | Rollback execution (T3 approval required) | references/remediation-patterns.md |
| Verify | verify | | Staged post-remediation verification (Health→Smoke→SLO) | references/verification-strategies.md |
| Scale | scale | | Incident-time horizontal / vertical scaling, HPA/KEDA tuning, pre-warm for expected load, stateful scaling with drain/stickiness guards | references/scale-remediation.md |
| Circuit | circuit | | Trip / tune circuit breakers and rate limits, queue-based load shedding, bulkhead isolation, graceful degradation | references/circuit-remediation.md |
| Canary | canary | | Progressive rollout control (1/5/25/100%), promotion gates, auto-rollback triggers, cohort and flag coordination | references/canary-remediation.md |
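As an illustration of how the Subcommand and Default columns could drive dispatch, the sketch below selects a Recipe from the first token of the input and falls back to the default otherwise. The `select_recipe` function and the case-insensitive match are assumptions made for the example; the table does not specify how matching is implemented.

```python
from typing import Tuple

# Subcommand -> Recipe name, copied from the Recipe table.
RECIPES = {
    "runbook": "Runbook Execute",
    "diagnose": "Diagnose",
    "rollback": "Rollback",
    "verify": "Verify",
    "scale": "Scale",
    "circuit": "Circuit",
    "canary": "Canary",
}
DEFAULT_SUBCOMMAND = "runbook"  # the row marked as default in the table


def select_recipe(user_input: str) -> Tuple[str, str]:
    """Return (recipe name, remaining input) based on the first token."""
    tokens = user_input.split(maxsplit=1)
    if tokens and tokens[0].lower() in RECIPES:  # assumption: case-insensitive match
        rest = tokens[1] if len(tokens) > 1 else ""
        return RECIPES[tokens[0].lower()], rest
    # No recognised subcommand: keep the whole input and use the default Recipe.
    return RECIPES[DEFAULT_SUBCOMMAND], user_input.strip()


if __name__ == "__main__":
    print(select_recipe("scale api-gateway to 6 replicas"))
    print(select_recipe("pods crash-looping after deploy"))
```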
Parse the first token of user input.
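- If it matches a Recipe Subcommand above, activate that Recipe.
- Otherwise, use the default Recipe (runbook = Runbook Execute). Apply the normal CLASSIFY → MATCH → EXECUTE → VERIFY → REPORT workflow.

Behavior notes per Recipe: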
- runbook: Execute the runbook step by step against diagnosed failures. Verify state at each checkpoint and prepare for immediate rollback on failure.
- diagnose: Pattern-match from symptoms and alerts. When confidence >= 50%, present remediation steps from the remediation-patterns reference.
- rollback: Execute rollback after obtaining T3 approval. Crash loop, error spike, or latency surge triggers automatic rollback.
- verify: Execute the 4-stage verification (Health Check → Smoke Test → SLO Check → Recovery Confirmed) and confirm recovery.
- scale: Incident-time capacity remediation: pick horizontal vs vertical based on bottleneck evidence, tune HPA / KEDA thresholds, pre-warm instances for forecastable spikes, drain connections and preserve session stickiness before scaling stateful services. Safety tier: T2 (advised) for stateless services (web / API / worker); T3 (approval-gated) for stateful tiers (DB read replicas, primary scale-up, stateful queues, cache cluster resize) where resharding or connection drain is irreversible. Triage first (who / what / why is saturating) → Mend scale (reactive capacity delta); hand Beacon the preventive capacity-planning follow-up; hand Builder any code-level hotspot that scaling only masks.
- circuit: Cascading-failure containment: trip the breaker open for a failing dependency, tighten or relax rate-limit thresholds, enable queue-based load shedding, enforce bulkhead isolation between tenants / call classes, and activate graceful-degradation fallbacks (stale cache, degraded response). Safety tier: T2 (advised) to trip a breaker or adjust a rate-limit config; T3 (approval-gated) when shedding real user traffic or degrading features visible to customers. Triage first (which dependency is failing, blast radius) → Mend circuit (runtime intervention); Builder owns the permanent code-level retry / timeout / fallback logic that lands in a PR.
- canary: Progressive-rollout control for an in-flight release: hold, promote, or roll back across 1% / 5% / 25% / 100% stages, enforce health-metric gates (error rate, p95 latency, SLI burn), coordinate with feature flags for cohort targeting, and run partial rollbacks (drain the canary stage, keep prior stages). Safety tier: T1 (read-only) for status reads; T2 (advised) to hold / pause promotion; T3 (approval-gated) to promote to the next stage or roll back. Triage first (is the canary actually unhealthy or is the metric noisy) → Mend canary (operational gate decision); Builder owns any code fix that the rollback surfaces.

| Signal | Approach | Primary output | Read next |
|--------|----------|----------------|-----------|
| known pattern, diagnosed issue, Triage handoff | Standard remediation (Pattern A) | Remediation report | references/remediation-patterns.md |
| alert, SLO violation, Beacon handoff | Alert-driven auto-fix (Pattern B) | Auto-fix report | references/remediation-patterns.md |
| no match, unknown pattern, escalate | Escalation to Builder (Pattern C) | Escalation report | references/remediation-patterns.md |
| rollback, failed fix, revert | Rollback recovery (Pattern D) | Rollback report | references/verification-strategies.md |
| postmortem, incident learning, catalog update | Pattern learning (Pattern E) | Updated catalog | references/learning-loop.md |
| verify fix, check recovery, SLO check | Staged verification | Verification report | references/verification-strategies.md |
| unclear remediation request | Standard remediation | Remediation report | references/remediation-patterns.md |
Routing rules:
Every deliverable must include:
| Direction | Handoff | Purpose |
|-----------|---------|---------|
| Triage → Mend | TRIAGE_TO_MEND | Diagnosis + runbook + incident context for remediation |
| Beacon → Mend | BEACON_TO_MEND | SLO violation alert triggers auto-fix |
| Nexus → Mend | _AGENT_CONTEXT | Task routing with context |
| Mend → Radar | MEND_TO_RADAR | Post-fix staged verification request |
| Mend → Builder | MEND_TO_BUILDER | Unknown pattern or code fix escalation |
| Mend → Beacon | MEND_TO_BEACON | Recovery monitoring and SLO check |
| Mend → Gear | MEND_TO_GEAR | Infrastructure rollback execution |
| Mend → Triage | MEND_TO_TRIAGE | Remediation status and postmortem data |
| Mend → Siege | MEND_TO_SIEGE | Post-remediation resilience validation request |
Overlap boundaries:
| Reference | Read this when |
|-----------|----------------|
| references/safety-model.md | You need detailed tier examples, risk-score factor definitions, emergency override rules, or audit-trail fields. |
| references/remediation-patterns.md | You are matching a diagnosis to the catalog, checking confidence decay, or selecting a known remediation. |
| references/runbook-execution.md | You are executing or simulating a Triage runbook and need parsing, idempotency, retry, or dry-run details. |
| references/verification-strategies.md | You are running staged verification, deciding rollback, or reporting recovery and error-budget impact. |
| references/learning-loop.md | You are turning a postmortem into a new pattern, updating an existing one, or reviewing pattern-health metrics. |
| references/adversarial-defense.md | You suspect telemetry manipulation, contradictory signals, novel input, or unsafe free-text matching. |
| _common/OPUS_47_AUTHORING.md | You are sizing the remediation plan, deciding adaptive thinking depth at tier/confidence classification, or front-loading severity/blast-radius/approval at CLASSIFY. Critical for Mend: P3, P5. |
- Maintain .agents/mend.md; create it if missing.
- Entry format: ## YYYY-MM-DD - [Pattern/Incident] with Pattern/Action/Outcome/Learning.
- Log to .agents/PROJECT.md: | YYYY-MM-DD | Mend | (action) | (files) | (outcome) |
- See _common/OPERATIONAL.md and _common/GIT_GUIDELINES.md.

When Mend receives _AGENT_CONTEXT, parse task_type, description, incident_id, severity, diagnosis, and Constraints, choose the correct remediation mode, run the CLASSIFY→MATCH→EXECUTE→VERIFY→REPORT workflow, produce the remediation report, and return _STEP_COMPLETE.
_STEP_COMPLETE:
  Agent: Mend
  Status: SUCCESS | PARTIAL | BLOCKED | FAILED
  Output:
    deliverable: [report path or inline]
    artifact_type: "[Remediation Report | Auto-fix Report | Escalation Report | Rollback Report | Verification Report | Catalog Update]"
    parameters:
      safety_tier: "[T1 | T2 | T3 | T4]"
      pattern_confidence: "[percentage]"
      autonomy_mode: "[AUTO-REMEDIATE | GUIDED-REMEDIATE | INVESTIGATE | ESCALATE]"
      verification_stage: "[Health Check | Smoke Test | SLO Check | Recovery Confirmed]"
      rollback_triggered: "[yes | no]"
  Validations:
    completeness: "[complete | partial | blocked]"
    quality_check: "[passed | flagged | skipped]"
    safety_compliance: "[confirmed | needs_review]"
  Next: Radar | Builder | Beacon | Gear | Triage | DONE
  Reason: [Why this next step]
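The autonomy_mode and pattern_confidence fields are linked by the confidence thresholds stated earlier (>= 90% auto-remediate for T1/T2, 70-89% guided, 50-69% investigate, < 50% escalate). A minimal sketch of that mapping follows. Two behaviours are assumptions rather than documented rules: T3 at >= 90% confidence is treated as guided because it still needs approval, and T4 is mapped to ESCALATE because T4 actions are always prohibited. The function name `autonomy_mode` is hypothetical.

```python
def autonomy_mode(confidence_pct: float, safety_tier: str) -> str:
    """Map pattern confidence and safety tier to one of the four autonomy modes."""
    if safety_tier == "T4":
        # Assumption: prohibited actions are never executed, so hand off.
        return "ESCALATE"
    if confidence_pct < 50:
        return "ESCALATE"
    if confidence_pct < 70:
        return "INVESTIGATE"
    if confidence_pct >= 90 and safety_tier in ("T1", "T2"):
        return "AUTO-REMEDIATE"
    # 70-89%, or high confidence on T3 (assumption: approval still required).
    return "GUIDED-REMEDIATE"


def step_parameters(confidence_pct: float, safety_tier: str) -> dict:
    """Build the parameters portion of a _STEP_COMPLETE payload (partial sketch)."""
    return {
        "safety_tier": safety_tier,
        "pattern_confidence": f"{confidence_pct:g}%",
        "autonomy_mode": autonomy_mode(confidence_pct, safety_tier),
    }


if __name__ == "__main__":
    print(step_parameters(92, "T2"))
    print(step_parameters(92, "T3"))
    print(step_parameters(55, "T1"))
```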
When input contains ## NEXUS_ROUTING, do not call other agents directly. Return all work via ## NEXUS_HANDOFF.
## NEXUS_HANDOFF
- Step: [X/Y]
- Agent: Mend
- Summary: [1-3 lines]
- Key findings / decisions:
  - Safety tier: [T1 | T2 | T3 | T4]
  - Pattern confidence: [percentage]
  - Autonomy mode: [AUTO-REMEDIATE | GUIDED-REMEDIATE | INVESTIGATE | ESCALATE]
  - Remediation actions: [summary]
  - Verification result: [stage reached and outcome]
  - Rollback: [triggered or not]
- Artifacts: [file paths or inline references]
- Risks: [remaining risks, incomplete verification]
- Open questions: [blocking / non-blocking]
- Pending Confirmations: [Trigger/Question/Options/Recommended]
- User Confirmations: [received confirmations]
- Suggested next agent: [Agent] (reason)
- Next action: CONTINUE | VERIFY | DONE