# Incident Commander Skill
Category: Engineering Team
Tier: POWERFUL
Author: Claude Skills Team
Version: 1.0.0
Last Updated: February 2026
## Overview

The Incident Commander skill provides an incident response framework for managing technology incidents from detection through resolution and post-incident review (PIR). It packages established SRE and DevOps practices into structured tools for severity classification, timeline reconstruction, and post-incident analysis.
Incident Classifier (incident_classifier.py)
Timeline Reconstructor (timeline_reconstructor.py)
PIR Generator (pir_generator.py)
### SEV1

**Definition:** Complete service failure affecting all users or critical business functions
**Characteristics:**
**Response Requirements:**
**Communication Frequency:** Every 15 minutes until resolution

### SEV2

**Definition:** Significant degradation affecting a subset of users or non-critical functions
**Characteristics:**
**Response Requirements:**
**Communication Frequency:** Every 30 minutes during active response

### SEV3

**Definition:** Limited impact with workarounds available
**Characteristics:**
**Response Requirements:**
**Communication Frequency:** At key milestones only

### SEV4

**Definition:** Minimal impact, cosmetic issues, or planned maintenance
**Characteristics:**
**Response Requirements:**
**Communication Frequency:** Standard development cycle updates
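
The four definitions above map, in order, onto SEV1 through SEV4. As a rough illustration, they could be encoded as a lookup table with a simple decision rule. This is a minimal sketch, not the logic of `incident_classifier.py`: the `classify` function, its boolean inputs, and the order of its checks are assumptions made for illustration. Only the definitions and communication frequencies come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityLevel:
    name: str
    definition: str
    communication_frequency: str


# Definitions and update cadences copied from the severity levels above.
SEVERITY_LEVELS = {
    1: SeverityLevel(
        "SEV1",
        "Complete service failure affecting all users or critical business functions",
        "Every 15 minutes until resolution",
    ),
    2: SeverityLevel(
        "SEV2",
        "Significant degradation affecting a subset of users or non-critical functions",
        "Every 30 minutes during active response",
    ),
    3: SeverityLevel(
        "SEV3",
        "Limited impact with workarounds available",
        "At key milestones only",
    ),
    4: SeverityLevel(
        "SEV4",
        "Minimal impact, cosmetic issues, or planned maintenance",
        "Standard development cycle updates",
    ),
}


def classify(
    complete_failure: bool,
    significant_degradation: bool,
    workaround_available: bool,
) -> SeverityLevel:
    """Hypothetical decision rule; the inputs and ordering are assumptions."""
    if complete_failure:
        return SEVERITY_LEVELS[1]
    if significant_degradation and not workaround_available:
        return SEVERITY_LEVELS[2]
    if significant_degradation:
        return SEVERITY_LEVELS[3]
    return SEVERITY_LEVELS[4]
```

Checking the most severe condition first means an incident that matches several definitions is never under-classified. A real classifier would likely weigh more signals, such as affected-user percentage and business impact.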
Command and Control
Communication Hub
Process Management
Post-Incident Leadership
Emergency Decisions (SEV1/2):
Resource Allocation:
Technical Decisions:
Subject: [SEV{severity}] {Service Name} - {Brief Description}
Incident Details:
- Start Time: {timestamp}
- Severity: SEV{level}
- Impact: {user impact description}
- Current Status: {investigating/mitigating/resolved}
Technical Details:
- Affected Services: {service list}
- Symptoms: {what users are experiencing}
- Initial Assessment: {suspected root cause if known}
Response Team:
- Incident Commander: {name}
- Technical Lead: {name}
- SMEs Engaged: {list}
Next Update: {timestamp}
Status Page: {link}
War Room: {bridge/chat link}
---
{Incident Commander Name}
{Contact Information}
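
A template like the one above can be filled in programmatically so that no placeholder is sent out unpopulated. The sketch below uses Python's standard `string.Template` on an abbreviated version of the notification. The field names (`severity`, `service`, `summary`, and so on) and the `render_notification` helper are hypothetical stand-ins for the `{…}` placeholders, not part of the skill.

```python
from string import Template

# Abbreviated version of the notification template above; the $-style
# field names are illustrative replacements for the {…} placeholders.
NOTIFICATION = Template(
    "Subject: [SEV$severity] $service - $summary\n"
    "\n"
    "Incident Details:\n"
    "- Start Time: $start_time\n"
    "- Severity: SEV$severity\n"
    "- Impact: $impact\n"
    "- Current Status: $status\n"
    "\n"
    "Next Update: $next_update\n"
)


def render_notification(**fields: str) -> str:
    """Fill the template; substitute() raises KeyError if a field is missing."""
    return NOTIFICATION.substitute(**fields)
```

Using `substitute()` rather than `safe_substitute()` is deliberate here: a missing field fails loudly instead of producing a notification with a raw placeholder in it.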
Subject: URGENT - Customer-Impacting Outage - {Service Name}
Executive Summary:
{2-3 sentence description of customer impact and business implications}
Key Metrics:
- Time to Detection: {X minutes}
- Time to Engagement: {X minutes}
- Estimated Customer Impact: {number/percentage}
- Current Status: {status}
- ETA to Resolution: {time or "investigating"}
Leadership Actions Required:
- [ ] Customer communication approval
- [ ] PR/Communications coordination
- [ ] Resource allocation decisions
- [ ] External vendor engagement
Incident Commander: {name} ({contact})
Next Update: {time}
---
This is an automated alert from our incident response system.
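
The Key Metrics in the executive template can be derived from three timestamps. The sketch below assumes Time to Detection runs from incident start to the first alert and Time to Engagement runs from that alert to the first responder joining. The source does not define either metric, so adjust the intervals to your organization's definitions. The `key_metrics` function and its parameter names are hypothetical.

```python
from datetime import datetime


def minutes_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


def key_metrics(started: datetime, detected: datetime, engaged: datetime) -> dict:
    """Assumed definitions: detection = start to alert, engagement = alert to responder."""
    return {
        "time_to_detection_min": minutes_between(started, detected),
        "time_to_engagement_min": minutes_between(detected, engaged),
    }
```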
We are currently experiencing {brief description of issue} affecting {scope of impact}.
Our engineering team was alerted at {time} and is actively working to resolve the issue. We will provide updates every {frequency} until resolved.
What we know:
- {factual statement of impact}
- {factual statement of scope}
- {brief status of response}
What we're doing:
- {primary response action}
- {secondary response action}
Workaround (if available):
{workaround steps or "No workaround currently available"}
We apologize for the inconvenience and will share more information as it becomes available.
Next update: {time}
Status page: {link}
Internal Stakeholders:
External Stakeholders:
| Stakeholder            | SEV1      | SEV2  | SEV3     | SEV4      |
| ---------------------- | --------- | ----- | -------- | --------- |
| Engineering Leadership | Real-time | 30min | 4hrs     | Daily     |
| Executive Team         | 15min     | 1hr   | EOD      | Weekly    |
| Customer Support       | Real-time | 30min | 2hrs     | As needed |
| Customers              | 15min     | 1hr   | Optional | None      |
| Partners               | 30min     | 2hrs  | Optional | None      |
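
The cadence matrix can drive update reminders directly. The following sketch encodes only the cells that state a fixed interval in minutes or hours. Entries such as "Real-time", "EOD", "Daily", "Weekly", "Optional", and "As needed" are left out, because turning them into timers would require policy decisions the matrix does not state. The `next_update` helper is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Fixed-interval cells from the stakeholder matrix above.
UPDATE_INTERVALS = {
    ("Engineering Leadership", "SEV2"): timedelta(minutes=30),
    ("Engineering Leadership", "SEV3"): timedelta(hours=4),
    ("Executive Team", "SEV1"): timedelta(minutes=15),
    ("Executive Team", "SEV2"): timedelta(hours=1),
    ("Customer Support", "SEV2"): timedelta(minutes=30),
    ("Customer Support", "SEV3"): timedelta(hours=2),
    ("Customers", "SEV1"): timedelta(minutes=15),
    ("Customers", "SEV2"): timedelta(hours=1),
    ("Partners", "SEV1"): timedelta(minutes=30),
    ("Partners", "SEV2"): timedelta(hours=2),
}


def next_update(
    stakeholder: str, severity: str, last_update: datetime
) -> Optional[datetime]:
    """Return when the next update is due, or None if no fixed interval applies."""
    interval = UPDATE_INTERVALS.get((stakeholder, severity))
    if interval is None:
        return None
    return last_update + interval
```

Returning `None` for non-fixed cells keeps the caller responsible for cases like "Real-time" or "Optional", where a human judgment is implied.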
Detection Playbooks
Response Playbooks
Recovery Playbooks
# {Service/Component} Incident Response Runbook
## Quick Reference
- **Severity Indicators:** {list of conditions for each severity level}
- **Key Contacts:** {on-call rotations and escalation paths}
- **Critical Commands:** {list of emergency commands with descriptions}
## Detection
### Monitoring Alerts
- {Alert name}: {description and thresholds}
- {Alert name}: {description and thresholds}
### Manual Detection Signs
- {Symptom}: {what to look for and where}
- {Symptom}: {what to look for and where}
## Initial Response (0-15 minutes)
1. **Assess Severity**
   - [ ] Check {primary metric}
   - [ ] Verify {secondary indicator}
   - [ ] Classify as SEV{level} based on {criteria}
2. **Establish Command**
   - [ ] Page Incident Commander if SEV1/2
   - [ ] Create incident tracking ticket
   - [ ] Join war room: {link/bridge info}
3. **Initial Investigation**
   - [ ] Check recent deployments: {deployment log location}
   - [ ] Review error logs: {log location and queries}
   - [ ] Verify dependencies: {dependency check commands}
## Mitigation Strategies
### Strategy 1: {Name}
**Use when:** {conditions}
**Steps:**
1. {detailed step with commands}
2. {detailed step with expected outcomes}
3. {validation step}
**Rollback Plan:**
1. {rollback step}
2. {verification step}
### Strategy 2: {Name}
{similar structure}
## Recovery and Validation
1. **Service Restoration**
   - [ ] {restoration step}
   - [ ] Wait for {metric} to return to normal
   - [ ] Validate end-to-end functionality
2. **Communication**
   - [ ] Update status page
   - [ ] Notify stakeholders
   - [ ] Schedule PIR
## Common Pitfalls
- **{Pitfall}:** {description and how to avoid}
- **{Pitfall}:** {description and how to avoid}
## Reference Information
→ See references/reference-information.md for details
## Usage Examples
### Example 1: Database Connection Pool Exhaustion
```bash
# Classify the incident
echo '{"description": "Users reporting 500 errors, database connections timing out", "affected_users": "80%", "business_impact": "high"}' | python scripts/incident_classifier.py
# Reconstruct timeline from logs
python scripts/timeline_reconstructor.py --input assets/db_incident_events.json --output timeline.md
# Generate PIR after resolution
python scripts/pir_generator.py --incident assets/db_incident_data.json --timeline timeline.md --output pir.md
```
### Example 2: API Rate Limit Failures
```bash
# Quick classification from stdin
echo "API rate limits causing customer API calls to fail" | python scripts/incident_classifier.py --format text
# Build timeline from multiple sources
python scripts/timeline_reconstructor.py --input assets/api_incident_logs.json --detect-phases --gap-analysis
# Generate comprehensive PIR
python scripts/pir_generator.py --incident assets/api_incident_summary.json --rca-method fishbone --action-items
```
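
To give a feel for what timeline reconstruction with gap analysis involves, here is a minimal sketch. It is not the implementation of `timeline_reconstructor.py`, and the event schema is an assumption: each event is taken to be a dict with an ISO 8601 `timestamp` string and a `message`. The 15-minute default threshold is also illustrative.

```python
from datetime import datetime, timedelta


def build_timeline(events: list) -> list:
    """Order events chronologically by their ISO 8601 timestamp."""
    return sorted(events, key=lambda e: datetime.fromisoformat(e["timestamp"]))


def find_gaps(timeline: list, threshold: timedelta = timedelta(minutes=15)) -> list:
    """Return (earlier message, later message, gap) for gaps longer than threshold."""
    gaps = []
    for earlier, later in zip(timeline, timeline[1:]):
        delta = datetime.fromisoformat(later["timestamp"]) - datetime.fromisoformat(
            earlier["timestamp"]
        )
        if delta > threshold:
            gaps.append((earlier["message"], later["message"], delta))
    return gaps
```

Long gaps between consecutive events often point to periods where nothing was logged, which is worth raising during the post-incident review.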
Maintain Calm Leadership
Document Everything
Effective Communication
Technical Excellence
Blameless Culture
Action Item Discipline
Knowledge Sharing
Continuous Improvement
The Incident Commander skill provides a comprehensive framework for managing incidents from detection through post-incident review. By implementing structured processes, clear communication templates, and thorough analysis tools, teams can improve their incident response capabilities and build more resilient systems.
The key to successful incident management is preparation, practice, and continuous learning. Use this framework as a starting point, but adapt it to your organization's specific needs, culture, and technical environment.
Remember: the goal is not to prevent every incident, which is impossible, but to detect incidents quickly, respond effectively, communicate clearly, and learn continuously.
Creator: Engineering Team
License: MIT
Source Repo: neekware/ehaye-skills
Source Bucket: engineering-team
Original Path: engineering-team/incident-commander