skills/davila7/it-operations/SKILL.md
Manages IT infrastructure, monitoring, incident response, and service reliability. Provides frameworks for ITIL service management, observability strategies, automation, backup/recovery, capacity planning, and operational excellence practices.
npx skillsauth add aiskillstore/marketplace it-operationsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
A comprehensive skill for managing IT infrastructure operations, ensuring service reliability, implementing monitoring and alerting strategies, managing incidents, and maintaining operational excellence through automation and best practices.
1. MONITORING & OBSERVABILITY
├─ Define SLIs/SLOs/SLAs for critical services
├─ Implement metrics collection (infrastructure, application, business)
├─ Configure alerting with proper thresholds and escalation
├─ Build dashboards for different audiences (ops, devs, executives)
└─ Establish on-call rotation and escalation procedures
2. INCIDENT MANAGEMENT
├─ Receive alert or user report
├─ Assess severity and impact (P1/P2/P3/P4)
├─ Engage appropriate responders
├─ Investigate and diagnose root cause
├─ Implement fix or workaround
├─ Communicate status to stakeholders
├─ Document resolution in knowledge base
└─ Conduct post-incident review
3. CHANGE MANAGEMENT
├─ Submit change request with impact assessment
├─ Review and approve through CAB (Change Advisory Board)
├─ Schedule change window
├─ Execute change with rollback plan ready
├─ Validate success criteria
├─ Document actual vs planned results
└─ Close change ticket
4. CAPACITY PLANNING
├─ Collect resource utilization trends
├─ Analyze growth patterns
├─ Forecast future requirements
├─ Plan procurement or provisioning
├─ Execute capacity additions
└─ Monitor effectiveness
5. AUTOMATION & OPTIMIZATION
├─ Identify repetitive manual tasks
├─ Document current process
├─ Design automated solution
├─ Implement and test automation
├─ Deploy to production
├─ Measure time/cost savings
└─ Iterate and improve
| Scenario | Alert Type | Threshold | Response Time | Escalation | |----------|-----------|-----------|---------------|------------| | Service completely down | Page | Immediate | < 5 min | Immediate to on-call | | Service degraded | Page | 2-3 failures | < 15 min | After 15 min to on-call | | High resource usage | Warning | > 80% sustained | < 1 hour | After 2 hours to team lead | | Approaching capacity | Info | > 70% trend | < 24 hours | Weekly capacity review | | Configuration drift | Ticket | Any deviation | < 7 days | Monthly review |
Priority 1 (Critical)
Priority 2 (High)
Priority 3 (Medium)
Priority 4 (Low)
Risk Level = Impact × Likelihood × Complexity
Impact (1-5):
1 = Single user
2 = Team
3 = Department
4 = Company-wide
5 = Customer-facing
Likelihood of Issues (1-5):
1 = Routine, tested
2 = Familiar, documented
3 = Some uncertainty
4 = New territory
5 = Never done before
Complexity (1-5):
1 = Single component
2 = Few components
3 = Multiple systems
4 = Cross-platform
5 = Enterprise-wide
Risk Score Interpretation:
1-20: Standard change (pre-approved)
21-50: Normal change (CAB review)
51-75: High-risk change (extensive testing, senior approval)
76-125: Emergency change only (executive approval)
| Requirement | Prometheus + Grafana | Datadog | New Relic | ELK Stack | Splunk | |-------------|---------------------|---------|-----------|-----------|---------| | Cost | Free (self-hosted) | $$$$ | $$$$ | Free-$$ | $$$$$ | | Metrics | Excellent | Excellent | Excellent | Good | Good | | Logs | Via Loki | Excellent | Excellent | Excellent | Excellent | | Traces | Via Tempo | Excellent | Excellent | Limited | Good | | Learning Curve | Steep | Moderate | Moderate | Steep | Steep | | Cloud-Native | Excellent | Excellent | Excellent | Good | Good | | On-Premises | Excellent | Good | Good | Excellent | Excellent | | APM | Via exporters | Excellent | Excellent | Limited | Good |
Problem: Too many false positive alerts causing team burnout
Solution:
Alert Tuning Process:
1. Measure baseline alert volume and false positive rate
2. Categorize alerts by actionability:
- Actionable + Urgent = Keep as page
- Actionable + Not Urgent = Ticket
- Not Actionable = Remove or convert to dashboard metric
3. Implement alert aggregation (group similar alerts)
4. Add context to alerts (runbook links, relevant metrics)
5. Regular review meetings (weekly) to tune thresholds
6. Track metrics:
- MTTA (Mean Time to Acknowledge): < 5 min target
- False Positive Rate: < 20% target
- Alert Volume per Week: Trending down
Problem: Teams skip documentation during high-pressure incidents
Solution:
Problem: Critical knowledge trapped in individual team members' heads
Solution:
Knowledge Transfer Strategy:
- Pair Programming/Shadowing: 20% of sprint capacity
- Runbook Requirements: Every system must have runbook
- Lunch & Learn Sessions: Weekly 30-min knowledge sharing
- Cross-Training Matrix: Track who knows what, identify gaps
- On-Call Rotation: Everyone rotates to spread knowledge
- Post-Incident Reviews: Mandatory team sharing
- Documentation Sprints: Quarterly focus on doc completion
Problem: Operations team resists change to maintain stability
Solution:
Availability:
Formula: (Total Time - Downtime) / Total Time × 100
Target: 99.9% (43.8 min/month downtime)
Measurement: Per service, monthly
MTTR (Mean Time to Recovery):
Formula: Sum of recovery times / Number of incidents
Target: < 30 minutes for P1, < 4 hours for P2
Measurement: Per severity level, monthly
MTBF (Mean Time Between Failures):
Formula: Total operational time / Number of failures
Target: > 720 hours (30 days)
Measurement: Per service, quarterly
MTTA (Mean Time to Acknowledge):
Formula: Sum of acknowledgment times / Number of alerts
Target: < 5 minutes for pages
Measurement: Per on-call engineer, weekly
Change Success Rate:
Formula: Successful changes / Total changes × 100
Target: > 95%
Measurement: Monthly
Incident Recurrence Rate:
Formula: Repeat incidents / Total incidents × 100
Target: < 10%
Measurement: Quarterly (same root cause within 90 days)
Toil Percentage:
Definition: Time spent on manual, repetitive tasks
Target: < 30% of team capacity
Measurement: Weekly time tracking
Automation Coverage:
Formula: Automated tasks / Total repetitive tasks × 100
Target: > 70%
Measurement: Quarterly audit
On-Call Load:
Formula: Alerts per on-call shift
Target: < 5 actionable alerts per shift
Measurement: Per engineer, weekly
Runbook Coverage:
Formula: Services with runbooks / Total services × 100
Target: 100%
Measurement: Monthly audit
Knowledge Base Utilization:
Formula: Incidents resolved via KB / Total incidents × 100
Target: > 40%
Measurement: Monthly
Post-Incident Review Template:
- Incident Summary (what happened, when, impact)
- Timeline of Events (detailed chronology)
- Root Cause Analysis (5 Whys or Fishbone)
- What Went Well (strengths during response)
- What Could Be Improved (opportunities)
- Action Items (with owners and due dates)
- Lessons Learned (shareable insights)
Rules:
- No blame or punishment
- Focus on systems and processes, not people
- Everyone can speak freely
- Action items must be tracked to completion
Runbook Contents:
- Service Overview: Purpose, dependencies, architecture
- SLIs/SLOs/SLAs: Defined thresholds and targets
- Common Issues: Symptoms, causes, solutions
- Troubleshooting Steps: Step-by-step procedures
- Escalation Paths: Who to contact and when
- Useful Commands: Copy-paste ready commands
- Dashboard Links: Direct links to relevant dashboards
- Recent Changes: Link to change log
- Contact Information: Team, product owner, SMEs
Maintenance:
- Review quarterly or after major incidents
- Test procedures during low-traffic periods
- Update after every significant change
- Track usage metrics (page views, helpfulness ratings)
On-Call Preparation:
- Laptop with VPN access
- Mobile device with notification apps
- Contact list (escalation paths)
- Access to all critical systems
- Runbooks bookmarked
- Backup on-call identified
During On-Call:
- Acknowledge alerts within 5 minutes
- Update incident status regularly
- Follow escalation procedures
- Document all actions in incident ticket
- Handoff clearly to next on-call
Post On-Call:
- Complete incident reports
- Submit toil reduction tickets
- Provide feedback on runbooks
- Update on-call documentation
Standard Change Process:
1. Create change request (RFC)
2. Document:
- What: Specific changes being made
- Why: Business justification
- When: Proposed date/time
- Who: Change implementer and approver
- How: Step-by-step procedure
- Risk: Assessment and mitigation
- Rollback: Detailed rollback plan
- Testing: Validation steps
3. Submit for CAB review (7 days advance notice)
4. Implement during approved window
5. Validate success criteria
6. Close change with actual results
7. Post-implementation review if issues occurred
Emergency Change Process:
- Executive approval required
- Implement with heightened monitoring
- Full team notification
- Complete documentation within 24 hours
- Mandatory post-change review
For detailed technical guidance, see:
development
Apple Human Interface Guidelines for content display components. Use this skill when the user asks about charts component, collection view, image view, web view, color well, image well, activity view, lockup, data visualization, content display, displaying images, rendering web content, color pickers, or presenting collections of items in Apple apps. Also use when the user says how should I display charts, what's the best way to show images, should I use a web view, how do I build a grid of items, what component shows media, or how do I present a share sheet. Cross-references: hig-foundations for color/typography/accessibility, hig-patterns for data visualization patterns, hig-components-layout for structural containers, hig-platforms for platform-specific component behavior.
tools
Automate HelpDesk tasks via Rube MCP (Composio): list tickets, manage views, use canned responses, and configure custom fields. Always search tools first for current schemas.
testing
Expert Haskell engineer specializing in advanced type systems, pure functional design, and high-reliability software. Use PROACTIVELY for type-level programming, concurrency, and architecture guidance.
tools
GraphQL gives clients exactly the data they need - no more, no less. One endpoint, typed schema, introspection. But the flexibility that makes it powerful also makes it dangerous. Without proper controls, clients can craft queries that bring down your server. This skill covers schema design, resolvers, DataLoader for N+1 prevention, federation for microservices, and client integration with Apollo/urql. Key insight: GraphQL is a contract. The schema is the API documentation. Design it carefully.