skills/asmayaseen/operating-production-services/SKILL.md
SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).
npx skillsauth add aiskillstore/marketplace operating-production-servicesInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Production reliability patterns: measure what matters, learn from failures, improve systematically.
| Need | Go To | |------|-------| | Define reliability targets | SLOs & Error Budgets | | Write incident report | Postmortem Templates | | Set up SLO alerting | references/slo-alerting.md |
SLA (Contract) → SLO (Target) → SLI (Measurement)
# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
# Latency: requests below threshold / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
| SLO % | Downtime/Month | Downtime/Year | |-------|----------------|---------------| | 99% | 7.2 hours | 3.65 days | | 99.9% | 43 minutes | 8.76 hours | | 99.95% | 22 minutes | 4.38 hours | | 99.99% | 4.3 minutes | 52 minutes |
Don't aim for 100%. Each nine costs exponentially more.
Error Budget = 1 - SLO Target
Example: 99.9% SLO = 0.1% error budget = 43 minutes/month
Policy: | Budget Remaining | Action | |------------------|--------| | > 50% | Normal velocity | | 10-50% | Postpone risky changes | | < 10% | Freeze non-critical changes | | 0% | Feature freeze, fix reliability |
See references/slo-alerting.md for Prometheus recording rules and multi-window burn rate alerts.
| Blame-Focused | Blameless | |---------------|-----------| | "Who caused this?" | "What conditions allowed this?" | | Punish individuals | Improve systems | | Hide information | Share learnings |
# Postmortem: [Incident Title]
**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEVX
## Executive Summary
One paragraph: what happened, impact, root cause, resolution.
## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |
## Root Cause Analysis
### 5 Whys
1. Why did service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]
## Impact
- Customers affected: X
- Duration: X minutes
- Revenue impact: $X
- Support tickets: X
## Action Items
| Priority | Action | Owner | Due | Ticket |
|----------|--------|-------|-----|--------|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |
# Quick Postmortem: [Title]
**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEV3
## What Happened
One sentence description.
## Timeline
- HH:MM - Trigger
- HH:MM - Detection
- HH:MM - Resolution
## Root Cause
One sentence.
## Fix
- Immediate: [What was done]
- Long-term: [Ticket XXX-123]
| Don't | Do Instead | |-------|------------| | Aim for 100% SLO | Accept error budget exists | | Skip small incidents | Small incidents reveal patterns | | Orphan action items | Every item needs owner + date + ticket | | Blame individuals | Ask "what conditions allowed this?" | | Create busywork actions | Actions should prevent recurrence |
Run: python scripts/verify.py
development
Apple Human Interface Guidelines for content display components. Use this skill when the user asks about charts component, collection view, image view, web view, color well, image well, activity view, lockup, data visualization, content display, displaying images, rendering web content, color pickers, or presenting collections of items in Apple apps. Also use when the user says how should I display charts, what's the best way to show images, should I use a web view, how do I build a grid of items, what component shows media, or how do I present a share sheet. Cross-references: hig-foundations for color/typography/accessibility, hig-patterns for data visualization patterns, hig-components-layout for structural containers, hig-platforms for platform-specific component behavior.
tools
Automate HelpDesk tasks via Rube MCP (Composio): list tickets, manage views, use canned responses, and configure custom fields. Always search tools first for current schemas.
testing
Expert Haskell engineer specializing in advanced type systems, pure functional design, and high-reliability software. Use PROACTIVELY for type-level programming, concurrency, and architecture guidance.
tools
GraphQL gives clients exactly the data they need - no more, no less. One endpoint, typed schema, introspection. But the flexibility that makes it powerful also makes it dangerous. Without proper controls, clients can craft queries that bring down your server. This skill covers schema design, resolvers, DataLoader for N+1 prevention, federation for microservices, and client integration with Apollo/urql. Key insight: GraphQL is a contract. The schema is the API documentation. Design it carefully.