Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

aiskillstore/operating-production-services

Name: operating-production-services
Author: aiskillstore

skills/asmayaseen/operating-production-services/SKILL.md

npx skillsauth add aiskillstore/marketplace operating-production-services


Operating Production Services

Production reliability patterns: measure what matters, learn from failures, improve systematically.

Quick Reference

| Need | Go To |
|------|-------|
| Define reliability targets | SLOs & Error Budgets |
| Write incident report | Postmortem Templates |
| Set up SLO alerting | references/slo-alerting.md |

SLOs & Error Budgets

The Hierarchy

SLA (Contract) → SLO (Target) → SLI (Measurement)

Common SLIs

# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

# Latency: requests below threshold / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
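Both queries above reduce to the same shape: good events divided by total events over the SLO window. A minimal Python sketch of that ratio follows. The function name `ratio_sli`, the sample counts, and the choice to treat an empty window as fully healthy are illustrative assumptions, not part of the skill.

```python
def ratio_sli(good_events: int, total_events: int) -> float:
    """Return the fraction of good events in the window.

    Assumption: a window with no traffic counts as fully healthy (1.0).
    """
    if good_events < 0 or total_events < 0 or good_events > total_events:
        raise ValueError("expected 0 <= good_events <= total_events")
    if total_events == 0:
        return 1.0
    return good_events / total_events


# Hypothetical 28-day counts, for illustration only.
availability = ratio_sli(good_events=999_250, total_events=1_000_000)
latency = ratio_sli(good_events=985_000, total_events=1_000_000)

print(f"availability SLI: {availability:.4%}")
print(f"latency SLI (<= 0.5 s): {latency:.4%}")
```

Comparing either value against the SLO target shows whether the service is within its objective for the window.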

SLO Targets Reality Check

| SLO % | Downtime/Month | Downtime/Year |
|-------|----------------|---------------|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |

Don't aim for 100%. Each nine costs exponentially more.
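The downtime figures above can be reproduced with a few lines of arithmetic. This sketch assumes a 30-day month and a 365-day year, which is what the table's values imply. The helper name `allowed_downtime` is hypothetical.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Downtime a service may accumulate in `window` while still meeting the SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return window * (1 - slo_percent / 100)


MONTH = timedelta(days=30)   # assumption: 30-day month
YEAR = timedelta(days=365)   # assumption: non-leap year

for slo in (99.0, 99.9, 99.95, 99.99):
    per_month = allowed_downtime(slo, MONTH).total_seconds() / 60
    per_year = allowed_downtime(slo, YEAR).total_seconds() / 3600
    print(f"{slo}% -> {per_month:.1f} min/month, {per_year:.2f} h/year")
```

Each additional nine divides the allowance by ten, so the room for planned maintenance and incidents shrinks quickly.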

Error Budget

Error Budget = 1 - SLO Target

Example: 99.9% SLO = 0.1% error budget = 43 minutes/month

Policy:

| Budget Remaining | Action |
|------------------|--------|
| > 50% | Normal velocity |
| 10-50% | Postpone risky changes |
| < 10% | Freeze non-critical changes |
| 0% | Feature freeze, fix reliability |

See references/slo-alerting.md for Prometheus recording rules and multi-window burn rate alerts.
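As a rough illustration of how the error budget policy could be applied in tooling, the sketch below computes the unspent share of the budget from request counts and maps it to the policy actions. The function names and sample counts are hypothetical. Treating exactly 10% and exactly 50% as part of the "10-50%" band, and an overspent budget as 0%, are assumptions about boundaries the policy leaves open.

```python
def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0 or not 0 <= good <= total:
        raise ValueError("expected total > 0 and 0 <= good <= total")
    allowed_bad = (1 - slo_target) * total   # Error Budget = 1 - SLO Target
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


def policy_action(remaining: float) -> str:
    """Map the remaining budget fraction to the policy action."""
    if remaining <= 0:
        return "Feature freeze, fix reliability"
    if remaining < 0.10:
        return "Freeze non-critical changes"
    if remaining <= 0.50:
        return "Postpone risky changes"
    return "Normal velocity"


# Hypothetical window: 99.9% SLO, 600 failed requests out of 1,000,000.
remaining = budget_remaining(0.999, good=999_400, total=1_000_000)
print(f"{remaining:.0%} of budget left -> {policy_action(remaining)}")
```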

Postmortem Templates

The Blameless Principle

| Blame-Focused | Blameless |
|---------------|-----------|
| "Who caused this?" | "What conditions allowed this?" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |

When to Write Postmortems

- SEV1/SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes

Standard Template

# Postmortem: [Incident Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEVX

## Executive Summary
One paragraph: what happened, impact, root cause, resolution.

## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |

## Root Cause Analysis

### 5 Whys
1. Why did service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]

## Impact
- Customers affected: X
- Duration: X minutes
- Revenue impact: $X
- Support tickets: X

## Action Items
| Priority | Action | Owner | Due | Ticket |
|----------|--------|-------|-----|--------|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |

Quick Template (Minor Incidents)

# Quick Postmortem: [Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEV3

## What Happened
One sentence description.

## Timeline
- HH:MM - Trigger
- HH:MM - Detection
- HH:MM - Resolution

## Root Cause
One sentence.

## Fix
- Immediate: [What was done]
- Long-term: [Ticket XXX-123]

Postmortem Meeting Guide

Structure (60 min)

- Opening (5 min) - Remind: "We're here to learn, not blame"
- Timeline (15 min) - Walk through events chronologically
- Analysis (20 min) - What failed? Why? What allowed it?
- Action Items (15 min) - Prioritize, assign owners, set dates
- Closing (5 min) - Summarize learnings, confirm owners

Facilitation Tips

- Redirect blame to systems: "What made this mistake possible?"
- Time-box tangents
- Document dissenting views
- Encourage quiet participants

Anti-Patterns

| Don't | Do Instead |
|-------|------------|
| Aim for 100% SLO | Accept error budget exists |
| Skip small incidents | Small incidents reveal patterns |
| Orphan action items | Every item needs owner + date + ticket |
| Blame individuals | Ask "what conditions allowed this?" |
| Create busywork actions | Actions should prevent recurrence |
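The "owner + date + ticket" rule is easy to check mechanically. The sketch below is one hypothetical way to flag orphaned action items. The `ActionItem` class and `find_orphans` helper are illustrative names only and are not taken from the skill or its scripts.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    priority: str
    action: str
    owner: str = ""
    due: str = ""
    ticket: str = ""


REQUIRED = ("owner", "due", "ticket")


def find_orphans(items: list[ActionItem]) -> dict[str, list[str]]:
    """Return {action: [missing fields]} for items lacking owner, due date, or ticket."""
    orphans: dict[str, list[str]] = {}
    for item in items:
        missing = [name for name in REQUIRED if not getattr(item, name).strip()]
        if missing:
            orphans[item.action] = missing
    return orphans


# Hypothetical action items from a postmortem.
items = [
    ActionItem("P0", "Roll back bad config", "@name", "2026-04-01", "XXX-123"),
    ActionItem("P1", "Add config validation", owner="@name"),
]
print(find_orphans(items))
```

A check like this could run before a postmortem is marked complete, so no action item leaves the meeting without an owner, a date, and a ticket.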

Verification

Run: python scripts/verify.py

References

references/slo-alerting.md - Prometheus rules, burn rate alerts, Grafana dashboards


SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).

230 stars

development

Updated Mar 28, 2026

npx skillsauth add aiskillstore/marketplace operating-production-services

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Mar 30, 2026, 9:34 PM (55.2s, 4 files scanned)




Install manually


# Clone the repo
git clone https://github.com/aiskillstore/marketplace.git

# Copy into Claude Code skills folder (global)
cp -r marketplace/skills/asmayaseen/operating-production-services ~/.claude/skills/


Repository: aiskillstore/marketplace

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT