Asmayaseen/operating-production-services

Name: operating-production-services
Author: Asmayaseen

.claude/skills/operating-production-services/SKILL.md

npx skillsauth add Asmayaseen/hackathon-2 operating-production-services

Operating Production Services

Production reliability patterns: measure what matters, learn from failures, improve systematically.

Quick Reference

| Need | Go To |
|------|-------|
| Define reliability targets | SLOs & Error Budgets |
| Write incident report | Postmortem Templates |
| Set up SLO alerting | references/slo-alerting.md |

SLOs & Error Budgets

The Hierarchy

SLA (Contract) → SLO (Target) → SLI (Measurement)

Common SLIs

# Availability: successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))

# Latency: requests below threshold / total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
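The same two ratios can be computed outside Prometheus from raw request counts. The following is a minimal Python sketch, assuming the counts for the window are already available; the function names and sample numbers are illustrative and are not part of the skill.

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Fraction of requests that did not return a 5xx status."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as fully available
    return (total_requests - server_errors) / total_requests


def latency_sli(requests_under_threshold: int, total_requests: int) -> float:
    """Fraction of requests served under the latency threshold (e.g. 0.5 s)."""
    if total_requests == 0:
        return 1.0  # same assumption as above
    return requests_under_threshold / total_requests


# Hypothetical counts for one measurement window
print(f"availability: {availability_sli(1_000_000, 800):.4%}")
print(f"latency:      {latency_sli(985_000, 1_000_000):.2%}")
```

Both SLIs are "good events / total events" ratios, which is what makes them directly comparable to an SLO target.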

SLO Targets Reality Check

| SLO % | Downtime/Month | Downtime/Year |
|-------|----------------|---------------|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43 minutes | 8.76 hours |
| 99.95% | 22 minutes | 4.38 hours |
| 99.99% | 4.3 minutes | 52 minutes |

Don't aim for 100%. Each additional nine cuts the allowed downtime tenfold, and the cost of reaching it rises steeply.
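The downtime figures follow directly from the SLO percentage. A small Python sketch of the arithmetic, assuming a 30-day month and a 365-day year (the constant and function names are illustrative):

```python
MINUTES_PER_MONTH = 30 * 24 * 60   # assumption: 30-day month
MINUTES_PER_YEAR = 365 * 24 * 60   # assumption: 365-day year


def allowed_downtime_minutes(slo_percent: float, period_minutes: int) -> float:
    """Minutes of full downtime permitted by an SLO over the given period."""
    return (1 - slo_percent / 100) * period_minutes


for slo in (99.0, 99.9, 99.95, 99.99):
    month = allowed_downtime_minutes(slo, MINUTES_PER_MONTH)
    year = allowed_downtime_minutes(slo, MINUTES_PER_YEAR)
    print(f"{slo}%: {month:.1f} min/month, {year / 60:.2f} h/year")
```

With these assumptions, a 99.9% target works out to about 43.2 minutes per month and 8.76 hours per year.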

Error Budget

Error Budget = 1 - SLO Target

Example: 99.9% SLO = 0.1% error budget ≈ 43 minutes per 30-day month

Policy:

| Budget Remaining | Action |
|------------------|--------|
| > 50% | Normal velocity |
| 10-50% | Postpone risky changes |
| < 10% | Freeze non-critical changes |
| 0% | Feature freeze, fix reliability |
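The policy can be expressed as a lookup on the fraction of budget remaining. This is a sketch only: the function names are hypothetical, and the handling of the exact boundaries (10% and 50% both fall in the "postpone" band) is an assumption, since the policy does not spell them out.

```python
def budget_remaining(slo_target: float, observed_sli: float) -> float:
    """Fraction of the error budget still unspent (inputs are fractions, e.g. 0.999)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    budget = 1 - slo_target       # Error Budget = 1 - SLO Target
    consumed = 1 - observed_sli   # observed failure ratio
    return 1 - consumed / budget


def policy_action(remaining: float) -> str:
    """Map the remaining budget fraction to the action from the policy."""
    if remaining <= 0:
        return "Feature freeze, fix reliability"
    if remaining < 0.10:
        return "Freeze non-critical changes"
    if remaining <= 0.50:         # assumption: boundaries belong to this band
        return "Postpone risky changes"
    return "Normal velocity"


# Hypothetical service: 99.9% SLO, currently measuring 99.97%
remaining = budget_remaining(0.999, 0.9997)
print(f"{remaining:.0%} of budget left -> {policy_action(remaining)}")
```

A negative result from `budget_remaining` means the budget is overspent, which the sketch treats the same as 0%.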

See references/slo-alerting.md for Prometheus recording rules and multi-window burn rate alerts.
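For orientation, a burn rate is the observed error ratio divided by the error budget: a burn rate of 1 spends the budget exactly over the SLO window. The Python sketch below shows the idea behind a multi-window check. It is not taken from references/slo-alerting.md; the function names are hypothetical, and the 14.4 threshold is an assumed example (it corresponds to spending 2% of a 30-day budget in one hour).

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1 - slo_target)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Fire only if both windows exceed the threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening, so recovered incidents stop alerting.
    """
    return (burn_rate(long_window_error_ratio, slo_target) > threshold
            and burn_rate(short_window_error_ratio, slo_target) > threshold)


# Hypothetical readings against a 99.9% SLO
print(should_page(0.02, 0.03, 0.999))    # both windows burning fast
print(should_page(0.02, 0.0005, 0.999))  # short window has recovered
```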

Postmortem Templates

The Blameless Principle

| Blame-Focused | Blameless |
|---------------|-----------|
| "Who caused this?" | "What conditions allowed this?" |
| Punish individuals | Improve systems |
| Hide information | Share learnings |

When to Write Postmortems

- SEV1/SEV2 incidents
- Customer-facing outages > 15 minutes
- Data loss or security incidents
- Near-misses that could have been severe
- Novel failure modes
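The criteria above can be encoded as a single predicate, for example in incident tooling. A minimal Python sketch; the `Incident` fields are hypothetical and would need to match whatever your incident tracker records.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    severity: int                        # 1 = SEV1, 2 = SEV2, 3 = SEV3, ...
    customer_outage_minutes: float = 0   # 0 if not customer-facing
    data_loss: bool = False
    security_incident: bool = False
    severe_near_miss: bool = False
    novel_failure_mode: bool = False


def needs_postmortem(incident: Incident) -> bool:
    """True if any one of the listed criteria applies."""
    return (
        incident.severity <= 2
        or incident.customer_outage_minutes > 15
        or incident.data_loss
        or incident.security_incident
        or incident.severe_near_miss
        or incident.novel_failure_mode
    )


print(needs_postmortem(Incident(severity=3, customer_outage_minutes=20)))
print(needs_postmortem(Incident(severity=3, customer_outage_minutes=5)))
```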

Standard Template

# Postmortem: [Incident Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEVX

## Executive Summary
One paragraph: what happened, impact, root cause, resolution.

## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | First alert fired |
| HH:MM | On-call acknowledged |
| HH:MM | Root cause identified |
| HH:MM | Fix deployed |
| HH:MM | Service recovered |

## Root Cause Analysis

### 5 Whys
1. Why did service fail? → [Answer]
2. Why did [1] happen? → [Answer]
3. Why did [2] happen? → [Answer]
4. Why did [3] happen? → [Answer]
5. Why did [4] happen? → [Root cause]

## Impact
- Customers affected: X
- Duration: X minutes
- Revenue impact: $X
- Support tickets: X

## Action Items
| Priority | Action | Owner | Due | Ticket |
|----------|--------|-------|-----|--------|
| P0 | [Immediate fix] | @name | Date | XXX-123 |
| P1 | [Prevent recurrence] | @name | Date | XXX-124 |
| P2 | [Improve detection] | @name | Date | XXX-125 |

Quick Template (Minor Incidents)

# Quick Postmortem: [Title]

**Date**: YYYY-MM-DD | **Duration**: X min | **Severity**: SEV3

## What Happened
One sentence description.

## Timeline
- HH:MM - Trigger
- HH:MM - Detection
- HH:MM - Resolution

## Root Cause
One sentence.

## Fix
- Immediate: [What was done]
- Long-term: [Ticket XXX-123]

Postmortem Meeting Guide

Structure (60 min)

1. Opening (5 min) - Remind: "We're here to learn, not blame"
2. Timeline (15 min) - Walk through events chronologically
3. Analysis (20 min) - What failed? Why? What allowed it?
4. Action Items (15 min) - Prioritize, assign owners, set dates
5. Closing (5 min) - Summarize learnings, confirm owners

Facilitation Tips

- Redirect blame to systems: "What made this mistake possible?"
- Time-box tangents
- Document dissenting views
- Encourage quiet participants

Anti-Patterns

| Don't | Do Instead |
|-------|------------|
| Aim for 100% SLO | Accept error budget exists |
| Skip small incidents | Small incidents reveal patterns |
| Orphan action items | Every item needs owner + date + ticket |
| Blame individuals | Ask "what conditions allowed this?" |
| Create busywork actions | Actions should prevent recurrence |
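The "owner + date + ticket" rule for action items is easy to check mechanically. A minimal Python sketch, assuming action items are held in a simple record; the `ActionItem` type and function names are hypothetical and unrelated to scripts/verify.py.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionItem:
    priority: str                  # e.g. "P0"
    action: str
    owner: Optional[str] = None    # e.g. "@name"
    due: Optional[str] = None      # e.g. "YYYY-MM-DD"
    ticket: Optional[str] = None   # e.g. "XXX-123"


def missing_fields(item: ActionItem) -> List[str]:
    """Names of the required tracking fields that are empty or unset."""
    return [name for name in ("owner", "due", "ticket") if not getattr(item, name)]


def orphaned(items: List[ActionItem]) -> List[ActionItem]:
    """Action items lacking an owner, a due date, or a ticket."""
    return [item for item in items if missing_fields(item)]


items = [
    ActionItem("P0", "Immediate fix", owner="@name", due="YYYY-MM-DD", ticket="XXX-123"),
    ActionItem("P1", "Prevent recurrence", owner="@name"),
]
for item in orphaned(items):
    print(f"{item.priority} '{item.action}' is missing: {', '.join(missing_fields(item))}")
```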

Verification

Run: python scripts/verify.py

References

references/slo-alerting.md - Prometheus rules, burn rate alerts, Grafana dashboards

SRE patterns for production service reliability: SLOs, error budgets, postmortems, and incident response. Use when defining reliability targets, writing postmortems, implementing SLO alerting, or establishing on-call practices. NOT for initial service development (use scaffolding skills instead).

1 star

development

Updated Apr 17, 2026

npx skillsauth add Asmayaseen/hackathon-2 operating-production-services

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Reported % |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 17, 2026, 11:27 AM (12.3s, 3 files scanned)

Install manually

# Clone the repo
git clone https://github.com/Asmayaseen/hackathon-2.git

# Copy into Claude Code skills folder (global)
cp -r hackathon-2/.claude/skills/operating-production-services ~/.claude/skills/

Repository

Asmayaseen/hackathon-2

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT