dsivov/production-monitoring

Name: production-monitoring
Author: dsivov

.claude/skills/production-monitoring/SKILL.md

npx skillsauth add dsivov/ai_development_team production-monitoring

Production Monitoring Expertise

When This Applies

Apply this guidance when:

Assessing whether a release is production-ready
Setting up or reviewing health checks
Evaluating system stability after a deployment
Defining monitoring and alerting requirements
Responding to production incidents

Health Check Design

Application Health Endpoint

Every service should expose a health endpoint:

GET /health
Response:
{
  "status": "healthy" | "degraded" | "unhealthy",
  "version": "1.2.3",
  "timestamp": "2026-02-28T10:00:00Z",
  "checks": {
    "database": "healthy",
    "cache": "healthy",
    "external_api": "degraded"
  }
}
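A minimal sketch of how such a payload might be assembled from per-dependency results. The "worst status wins" aggregation rule, the function names `build_health_response` and `http_status_for`, and the 200/503 mapping are assumptions for illustration, not something the skill prescribes.

```python
from datetime import datetime, timezone

# Assumed ordering: the worst individual check decides the overall status.
_SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}


def build_health_response(checks, version, now=None):
    """Build a /health payload from a mapping of dependency name -> status."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # An unknown status string raises KeyError, which surfaces typos early.
    overall = max(checks.values(), key=_SEVERITY.__getitem__, default="healthy")
    return {
        "status": overall,
        "version": version,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "checks": dict(checks),
    }


def http_status_for(payload):
    """Hypothetical mapping: only an unhealthy service answers 503."""
    return 503 if payload["status"] == "unhealthy" else 200


example = build_health_response(
    {"database": "healthy", "cache": "healthy", "external_api": "degraded"},
    version="1.2.3",
    now=datetime(2026, 2, 28, 10, 0, 0, tzinfo=timezone.utc),
)
print(example["status"], http_status_for(example))
```

Returning 200 for a degraded service keeps it in rotation while still exposing the problem in the body; a stricter deployment might choose differently.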

Health Check Levels

| Level | What to Check | Frequency |
|-------|--------------|-----------|
| Liveness | Process is running | Every 10s |
| Readiness | Can handle requests | Every 30s |
| Deep health | All dependencies ok | Every 60s |
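The three frequencies can be read as a simple schedule. The sketch below assumes a one-second tick and uses hypothetical names (`PROBE_INTERVALS`, `probes_due`); in practice an orchestrator or monitoring agent usually owns this timing.

```python
# Intervals in seconds, taken from the table above.
PROBE_INTERVALS = {"liveness": 10, "readiness": 30, "deep_health": 60}


def probes_due(elapsed_seconds):
    """Return the probe levels that should run at this second of uptime."""
    return [
        name
        for name, every in PROBE_INTERVALS.items()
        if elapsed_seconds % every == 0
    ]


for second in (10, 30, 60):
    print(second, probes_due(second))
```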

Key Metrics to Monitor

The Four Golden Signals

Latency — Response time distribution (p50, p95, p99)
Traffic — Request rate (requests/second)
Errors — Error rate (5xx responses / total)
Saturation — Resource utilization (CPU, memory, disk, connections)
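As a rough illustration, the four signals can be derived from a window of request samples. The input shape (a list of `(latency_ms, http_status)` tuples), the nearest-rank percentile method, and the function names are assumptions; a real system would normally get these numbers from its metrics backend.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_seconds, saturation):
    """Summarize a non-empty window of (latency_ms, http_status) samples."""
    latencies = [latency for latency, _ in requests]
    errors = sum(1 for _, status in requests if 500 <= status <= 599)
    return {
        "latency_ms": {p: percentile(latencies, p) for p in (50, 95, 99)},
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": saturation,  # passed through, e.g. {"cpu": 0.62}
    }


samples = [(100, 200)] * 18 + [(900, 500), (1200, 503)]
print(golden_signals(samples, window_seconds=10, saturation={"cpu": 0.62}))
```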

Service-Level Indicators (SLIs)

| SLI | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.9% | Successful responses / total |
| Latency | p95 < 200ms | Response time percentiles |
| Error rate | < 0.1% | Error responses / total |
| Throughput | > N req/s | Requests per second |
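To make the targets concrete: a 99.9% availability target leaves roughly 43 minutes of downtime per 30 days. The sketch below works that out and checks a set of measurements against the table. The function names and the 30-day period are assumptions, and `min_rps` stands in for the unspecified N.

```python
def error_budget_minutes(availability_target, period_days=30):
    """Downtime allowed per period for a target such as 0.999."""
    return (1 - availability_target) * period_days * 24 * 60


def evaluate_slis(total, successful, errors, p95_ms, observed_rps, min_rps):
    """Return pass/fail per SLI using the targets from the table above."""
    return {
        "availability": successful / total >= 0.999,
        "latency": p95_ms < 200,
        "error_rate": errors / total < 0.001,
        "throughput": observed_rps > min_rps,
    }


print(round(error_budget_minutes(0.999), 1))  # about 43.2 minutes per 30 days
print(evaluate_slis(100_000, 99_950, 50, p95_ms=180, observed_rps=120, min_rps=100))
```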

Post-Deployment Verification

After merging to main and deploying:

Immediate (0-5 minutes)

[ ] Health endpoint returns healthy
[ ] No error spike in logs
[ ] Response times are within normal range
[ ] Key user flows work (manual smoke test)

Short-term (5-30 minutes)

[ ] Error rate is stable or improving
[ ] No memory leaks (memory usage stable)
[ ] No connection pool exhaustion
[ ] No increase in support tickets or alerts

Medium-term (30 min - 2 hours)

[ ] Performance metrics are within historical norms
[ ] Background jobs are processing normally
[ ] No data integrity issues reported
[ ] All scheduled tasks ran successfully
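Two of the short-term checks lend themselves to simple automation. The sketch below is one possible reading of "stable": the 10% relative tolerance on error rate and the 5% memory growth limit are assumed thresholds, and both function names are hypothetical.

```python
def error_rate_stable(baseline_rate, current_rate, tolerance=0.10):
    """True when the post-deploy error rate is within tolerance of baseline."""
    return current_rate <= baseline_rate * (1 + tolerance)


def memory_stable(samples_mb, max_growth_ratio=0.05):
    """True when memory grew by at most max_growth_ratio from first to last sample."""
    first, last = samples_mb[0], samples_mb[-1]
    return (last - first) / first <= max_growth_ratio


print(error_rate_stable(0.002, 0.0021))  # small increase, within tolerance
print(memory_stable([512, 560, 610]))    # steady growth, likely a leak
```

Comparing only the first and last sample is deliberately crude; a longer window or a fitted trend would be less sensitive to a single spike.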

Alerting Guidelines

Alert Severity

| Severity | Condition | Response |
|----------|-----------|----------|
| Critical | Service down, data loss risk | Page on-call, immediate action |
| Warning | Degraded performance, elevated errors | Investigate within 30 min |
| Info | Anomaly detected, approaching threshold | Review during business hours |
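The severity table maps naturally onto a small lookup. In this sketch the boolean condition flags and the fallback to Info for anything else are assumptions; the response strings are copied from the table.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Response text taken from the severity table above.
RESPONSES = {
    Severity.CRITICAL: "Page on-call, immediate action",
    Severity.WARNING: "Investigate within 30 min",
    Severity.INFO: "Review during business hours",
}


def classify(service_down=False, data_loss_risk=False,
             degraded_performance=False, elevated_errors=False):
    """Pick the highest severity whose condition holds; default to Info."""
    if service_down or data_loss_risk:
        return Severity.CRITICAL
    if degraded_performance or elevated_errors:
        return Severity.WARNING
    return Severity.INFO


print(RESPONSES[classify(elevated_errors=True)])
```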

Alert Design Rules

Actionable — Every alert should have a clear response action
No noise — If an alert fires > 5 times/day without action needed, fix or remove it
Contextual — Include what's wrong, what threshold was breached, and where to look
Escalation — If not acknowledged within N minutes, escalate to next level
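The "No noise" rule above can be checked mechanically if alert firings are logged together with whether anyone had to act. The `AlertEvent` record and `noisy_alerts` function below are hypothetical; only the threshold of 5 per day comes from the rule itself.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AlertEvent:
    name: str
    day: str            # ISO date of the firing, e.g. "2026-02-28"
    action_taken: bool  # whether a responder actually had to do something


def noisy_alerts(events, max_unactioned_per_day=5):
    """Names of alerts that fired more than the limit in a day without action."""
    counts = Counter((e.name, e.day) for e in events if not e.action_taken)
    return sorted({name for (name, _day), n in counts.items()
                   if n > max_unactioned_per_day})


events = (
    [AlertEvent("disk_usage_warn", "2026-02-28", False)] * 6
    + [AlertEvent("cpu_high", "2026-02-28", False)] * 5
)
print(noisy_alerts(events))
```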

Incident Documentation

After any production incident, document in reports/INCIDENT_<YYYYMMDD>_<NNN>.md:

# Incident Report — <date>

## Summary
<One-line description of what happened>

## Timeline
- HH:MM — Issue detected
- HH:MM — Investigation started
- HH:MM — Root cause identified
- HH:MM — Fix applied
- HH:MM — Verified resolved

## Impact
- Duration: X minutes
- Users affected: N
- Data impact: None / Describe

## Root Cause
<What caused the issue>

## Resolution
<How it was fixed>

## Prevention
<What changes will prevent recurrence>
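A small helper can keep report names consistent with the `reports/INCIDENT_<YYYYMMDD>_<NNN>.md` pattern. Treating `<NNN>` as a three-digit, zero-padded per-day counter is an assumption, as are the function names.

```python
from datetime import date
from pathlib import Path


def incident_report_path(day, sequence, reports_dir="reports"):
    """Build the report path; NNN is assumed to be a zero-padded counter."""
    return Path(reports_dir) / f"INCIDENT_{day:%Y%m%d}_{sequence:03d}.md"


def next_sequence(existing_names, day):
    """Next free counter for the given day, based on existing file names."""
    prefix = f"INCIDENT_{day:%Y%m%d}_"
    used = [
        int(name[len(prefix):-3])
        for name in existing_names
        if name.startswith(prefix)
        and name.endswith(".md")
        and name[len(prefix):-3].isdigit()
    ]
    return max(used, default=0) + 1


print(incident_report_path(date(2026, 2, 28), 1).as_posix())
```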

Use when the Production Engineer is assessing production health, setting up monitoring, defining health checks, evaluating SLA compliance, or responding to incidents. Activates when discussing production readiness, system health, alerting, or incident response.

Category: testing

Updated Apr 16, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add dsivov/ai_development_team production-monitoring

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 16, 2026, 2:51 PM · 4.4s · 1 file scanned

SKILL.md

name: production-monitoring
description: Use when the Production Engineer is assessing production health, setting up monitoring, defining health checks, evaluating SLA compliance, or responding to incidents. Activates when discussing production readiness, system health, alerting, or incident response.
version: 1.0.0

Related Skills

dsivov/test-engineering
Category: development
Use when the Integrator is writing unit tests, e2e tests, designing test strategies, improving test coverage, creating test fixtures, or mocking dependencies. Activates for any testing-related work including TDD, test refactoring, or test debugging.
SKILL.md · Updated Apr 16, 2026

dsivov/task-decomposition
Category: development
Use when the Architect is breaking down change requests into implementable tasks, defining acceptance criteria, estimating task size, mapping dependencies, or creating technical sub-tasks for Developer and Integrator.
SKILL.md · Updated Apr 16, 2026

dsivov/system-design
Category: development
Use when the Architect is designing system architecture, choosing technology stacks, defining data models, designing APIs, making scalability decisions, or updating ARCHITECTURE.md. Activates for any architecture design, technology evaluation, or system structure discussion.
SKILL.md · Updated Apr 16, 2026

dsivov/stakeholder-communication
Category: documentation
Use when the Manager is writing status updates, daily reports, queue messages to team members, escalation notices, or cross-role coordination messages. Activates when composing any team communication, reports, or documentation updates.
SKILL.md · Updated Apr 16, 2026

Install manually

# Clone the repo
git clone https://github.com/dsivov/ai_development_team.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r ai_development_team/.claude/skills/production-monitoring ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

dsivov/ai_development_team

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT