
thesoftwarehouse/tsh-implementing-observability

Name: tsh-implementing-observability
Author: thesoftwarehouse

.github/skills/tsh-implementing-observability/SKILL.md

npx skillsauth add thesoftwarehouse/copilot-collections tsh-implementing-observability


Observability Patterns

When to Use

Setting up monitoring and alerting for applications
Implementing centralized logging
Adding distributed tracing to microservices
Designing SLOs/SLIs and error budgets
Creating dashboards and runbooks

Three Pillars of Observability

| Pillar | Purpose | Tools |
|--------|---------|-------|
| Metrics | Quantitative measurements over time | Prometheus, CloudWatch, Datadog, Grafana |
| Logs | Discrete events with context | ELK, Loki, CloudWatch Logs, Splunk |
| Traces | Request flow across services | Jaeger, Zipkin, X-Ray, Tempo |

Stack Detection

Check which observability stack the project uses:

prometheus.yml or ServiceMonitor → Prometheus
fluent-bit.conf or fluentd.conf → Fluent Bit/Fluentd
otel-collector-config.yaml → OpenTelemetry
AWS infrastructure code with aws_cloudwatch_* resources (Terraform naming) → CloudWatch
datadog-agent or DD_* env vars → Datadog

Use context7 to look up stack-specific configuration syntax.
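
The detection rules above can be sketched as a small scanner. This is a minimal illustration, not part of the skill itself: the function name `detect_stacks`, the choice to look for `aws_cloudwatch_` only in `.tf` files, and the search for `datadog-agent` inside YAML files are assumptions made for the example.

```python
import os
from pathlib import Path

# Marker file names mapped to the stack they suggest (taken from the list above).
FILE_MARKERS = {
    "prometheus.yml": "Prometheus",
    "fluent-bit.conf": "Fluent Bit/Fluentd",
    "fluentd.conf": "Fluent Bit/Fluentd",
    "otel-collector-config.yaml": "OpenTelemetry",
}


def detect_stacks(root, env=None):
    """Return the sorted names of observability stacks hinted at under root."""
    env = os.environ if env is None else env
    found = set()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        if path.name in FILE_MARKERS:
            found.add(FILE_MARKERS[path.name])
        elif path.suffix in {".yaml", ".yml"}:
            text = path.read_text(errors="ignore")
            # ServiceMonitor is a Prometheus Operator resource kind.
            if "kind: ServiceMonitor" in text:
                found.add("Prometheus")
            if "datadog-agent" in text:
                found.add("Datadog")
        elif path.suffix == ".tf":
            # Assumption: CloudWatch resources are declared in Terraform files.
            if "aws_cloudwatch_" in path.read_text(errors="ignore"):
                found.add("CloudWatch")
    # Datadog agents and libraries are commonly configured through DD_* variables.
    if any(key.startswith("DD_") for key in env):
        found.add("Datadog")
    return sorted(found)
```

Calling `detect_stacks(".")` from a repository root returns a list such as `["CloudWatch", "Prometheus"]` when both kinds of marker are present. A result with several entries usually means the project mixes stacks, which is worth confirming before adding new configuration.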

Solution Decision Matrix

Metrics Stack

| Scenario | Recommended Solution |
|----------|---------------------|
| Kubernetes-native, cost-sensitive | Prometheus + Grafana |
| AWS-native, simple setup | CloudWatch Metrics |
| Multi-cloud, enterprise | Datadog or New Relic |
| OpenTelemetry-first | Prometheus with OTLP receiver |

Logging Stack

| Scenario | Recommended Solution |
|----------|---------------------|
| Kubernetes, cost-sensitive | Loki + Grafana |
| AWS-native | CloudWatch Logs |
| High volume, complex queries | Elasticsearch (ELK) |
| Multi-cloud, managed | Datadog Logs or Splunk |

Tracing Stack

| Scenario | Recommended Solution |
|----------|---------------------|
| Kubernetes, open-source | Jaeger or Tempo |
| AWS-native | X-Ray |
| Multi-cloud, correlated | Datadog APM |
| Vendor-agnostic | OpenTelemetry → any backend |

Kubernetes Observability Pattern

┌─────────────────────────────────────────────────────┐
│                   Applications                      │
│  (instrumented with OpenTelemetry SDK or auto-inst) │
└──────────────────────┬──────────────────────────────┘
                       │ OTLP
                       ▼
┌─────────────────────────────────────────────────────┐
│            OpenTelemetry Collector                  │
│  (receives, processes, exports telemetry)           │
└───────┬─────────────────┬─────────────────┬─────────┘
        │                 │                 │
        ▼                 ▼                 ▼
   Prometheus          Loki             Tempo/Jaeger
   (metrics)          (logs)            (traces)
        │                 │                 │
        └────────────────┬┴─────────────────┘
                         ▼
                      Grafana
                   (visualization)

SLO/SLI Framework

Key Metrics (RED Method for Services)

| Metric | Description | Example SLI |
|--------|-------------|-------------|
| Rate | Requests per second | rate(http_requests_total[5m]) |
| Errors | Failed requests | rate(http_requests_total{status=~"5.."}[5m]) |
| Duration | Latency distribution | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) |
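
To make the three RED signals concrete, the sketch below computes them from raw request records for a single window. It is an illustration only: the `Request` type and `red_summary` function are hypothetical names, and the percentile uses the nearest-rank method on raw samples, whereas Prometheus's `histogram_quantile` estimates the value from histogram buckets.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # time taken to serve the request, in seconds


def red_summary(requests, window_s):
    """Compute Rate, Errors, and Duration (p99) for one observation window."""
    total = len(requests)
    if total == 0:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p99_s": 0.0}
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank p99: ceil(0.99 * total), computed with integers to avoid float rounding.
    rank = (99 * total + 99) // 100
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total,
        "p99_s": durations[rank - 1],
    }
```

With 98 fast successful requests and 2 slow failures in a 50-second window, the summary reports 2 requests per second, a 2% error ratio, and a p99 equal to the slow requests' duration, which shows how a small share of slow requests dominates the tail.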

Key Metrics (USE Method for Resources)

| Metric | Description | Example |
|--------|-------------|---------|
| Utilization | % time resource is busy | CPU usage, memory usage |
| Saturation | Queue depth, waiting | Pod pending, connection pool |
| Errors | Error count | OOM kills, disk errors |

SLO Definition Template

# Example: API availability SLO
slo:
  name: api-availability
  description: "API returns successful responses"
  sli:
    metric: |
      sum(rate(http_requests_total{status!~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
  target: 99.9%
  window: 30d
  error_budget: 0.1%  # ~43 minutes/month downtime allowed
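
The arithmetic behind the error budget comment can be checked with a short sketch. The function names here are hypothetical and the model is deliberately simple: it treats the budget as either minutes of full downtime or a count of failed requests.

```python
def error_budget_minutes(target_percent, window_days):
    """Minutes of full downtime allowed by an availability target over a window."""
    budget_fraction = (100.0 - target_percent) / 100.0
    return budget_fraction * window_days * 24 * 60


def budget_remaining(target_percent, good_events, total_events):
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_bad = (100.0 - target_percent) / 100.0 * total_events
    if allowed_bad == 0:
        raise ValueError("a 100% target or an empty window leaves no error budget")
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad
```

For a 99.9% target over 30 days, `error_budget_minutes(99.9, 30)` gives roughly 43.2 minutes, matching the comment in the template. `budget_remaining(99.9, 999_500, 1_000_000)` gives roughly 0.5: 500 failures out of an allowance of 1,000 means half the budget is spent.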

Alerting Strategy

Alert Severity Levels

| Severity | Response | Example |
|----------|----------|---------|
| Critical | Page on-call immediately | Service down, data loss risk |
| Warning | Investigate within hours | Error rate elevated, disk 80% |
| Info | Review during business hours | Deployment completed, scaling event |

Alert Quality Rules

Actionable: Every alert must have a clear response action
Relevant: Alert on symptoms (user impact), not causes
Unique: Avoid duplicate alerts for same incident
Timely: Alert early enough to prevent impact

Alert Template (Prometheus)

groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "https://runbooks.example.com/high-error-rate"
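
The effect of the `for: 5m` clause in the rule above can be sketched as a small state machine: the alert is pending while the expression holds and fires only once it has held continuously for the full duration. This is a simplified model with a hypothetical function name; a real Prometheus server evaluates rules on its own interval and has further options not modeled here.

```python
def alert_states(samples, threshold=0.01, for_seconds=300):
    """Return (timestamp, state) for each (timestamp, error_ratio) evaluation.

    States follow the Prometheus alert lifecycle: inactive, pending, firing.
    """
    pending_since = None
    states = []
    for ts, ratio in samples:
        if ratio > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            held_for = ts - pending_since
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            # Any evaluation below the threshold resets the timer.
            pending_since = None
            state = "inactive"
        states.append((ts, state))
    return states
```

A brief dip below the threshold resets the timer, so a flapping error rate stays pending and never pages anyone. That is the trade-off of a longer `for` value: fewer noisy alerts, at the cost of later detection.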

Structured Logging

Log Format (JSON)

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-api",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "user-789",
  "error": {
    "type": "PaymentGatewayError",
    "message": "Connection timeout"
  },
  "context": {
    "payment_id": "pay-123",
    "amount": 99.99
  }
}

Required Log Fields

| Field | Purpose | Correlation |
|-------|---------|-------------|
| timestamp | When event occurred | Time-based queries |
| level | Severity (debug/info/warn/error) | Filtering |
| service | Source service name | Service filtering |
| trace_id | Distributed trace identifier | Cross-service correlation |
| message | Human-readable description | Search |
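
One way to emit the required fields is a custom formatter on Python's standard `logging` module. This is a sketch under assumptions: `JsonFormatter` and `build_logger` are hypothetical names, the trace ID is passed per call through `extra`, and Python's `WARNING` level is mapped to `warn` to match the level names listed above.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        level = record.levelname.lower()
        if level == "warning":
            level = "warn"
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": level,
            "message": record.getMessage(),
            "service": self.service,
            # trace_id arrives through logging's `extra` argument; None if absent.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


def build_logger(service, stream):
    """Create a logger for one service that writes JSON lines to stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.propagate = False
    return logger
```

A call such as `build_logger("payment-api", sys.stderr).error("Payment processing failed", extra={"trace_id": "abc123"})` writes a single JSON line carrying all five required fields. In an instrumented service the trace ID would normally come from the active span rather than being passed by hand.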

Process

Discover context → Check existing observability setup (Prometheus, CloudWatch, etc.)
Choose stack → Use decision matrix based on environment and requirements
Instrument apps → Add OpenTelemetry SDK or auto-instrumentation
Configure collection → Set up collectors, exporters, and storage
Define SLOs → Establish SLIs, targets, and error budgets
Create alerts → Implement actionable alerts with runbooks
Build dashboards → Create service and infrastructure dashboards
Document runbooks → Write response procedures for each alert

Checklist

[ ] All services emit metrics, logs, and traces
[ ] Trace IDs propagated across service boundaries
[ ] Structured logging with consistent format (JSON)
[ ] SLOs defined with error budgets
[ ] Alerts are actionable with runbook links
[ ] Dashboards show service health at a glance
[ ] Log retention policy configured
[ ] PII/sensitive data excluded from logs
[ ] On-call rotation defined for critical alerts

Anti-Patterns

| Don't | Do |
|-------|-----|
| Alert on every metric threshold | Alert on user-impacting symptoms |
| Log everything at DEBUG in production | Use appropriate log levels |
| Unstructured log messages | Structured JSON logging |
| Missing trace context | Propagate trace IDs across services |
| Dashboards with 50+ panels | Focused dashboards per service/domain |
| Alerts without runbooks | Every alert links to response procedure |
| Store logs indefinitely | Define retention based on compliance needs |

Related Skills

tsh-implementing-kubernetes - For K8s-native observability setup
tsh-implementing-ci-cd - For pipeline observability integration
tsh-managing-secrets - For secure credential storage for observability tools


Observability patterns for logging, monitoring, alerting, and distributed tracing. Use when implementing metrics collection, log aggregation, alerting rules, or distributed tracing across services.

Stars: 206
Category: testing
Updated: Apr 15, 2026


npx skillsauth add thesoftwarehouse/copilot-collections tsh-implementing-observability

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 15, 2026, 11:58 PM (8.8s, 1 file scanned)


Related Skills

thesoftwarehouse/tsh-writing-hooks

Category: development

Custom hook and composable patterns: naming, composition, stable return shapes, lifecycle cleanup, and testing strategies. Use when writing reusable logic units (React hooks, Vue composables), refactoring logic into hooks, debugging hook behavior, or reviewing hook implementations.

206 stars, SKILL.md, updated Apr 15, 2026

thesoftwarehouse/tsh-ui-verifying

Category: testing

UI verification criteria, structure checklists, severity definitions, and tolerance rules for comparing implementations against Figma designs. Use for verifying UI matches design, understanding what to check, and determining acceptable differences.

206 stars, SKILL.md, updated Apr 15, 2026

thesoftwarehouse/tsh-transcript-processing

Category: development

Clean raw workshop or meeting transcripts of small talk, filler words, and off-topic tangents. Extract and structure business-relevant content into a standardized format with discussion topics, key decisions, action items, and open questions.

206 stars, SKILL.md, updated Apr 15, 2026

thesoftwarehouse/tsh-technical-context-discovering

Category: development

Discover and establish technical context before implementing any feature. Prioritize project instructions, existing codebase patterns, and external documentation in that order. Use for any task requiring understanding of project conventions, coding standards, architecture patterns, and established practices before writing code.

206 stars, SKILL.md, updated Apr 15, 2026


Install manually


# Clone the repo
git clone https://github.com/thesoftwarehouse/copilot-collections.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into the Claude Code skills folder (global)
cp -r copilot-collections/.github/skills/tsh-implementing-observability ~/.claude/skills/


Repository: thesoftwarehouse/copilot-collections (206 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT