peterbamuhigire/observability-monitoring

Name: observability-monitoring
Author: peterbamuhigire

observability-monitoring/SKILL.md

npx skillsauth add peterbamuhigire/skills-web-dev observability-monitoring

Observability Monitoring

Use When

Use when designing or reviewing logs, metrics, traces, alerts, SLOs, dashboards, audit events, or production telemetry for web apps, APIs, SaaS platforms, mobile backends, and AI systems. Covers instrumentation strategy, diagnosis-first telemetry, alert quality, and operational visibility.
The task needs reusable judgment, domain constraints, or a proven workflow rather than ad hoc advice.

Do Not Use When

The task is unrelated to observability-monitoring or would be better handled by a more specific companion skill.
The request only needs a trivial answer and none of this skill's constraints or references materially help.

Required Inputs

Gather relevant project context, constraints, and the concrete problem to solve; load references only as needed.
Confirm the desired deliverable: design, code, review, migration plan, audit, or documentation.

Workflow

Read this SKILL.md first, then load only the referenced deep-dive files that are necessary for the task.
Apply the ordered guidance, checklists, and decision rules in this skill instead of cherry-picking isolated snippets.
Produce the deliverable with assumptions, risks, and follow-up work made explicit when they matter.

Quality Standards

Keep outputs execution-oriented, concise, and aligned with the repository's baseline engineering standards.
Preserve compatibility with existing project conventions unless the skill explicitly requires a stronger standard.
Prefer deterministic, reviewable steps over vague advice or tool-specific magic.

Anti-Patterns

Treating examples as copy-paste truth without checking fit, constraints, or failure modes.
Loading every reference file by default instead of using progressive disclosure.

Outputs

A concrete result that fits the task: implementation guidance, review findings, architecture decisions, templates, or generated artifacts.
Clear assumptions, tradeoffs, or unresolved gaps when the task cannot be completed from available context alone.
References used, companion skills, or follow-up actions when they materially improve execution.

References

Use the references/ directory for deep detail after reading the core workflow below.

Use this skill when a system must be diagnosable in production. It covers operational telemetry, not just analytics. The goal is to make failures understandable, actionable, and bounded.

Load Order

Load world-class-engineering.
Load this skill before finalizing architecture, APIs, jobs, or release design.
Pair it with deployment-release-engineering for rollout and incident visibility.

Observability Workflow

1. Identify Critical Flows

For each critical flow define:

trigger
expected success outcome
known failure modes
business impact if degraded
operator action if it fails

Instrument the highest-impact flows first.
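The five attributes above can be captured as one small record per flow and sorted by impact. This is a minimal sketch: `CriticalFlow`, the 1 to 5 impact scale, and the example flows are hypothetical and not part of the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriticalFlow:
    """One critical flow and the facts needed to instrument it (hypothetical schema)."""
    name: str
    trigger: str
    success_outcome: str
    failure_modes: tuple
    business_impact: int  # assumed scale: 1 (minor) to 5 (severe)
    operator_action: str


def instrumentation_order(flows):
    """Return flows sorted so the highest-impact ones are instrumented first."""
    return sorted(flows, key=lambda f: f.business_impact, reverse=True)


# Illustrative flows only.
flows = [
    CriticalFlow("password-reset", "user submits reset form", "reset email sent",
                 ("mail provider timeout",), 3, "check mail provider status"),
    CriticalFlow("checkout", "user confirms order", "payment captured and order stored",
                 ("gateway timeout", "duplicate charge"), 5, "fail over to backup gateway"),
]
ordered = [f.name for f in instrumentation_order(flows)]
```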

2. Define Telemetry By Question

For every important signal, ask:

what question will this answer?
who needs the answer?
how quickly must they see it?

Use this mapping:

logs for detailed forensic context
metrics for trend, rate, saturation, and alerting
traces for multi-hop latency and dependency diagnosis
audit events for material business or security actions
profiles when CPU, memory, lock, or cost behavior matters
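The three questions and the signal mapping above can be recorded together so that every planned signal states why it exists. A sketch under assumptions: the `need` keys and the `SignalPlan` name are illustrative labels for the mapping, not terms defined by the skill.

```python
from dataclasses import dataclass

# Assumed shorthand keys for the mapping listed above.
SIGNAL_FOR_NEED = {
    "forensic context": "logs",
    "trend or alerting": "metrics",
    "multi-hop latency": "traces",
    "material action": "audit events",
    "resource behavior": "profiles",
}


@dataclass(frozen=True)
class SignalPlan:
    question: str     # what question will this answer?
    audience: str     # who needs the answer?
    max_delay_s: int  # how quickly must they see it?
    need: str         # key into SIGNAL_FOR_NEED

    @property
    def signal(self) -> str:
        """Look up the signal type; raises KeyError for an unmapped need."""
        return SIGNAL_FOR_NEED[self.need]


plan = SignalPlan("Is checkout latency regressing?", "on-call engineer", 60,
                  "trend or alerting")
```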

3. Design Correlation

Every request, job, and workflow should have:

request or trace ID
actor or service identity
tenant or ownership context where applicable
environment and version metadata
release marker or deploy version
dependency identity for important downstream calls

Without correlation, telemetry becomes noise.

4. Design High-Context Events

Prefer structured events with useful dimensions over sparse text strings.
Keep personally sensitive or secret fields out, but do not strip away the context needed to debug.
Be deliberate about cardinality. High-cardinality dimensions can be valuable when they answer real debugging questions.
Emit state transitions for long-running jobs and workflows so operators can reconstruct partial failure.
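State transitions for a long-running job could be emitted as structured events like this. It is a sketch, assuming a simple four-state lifecycle; `JobTimeline`, the state names, and the dimensions are hypothetical.

```python
import json
import time

# Assumed lifecycle: terminal states have no outgoing transitions.
ALLOWED = {
    "queued": {"running"},
    "running": {"succeeded", "failed", "partially_failed"},
}


class JobTimeline:
    """Emits one structured event per state change for a single job."""

    def __init__(self, job_id, tenant, emit=print):
        self.job_id, self.tenant, self.emit = job_id, tenant, emit
        self.state = "queued"

    def transition(self, new_state, **dimensions):
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        event = {
            "event": "job.state_changed",
            "job_id": self.job_id,
            "tenant": self.tenant,
            "from": self.state,
            "to": new_state,
            "ts": time.time(),
            **dimensions,  # extra context, e.g. batch counts
        }
        self.state = new_state
        self.emit(json.dumps(event))


emitted = []
job = JobTimeline("export-881", "acme", emit=emitted.append)
job.transition("running", batch_count=4)
job.transition("partially_failed", batches_ok=3, batches_failed=1)
```

With the `from`, `to`, and batch dimensions on each event, an operator can see how far the job got before it failed without reading free-text logs.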

5. Define SLOs And Alerts

Use SLOs for user-facing reliability, not for every internal metric.

Define:

success metric
time window
target threshold
error budget implications

Alerts should page only when immediate action is required.
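The four SLO fields translate into simple error-budget arithmetic. A minimal sketch, assuming a request-based availability SLO; the `Slo` class and the traffic numbers are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float     # e.g. 0.999 means 99.9% of requests must succeed
    window_days: int  # time window the target applies to

    def error_budget(self, total_requests: int) -> float:
        """Number of requests allowed to fail within the window."""
        return total_requests * (1 - self.target)

    def budget_remaining(self, total_requests: int, failed_requests: int) -> float:
        """Fraction of the budget still unspent; negative means overspent."""
        budget = self.error_budget(total_requests)
        if budget == 0:
            return 0.0  # no budget exists (no traffic or a 100% target)
        return 1 - failed_requests / budget


slo = Slo("checkout-availability", target=0.999, window_days=30)
budget = slo.error_budget(1_000_000)             # roughly 1000 failed requests allowed
remaining = slo.budget_remaining(1_000_000, 250)  # roughly 0.75 of the budget left
```

A remaining fraction near zero is the kind of error budget implication that might pause risky releases; the exact policy is a team decision, not something this sketch prescribes.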

6. Build Dashboards For Diagnosis

Dashboards should answer:

what is broken?
who is affected?
where is the bottleneck?
what changed recently?
what should the operator do next?

Do not create vanity dashboards that cannot guide action.

Telemetry Standards

Logs

Use structured logs.
Include IDs, actor context, tenant context, route or job name, and result.
Log failures with enough context to debug, but never leak secrets or sensitive payloads.
Separate business audit logs from application diagnostics.
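A failure log that keeps debugging context while dropping secrets might look like the following. The key list in `SENSITIVE_KEYS` and the field names are assumptions; a real system would define its own list.

```python
import json

# Assumed deny-list of keys whose values must never reach telemetry.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number"}


def redact(value):
    """Recursively mask sensitive keys; keep everything else for debugging."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


def failure_log(route, request_id, tenant, error, context):
    """Build one structured failure line with IDs, route, result, and safe context."""
    return json.dumps({
        "level": "error", "route": route, "request_id": request_id,
        "tenant": tenant, "result": "failure", "error": error,
        "context": redact(context),
    }, sort_keys=True)


line = failure_log("POST /login", "req-7f3a", "acme", "invalid_credentials",
                   {"username": "dana", "password": "hunter2", "attempt": 3})
```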

Metrics

Track:

request rate
error rate
latency percentiles
resource saturation
queue depth and lag
retry and fallback counts
cache hit rates where relevant
cost or token usage where relevant
saturation signals for pools, workers, rate limits, or thread usage

Prefer percentiles and rates over averages.
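A small synthetic example shows why percentiles are preferred: a few slow requests barely move the mean but dominate the tail. The latency values are made up for illustration.

```python
import statistics

# 98 fast requests and 2 very slow ones, in milliseconds (synthetic data).
latencies_ms = [20] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)

# quantiles(n=100) returns the 99 cut points p1..p99.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50_ms, p99_ms = cuts[49], cuts[98]
```

Here the mean lands near 60 ms, which matches no real request, while the median stays at 20 ms and p99 exposes the 2000 ms tail.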

Traces

Trace:

requests crossing service or process boundaries
expensive background workflows
external dependencies
AI or retrieval pipelines with multiple stages
deploy markers and notable async transitions when the platform supports them

Audit Events

Audit events are required for:

auth and role changes
financial or ledger-affecting actions
entitlement changes
exports, deletions, and approvals
AI actions with external or privileged side effects
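An audit event can be modelled as a small validated record written to its own sink, separate from diagnostic logs. This is a sketch: the action names in `AUDITED_ACTIONS`, the field set, and `record_audit` are assumptions chosen to mirror the list above.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

# Hypothetical action names covering the categories listed above.
AUDITED_ACTIONS = {
    "role.changed", "ledger.adjusted", "entitlement.changed",
    "data.exported", "data.deleted", "request.approved", "ai.tool_executed",
}


@dataclass(frozen=True)
class AuditEvent:
    action: str
    actor: str
    tenant: str
    target: str
    outcome: str
    occurred_at: str

    def __post_init__(self):
        # Reject unknown actions so the audit vocabulary stays reviewable.
        if self.action not in AUDITED_ACTIONS:
            raise ValueError(f"unknown audit action: {self.action}")


def record_audit(sink, action, actor, tenant, target, outcome="success"):
    """Append one audit event to a dedicated sink (a list stands in for storage)."""
    event = AuditEvent(action, actor, tenant, target, outcome,
                       datetime.now(timezone.utc).isoformat())
    sink.append(json.dumps(asdict(event), sort_keys=True))
    return event


audit_log = []
record_audit(audit_log, "role.changed", "admin-7", "acme", "user-42:viewer->editor")
```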

AI And Cost-Aware Telemetry

For AI-enabled systems, capture:

model, prompt version, and tool path
retrieval stages and source counts
token, latency, and cost budgets
eval outcomes or quality checks where available
fallback, refusal, and validation failures
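The fields above could be grouped into one record per model call, with cost derived from token counts. A sketch only: the prices in `PRICE_PER_1K`, the model name, and the budget thresholds are hypothetical, and real prices vary by provider and model.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens.
PRICE_PER_1K = {"example-model": {"input": 0.50, "output": 1.50}}


@dataclass
class LlmCallRecord:
    model: str
    prompt_version: str
    tool_path: tuple
    retrieved_sources: int
    input_tokens: int
    output_tokens: int
    latency_ms: float
    outcome: str  # e.g. "ok", "fallback", "refusal", "validation_failed"

    @property
    def cost_usd(self) -> float:
        """Cost of this call from token counts and the assumed price table."""
        price = PRICE_PER_1K[self.model]
        return (self.input_tokens * price["input"]
                + self.output_tokens * price["output"]) / 1000

    def over_budget(self, max_cost_usd: float, max_latency_ms: float) -> bool:
        """True when the call exceeds either the cost or the latency budget."""
        return self.cost_usd > max_cost_usd or self.latency_ms > max_latency_ms


call = LlmCallRecord("example-model", "summarize-v3", ("search", "summarize"),
                     retrieved_sources=6, input_tokens=2000, output_tokens=400,
                     latency_ms=1800.0, outcome="ok")
```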

Alert Design Rules

Page on symptoms that require immediate human action.
Ticket on trends or degradations that can wait.
Dashboard everything else.
Avoid alerts without a runbook path.
Include environment, service, version, impact, and likely first checks.

See references/alert-design.md.
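The page, ticket, and dashboard rules above can be expressed as a small routing function. This sketch assumes two boolean classifications per alert and treats a missing runbook as an error; the `Alert` fields and `route` are illustrative and do not reproduce references/alert-design.md.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    name: str
    needs_immediate_action: bool
    is_trend_or_degradation: bool
    runbook: Optional[str]
    environment: str
    service: str
    version: str
    impact: str
    first_checks: tuple


def route(alert: Alert) -> str:
    """Return 'page', 'ticket', or 'dashboard' for an alert."""
    if alert.needs_immediate_action:
        destination = "page"
    elif alert.is_trend_or_degradation:
        destination = "ticket"
    else:
        return "dashboard"  # visible, but nobody is notified
    if not alert.runbook:
        raise ValueError(f"{alert.name}: alert has no runbook path")
    return destination


page = Alert("checkout-5xx", True, False, "runbooks/checkout-5xx.md",
             "prod", "checkout", "1.8.3", "orders failing", ("check gateway status",))
trend = Alert("disk-growth", False, True, "runbooks/disk.md",
              "prod", "db", "4.2.0", "disk full in about 10 days", ("check retention",))
info = Alert("cache-hit-rate", False, False, None,
             "prod", "api", "1.8.3", "none", ())
```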

Deliverables

For significant systems, produce:

telemetry map for critical flows
SLO definitions
alert list with severity and owner
dashboard outline
audit event list
trace and correlation ID strategy
cardinality and sensitive-data guardrails

Review Checklist

[ ] Critical flows have explicit telemetry.
[ ] IDs and tenant context are correlated across logs, metrics, and traces.
[ ] Events contain enough context to debug without unsafe data leakage.
[ ] Alerts map to operator action, not mere curiosity.
[ ] SLOs reflect user impact, not internal implementation trivia.
[ ] Audit events are defined for material actions.
[ ] Sensitive data is excluded or redacted from telemetry.

References

references/alert-design.md: Alert severity and routing rules.
references/diagnosis-first-observability.md: Event design, cardinality, release markers, and AI telemetry.
references/slo-template.md: SLO template and service questions.

Stars: 8
Category: development
Updated: Apr 15, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add peterbamuhigire/skills-web-dev observability-monitoring

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 15, 2026, 9:15 AM · 9.5s · 4 files scanned

SKILL.md

name: observability-monitoring
description: Use when designing or reviewing logs, metrics, traces, alerts, SLOs,
portable: true

Related Skills

peterbamuhigire/ai-analytics-saas
Category: data-ai
Verified · Trusted · Community
Use when adding AI-powered analytics to a SaaS platform: semantic search over business data, natural language queries, trend detection, anomaly alerts, and AI-generated insights for dashboards. Covers embeddings, NL2SQL, and per-tenant analytics...
10 · SKILL.md · Updated Apr 15, 2026

peterbamuhigire/ai-analytics-dashboards
Category: data-ai
Verified · Trusted · Community
Design AI-powered analytics dashboards: what metrics to show, how to display AI predictions and confidence, drill-down patterns, KPI cards, trend visualisation, AI Insights panels, export design, and role-based dashboard variants. Invoke when...
9 · SKILL.md · Updated Apr 15, 2026

peterbamuhigire/world-class-engineering
Category: development
Verified · Trusted · Community
Use when designing, building, reviewing, or upgrading production software systems that must be secure, performant, maintainable, scalable, and user-centered. Apply before writing specs, code, architecture, APIs, databases, mobile apps, SaaS platforms, or ERP systems.
8 · SKILL.md · Updated Apr 15, 2026

peterbamuhigire/webapp-gui-design
Category: development
Verified · Trusted · Community
Professional web app UI using commercial templates (Tabler/Bootstrap 5) with strong frontend design direction when needed. Use for CRUD interfaces, dashboards, admin panels with SweetAlert2, DataTables, Flatpickr. Clone seeder-page.php, use...
8 · SKILL.md · Updated Apr 15, 2026

Install manually

# Clone the repo
git clone https://github.com/peterbamuhigire/skills-web-dev.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into the Claude Code skills folder (global)
cp -r skills-web-dev/observability-monitoring ~/.claude/skills/

See the official Claude Code Skills docs for the skills path.

Repository: peterbamuhigire/skills-web-dev (8 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT