akshay-na/observability-as-design

Name: observability-as-design
Author: akshay-na

ai/cursor-tech-team/skills/observability-as-design/SKILL.md

npx skillsauth add akshay-na/dotfiles observability-as-design


Observability as Design

Overview

Design systems that are measurable, debuggable, and operationally transparent. Observability is not an afterthought -- it is a design constraint. If you cannot measure it, you cannot trust it.

When to Use

- Before implementing a new service or endpoint.
- When defining SLOs or alerting thresholds.
- When structured logging needs a schema.
- When tracing boundaries between services are unclear.
- When an outage reveals missing signals.
- When debugging requires guesswork instead of data.

When NOT to Use

- Adding metrics to throwaway prototypes.
- Instrumenting code paths with no production traffic.

Core Competencies

| Competency | Key Question |
|---|---|
| RED metrics | Are rate, error rate, and duration measured for every entry point? |
| Structured logging | Do logs carry request ID, user context, and operation outcome? |
| SLO candidates | What is the availability and latency target? How is it measured? |
| Leading indicators | What signals warn of failure before users notice? |
| Tracing boundaries | Can a request be traced across every service it touches? |
| Health checks | Do readiness and liveness probes reflect actual capability? |

Practice Loop

1. Define metrics before coding -- decide what to measure as part of the design, not after deployment.
2. Simulate failure -- break a dependency and verify that the right signals fire.
3. Test 2AM scenarios -- can an on-call engineer diagnose the issue from dashboards and logs alone?
4. Improve signal clarity -- refine metrics and log fields that were ambiguous during debugging.
5. Remove noise -- eliminate redundant or low-value metrics that obscure real signals.

Failure Signals

Stop and reassess if you observe:

| Signal | Root Cause |
|---|---|
| Guesswork debugging | Missing metrics or unstructured logs at failure points |
| Missing metrics during outage | Instrumentation added reactively, not by design |
| Logs without context | No request ID, no user context, no operation outcome |
| No early-warning indicators | Only alerting on failures, not on degradation trends |
| Alert fatigue | Too many noisy alerts; real signals drowned out |
| Health check lies | Probe returns healthy while the service cannot serve traffic |

Quick Reference

Observability Design Template:

## RED Metrics (per entry point)
- Rate: requests per second
- Errors: error rate and error type breakdown
- Duration: latency at p50, p95, p99
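
The three RED signals listed above can be sketched with a small in-memory recorder. This is only an illustration under stated assumptions: the `RedMetrics` class and its method names are hypothetical, percentiles use the simple nearest-rank method, and a real service would more likely export these values through a metrics library than hold raw samples in memory.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RedMetrics:
    """Hypothetical in-memory RED metrics for a single entry point."""
    durations_ms: list = field(default_factory=list)
    errors: int = 0

    def record(self, duration_ms: float, ok: bool) -> None:
        """Record one finished request and whether it succeeded."""
        self.durations_ms.append(duration_ms)
        if not ok:
            self.errors += 1

    def rate(self, window_seconds: float) -> float:
        """Rate: requests per second over the observation window."""
        return len(self.durations_ms) / window_seconds

    def error_rate(self) -> float:
        """Errors: fraction of recorded requests that failed."""
        total = len(self.durations_ms)
        return self.errors / total if total else 0.0

    def percentile(self, p: float) -> float:
        """Duration: nearest-rank percentile of recorded latencies."""
        if not self.durations_ms:
            raise ValueError("no samples recorded")
        ordered = sorted(self.durations_ms)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


metrics = RedMetrics()
for i in range(1, 101):
    # 100 requests with latencies 1..100 ms; every 20th request fails.
    metrics.record(duration_ms=float(i), ok=(i % 20 != 0))

print(metrics.rate(window_seconds=10))  # 10.0 requests per second
print(metrics.error_rate())             # 0.05
print(metrics.percentile(50), metrics.percentile(95), metrics.percentile(99))
```

Keeping one recorder per entry point matches the "per entry point" framing of the template; a breakdown by error type could be added with a counter keyed by error class.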

## Structured Log Schema
- timestamp (ISO 8601)
- level (info, warn, error)
- request_id (trace correlation)
- user_id (who)
- operation (what)
- outcome (success, failure, degraded)
- duration_ms (how long)
- error_message (if failed)
- context (domain-specific fields)
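
One way to apply the log schema above is to emit one JSON object per line. The sketch below assumes a hypothetical `log_event` helper and made-up field values; only the field names come from the schema, and printing to stdout stands in for whatever log sink a real service uses.

```python
import json
from datetime import datetime, timezone


def log_event(level, request_id, user_id, operation, outcome,
              duration_ms, error_message=None, **context):
    """Build and print one JSON log line with the schema's fields.

    `log_event` is a hypothetical helper, not part of any library.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601
        "level": level,
        "request_id": request_id,
        "user_id": user_id,
        "operation": operation,
        "outcome": outcome,
        "duration_ms": duration_ms,
    }
    if error_message is not None:
        record["error_message"] = error_message  # only present on failure
    if context:
        record["context"] = context  # domain-specific fields
    line = json.dumps(record)
    print(line)
    return line


# Illustrative values only.
log_event("error", "req-123", "user-42", "create_order", "failure",
          duration_ms=87.5, error_message="payment gateway timeout",
          order_id="ord-9")
```

Because every line is a self-describing JSON object, log queries can filter on `request_id` or `outcome` instead of parsing free-form strings.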

## SLO Candidates
- Availability: % of successful responses over time window
- Latency: % of requests under target duration
- Error budget: remaining tolerance before SLO breach
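
The error budget line above reduces to simple arithmetic: an availability target of 99.9% over a window allows 0.1% of requests to fail, and the budget is whatever part of that allowance has not been used yet. A minimal sketch, assuming a request-based availability SLO; the function name and the sample numbers are illustrative, not taken from the skill.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means untouched, 0.0 means exhausted, negative means the SLO
    has been breached for this window.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
availability = 1 - 250 / 1_000_000
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"availability={availability:.5f} budget_remaining={remaining:.2f}")
```

With these numbers roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget.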

## Health Checks
- Liveness: is the process running and not deadlocked?
- Readiness: can the service accept and process requests?
- Dependency checks: are critical downstream services reachable?
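
The distinction between the probes above can be sketched as two plain functions: liveness only confirms the process can respond, while readiness runs dependency checks and reports not ready if any of them fails. The check names and return shapes below are assumptions for illustration; wiring these into an HTTP endpoint or an orchestrator probe is left out.

```python
from typing import Callable, Dict


def liveness() -> dict:
    """Liveness: answering at all shows the process is not deadlocked."""
    return {"status": "alive"}


def readiness(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Readiness: every critical dependency check must pass.

    A check that raises is treated as failed, so the probe cannot
    report healthy while a dependency is unreachable.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    ready = all(results.values())
    return {"status": "ready" if ready else "not_ready", "checks": results}


def queue_writable() -> bool:
    # Hypothetical check standing in for a real queue write; it
    # simulates a broken dependency.
    raise ConnectionError("queue unavailable")


print(liveness())
print(readiness({"database_reachable": lambda: True,
                 "queue_writable": queue_writable}))
```

Breaking one check on purpose, as done here with `queue_writable`, is also a cheap way to rehearse the "simulate failure" step and confirm the probe reflects actual capability.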

## Leading Indicators
- Queue depth trending upward
- Latency percentile drift (p99 rising while p50 is stable)
- Error rate increase on a single dependency
- Connection pool utilization approaching limit
- Memory or CPU approaching saturation
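
As an illustration of one leading indicator from the list above, the sketch below flags latency percentile drift: p99 growing while p50 stays flat. The function names and the thresholds (`p99_growth=1.5`, `p50_tolerance=1.1`) are assumptions chosen for the example, and latencies are assumed to be positive.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def tail_latency_drift(baseline_ms, current_ms,
                       p99_growth=1.5, p50_tolerance=1.1):
    """True when p99 has grown noticeably while p50 is still stable."""
    p50_ratio = percentile(current_ms, 50) / percentile(baseline_ms, 50)
    p99_ratio = percentile(current_ms, 99) / percentile(baseline_ms, 99)
    return p99_ratio >= p99_growth and p50_ratio <= p50_tolerance


# Illustrative samples: the median is unchanged, only the tail degrades.
baseline = [10.0] * 98 + [50.0, 60.0]
current = [10.0] * 98 + [200.0, 400.0]

print(tail_latency_drift(baseline, current))   # True
print(tail_latency_drift(baseline, baseline))  # False
```

A signal like this can fire before the error rate moves, which is what makes it a leading rather than a lagging indicator.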

Common Mistakes

| Mistake | Fix |
|---|---|
| Adding metrics after the first outage | Define metrics as part of design, before writing code |
| Logging free-form strings | Use structured logging with a consistent schema |
| Health check that only pings | Verify actual capability: database reachable, queue writable |
| Alerting on every error | Alert on error rate thresholds, not individual errors |
| No request correlation | Propagate a request ID through every service boundary |
| Measuring only averages | Averages hide tail latency; always measure p95 and p99 |
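
The "alert on error rate thresholds" fix can be sketched as a sliding window over recent request outcomes. Everything here is an assumption for illustration: the `ErrorRateAlert` class, the window size, the 5% threshold, and the minimum sample count that keeps a single early error from firing the alert.

```python
from collections import deque


class ErrorRateAlert:
    """Hypothetical alert that fires on error rate, not on single errors."""

    def __init__(self, window_size=100, threshold=0.05, min_samples=20):
        self.outcomes = deque(maxlen=window_size)  # True = success
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, ok: bool) -> bool:
        """Record one outcome; return True if the alert should fire."""
        self.outcomes.append(ok)
        if len(self.outcomes) < self.min_samples:
            return False  # too little data to judge a rate
        error_rate = self.outcomes.count(False) / len(self.outcomes)
        return error_rate > self.threshold


alert = ErrorRateAlert(window_size=100, threshold=0.05, min_samples=20)
for _ in range(49):
    alert.record(True)

print(alert.record(False))  # False: 1 error in 50 requests (2%)
print(alert.record(False))  # False: 2 errors in 51 requests (about 3.9%)
print(alert.record(False))  # True: 3 errors in 52 requests (about 5.8%)
```

An isolated error stays silent, which addresses alert fatigue, while a sustained rise crosses the threshold within a few requests.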

Use when defining metrics before implementation, designing structured logging, setting SLO targets, instrumenting tracing boundaries, building health checks, or when an outage reveals missing signals and blind spots in operational visibility.

4 stars

development

Updated May 9, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add akshay-na/dotfiles observability-as-design

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 2, 2026, 5:31 PM (64.0s, 1 file scanned)

Related Skills

- akshay-na/team-discovery (development): Discovery + naming convention reference for typed dev/SME/QA/devops team members in any workspace folder. Primary consumer: `tech-lead` (org-tier). Updated May 28, 2026.
- akshay-na/task-orchestration (devops): Automated task classification, agent selection, and state tracking. Use when routing tasks to agents, selecting pipelines, or managing task state. Updated May 28, 2026.
- akshay-na/systems-design-depth (testing): Use when designing scalable systems, evaluating consistency models, planning state management, making architectural decisions, or when trade-offs around coupling, failure isolation, and reversibility need explicit reasoning before implementation. Updated May 28, 2026.
- akshay-na/swarm-task-decomposition (tools): CTO/tech-lead helper: split work into disjoint shard briefs with caps (instance_cap, partition_basis, determinism keys). Updated May 28, 2026.

Install manually

# Clone the repo
git clone https://github.com/akshay-na/dotfiles.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into the Claude Code skills folder (global)
cp -r dotfiles/ai/cursor-tech-team/skills/observability-as-design ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: akshay-na/dotfiles (4 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT