plugin/skills/observability-methods/SKILL.md
Use this skill when picking a diagnostic vocabulary for a latency, error-rate, saturation, or crashloop investigation, when authoring or reviewing an analysis workflow that needs the method reference, or when correlating SLI metrics and active alerts against a method's signal set — a knowledge skill of industry-canonical observability methodologies (Four Golden Signals, RED, USE, Distributed Tracing) with method-to-problem mapping that explains which signals each method surfaces and when each method applies. Loaded by `/analyze-prod`, `/analyze-local`, `/env-analyze`, `/infra-change`, and `/bugfix` workflows when production-context diagnosis needs a named methodology.
`npx skillsauth add avav25/ai-assets observability-methods`
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Reference for the four industry-canonical observability methodologies used by production-analysis and incident-response workflows in this plugin. Pick one named method per problem class to give analysis a vocabulary handle and a defined signal coverage. Cross-reference SLI metrics, error-budget burn, and active alerts against the chosen method's signals.
Source: SRE Book Ch. 6 — Monitoring Distributed Systems.
Four Golden Signals: user-facing service monitoring. Tracks the four signals a user-visible service must expose: latency, traffic, errors, and saturation.
Apply when: the service is user-facing, has a defined SLO, or owns an SLI.
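Since this method applies when a service has a defined SLO, the error-budget burn mentioned in the introduction can be illustrated with a small calculation. The sketch below assumes the common definition of burn rate as the observed error ratio divided by the error budget (1 minus the SLO target); the function name `burn_rate` and its parameters are hypothetical and not part of this skill.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    Assumption: burn rate = observed error ratio / (1 - SLO target).
    A value of 1.0 means the budget is spent exactly over the SLO window;
    values above 1.0 mean it runs out early.
    """
    if not 0.0 <= slo_target < 1.0:
        raise ValueError("slo_target must be in [0, 1)")
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    return error_ratio / budget


# Example: 0.5% of requests failing against a 99.9% availability SLO.
rate = burn_rate(error_ratio=0.005, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
```

A burn rate well above 1 on the Errors signal is a reasonable prompt to look at the other three signals next, since it says the budget is draining but not why.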
Source: RED Method.
Request-driven microservices. Three signals per service: Rate, Errors, and Duration.
Apply when: diagnosing a request-response microservice, especially with many small services where per-service uniformity matters more than resource depth. Strong fit for slow-API and 5xx-spike investigations.
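As a rough illustration of what RED surfaces for one service, the sketch below derives rate, error ratio, and p99 duration from a list of request records. The `Request` type, the `red_summary` function, the choice of HTTP status 500 and above as the error condition, and the nearest-rank percentile are all assumptions made for the example; a real investigation would read these values from the metrics backend.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code (assumed transport)


def red_summary(requests: List[Request], window_s: float) -> dict:
    """Compute Rate, Errors, Duration for one service over one time window."""
    n = len(requests)
    if n == 0:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p99_ms": None}

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p99 using integer arithmetic: rank = ceil(0.99 * n).
    rank = (99 * n + 99) // 100
    p99: Optional[float] = durations[rank - 1]

    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": n / window_s,    # Rate: requests per second
        "error_ratio": errors / n,   # Errors: fraction of failed requests
        "p99_ms": p99,               # Duration: tail latency
    }


# Example: 100 requests in a 10-second window, 5 of them server errors.
sample = [Request(float(i), 500 if i <= 5 else 200) for i in range(1, 101)]
print(red_summary(sample, window_s=10.0))
```

Computing the same three numbers for every service is what gives the per-service uniformity described above.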
Source: USE Method.
Resource-driven. For every resource (CPU, memory, disk, network, queue), check three signals: Utilization, Saturation, and Errors.
Apply when: resource exhaustion is suspected — OOMKilled pods, CPU throttling, disk pressure, node-level pressure, container crashloops driven by limits.
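The per-resource check can be sketched as a loop over resources that flags each of the three signals independently. Everything here is illustrative: the `ResourceSample` type, the `use_findings` function, and the 0.8 utilization threshold are assumptions, not values defined by this skill.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ResourceSample:
    name: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    saturation: float   # queued or waiting work (e.g. run-queue depth)
    errors: int         # resource-level error count (e.g. OOM kills)


def use_findings(
    samples: List[ResourceSample], util_threshold: float = 0.8
) -> List[Tuple[str, str]]:
    """Return (resource, signal) pairs that deserve a closer look."""
    findings = []
    for s in samples:
        if s.errors > 0:
            findings.append((s.name, "errors"))
        if s.saturation > 0:
            findings.append((s.name, "saturation"))
        if s.utilization >= util_threshold:  # threshold is an assumption
            findings.append((s.name, "utilization"))
    return findings


samples = [
    ResourceSample("cpu", utilization=0.95, saturation=3.0, errors=0),
    ResourceSample("memory", utilization=0.40, saturation=0.0, errors=2),
    ResourceSample("disk", utilization=0.10, saturation=0.0, errors=0),
]
for resource, signal in use_findings(samples):
    print(f"{resource}: {signal}")
```

Checking every resource against all three signals, rather than stopping at the first hit, is the point of the method: a memory error count can matter even when utilization looks low.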
For latency or cross-service failures, a single trace through 5–7 services is the canonical root-cause path:
`service.name`, `trace.id`, or slow-trace heatmap. Honeycomb BubbleUp narrows attributes that correlate with slow traces.
Apply when: a request crosses ≥2 services, latency is high but no single service is obviously saturated, error blame is unclear, or aggregate metrics show the symptom but not the cause.
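To make "finding the slow span" concrete, the sketch below ranks the spans of a single trace by self time, meaning a span's own duration minus the time spent in its direct children. The `Span` type and `slowest_span` function are hypothetical, and the calculation assumes child spans do not overlap in time; tracing backends expose this kind of view directly, so this only illustrates the idea.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: float
    end_ms: float


def slowest_span(spans: List[Span]) -> Span:
    """Return the span with the largest self time in one trace."""
    # Sum the duration of direct children per parent span.
    child_time: Dict[str, float] = {}
    for s in spans:
        if s.parent_id is not None:
            child_time[s.parent_id] = (
                child_time.get(s.parent_id, 0.0) + (s.end_ms - s.start_ms)
            )

    def self_time(s: Span) -> float:
        # Assumes children run sequentially (no overlap).
        return (s.end_ms - s.start_ms) - child_time.get(s.span_id, 0.0)

    return max(spans, key=self_time)


trace = [
    Span("a", None, "gateway", 0, 500),
    Span("b", "a", "api", 10, 490),
    Span("c", "b", "db", 20, 450),
    Span("d", "b", "cache", 450, 460),
]
hot = slowest_span(trace)
print(hot.service, hot.end_ms - hot.start_ms)
```

Ranking by self time rather than total duration keeps the root span, which always has the longest total duration, from masking the hop that actually spends the time.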
| Problem | Method | Why |
|---|---|---|
| Slow API | RED | Duration p99 tail is the surfaced signal |
| 5xx spike | Golden Signals | Errors + Saturation cover cause and capacity together |
| OOMKilled / crashloop | USE | Memory/CPU saturation + resource errors |
| Customer-reported latency | RED + Distributed Tracing | RED localizes the slow service, tracing finds the slow span |
| Node / disk pressure | USE | Resource-axis Utilization + Saturation |
| Cross-service failure (no single hotspot) | Distributed Tracing | Single trace reveals the failing hop |
| New service with SLO | Golden Signals | Establish baseline for all four user-facing signals |

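The method-to-problem mapping can also be expressed as a lookup, which may be convenient when a workflow needs to pick a method programmatically. The dictionary keys below are hypothetical identifiers invented for this sketch; only the problem-to-method pairs come from the table.

```python
from typing import Dict, Tuple

# Problem class -> method(s), transcribed from the mapping table.
# The snake_case keys are illustrative, not names used by the plugin.
METHOD_FOR_PROBLEM: Dict[str, Tuple[str, ...]] = {
    "slow_api": ("RED",),
    "5xx_spike": ("Golden Signals",),
    "oomkilled_crashloop": ("USE",),
    "customer_reported_latency": ("RED", "Distributed Tracing"),
    "node_disk_pressure": ("USE",),
    "cross_service_failure": ("Distributed Tracing",),
    "new_service_with_slo": ("Golden Signals",),
}


def pick_methods(problem: str) -> Tuple[str, ...]:
    """Return the method(s) mapped to a problem class."""
    try:
        return METHOD_FOR_PROBLEM[problem]
    except KeyError:
        known = ", ".join(sorted(METHOD_FOR_PROBLEM))
        raise ValueError(f"unknown problem {problem!r}; expected one of: {known}")


print(pick_methods("customer_reported_latency"))
```

Raising on an unknown problem class, instead of falling back to a default method, keeps the choice of method explicit.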
| Workflow | Apply this knowledge |
|---|---|
| /analyze-prod (snapshot phase) | Pick a named method per problem class before running queries; cross-reference SLI/SLO and alerts against the method's signals |
| /analyze-local (Docker logs) | Apply USE on container resource limits when local services crashloop or are OOMKilled |
| /env-analyze (multi-scope) | Use Golden Signals or RED to frame service-level findings; USE for node/resource layer |
| /infra-change (post-apply verify) | Use Golden Signals to confirm SLO is not regressed by the change |
| /bugfix (production-context bugs) | Use RED + tracing to localize the failing service / span before code-level investigation |
/analyze-prod, /analyze-local, /env-analyze, /infra-change, /bugfix
@telemetry-stacks (vendor-specific queries that surface these signals), @cloud-platforms (managed-service metric sources), @deployment-procedures (post-deploy health verification)