skills/observability/SKILL.md
Investigate production issues using logs, traces, and errors — how to triage, correlate signals, and know when to escalate.
Use this skill to investigate production problems. For query syntax, index patterns, and example queries, consult the log query reference.
Start with the symptom, not the tool. Before querying anything:
This prevents aimless log-scrolling and makes findings interpretable.
Work top-down — coarser signals first, drill into finer ones only when needed:
| Signal | What it tells you | When to use |
|---|---|---|
| Error rate / rate spike | Something broke at scale | First check: confirms the problem is real |
| APM traces | Which transaction is slow or failing, full call chain | Once you know the scope |
| APM errors | Exception type, stack trace, grouping key | When you need the root cause code path |
| Logs | Raw context around a specific event | When traces don't have enough detail |
Don't start with logs. Start with traces or error groups, then use trace.id to pull the surrounding log context.
The trace.id field links all three indices (logs-*, traces-apm*, logs-apm.error-*). Once you have a trace.id from an error or slow trace, use it to pull all logs from that same request:
{"term": {"trace.id": "<trace-id-here>"}}
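The term filter above can be wrapped into a full search body. The following is a minimal Python sketch, assuming an Elasticsearch-style query DSL and that `trace.id` is mapped as an exact-match (keyword) field. The helper name `logs_for_trace` and the default `size` are hypothetical and not part of this skill.

```python
import json

# Index pattern named in the text above; adjust to your deployment.
LOG_INDEX = "logs-*"


def logs_for_trace(trace_id: str, size: int = 200) -> dict:
    """Build a search body that returns the logs of one request, oldest first.

    Hypothetical helper: it only constructs the request body and does not
    talk to any cluster.
    """
    if not trace_id:
        raise ValueError("trace_id must be a non-empty string")
    return {
        "size": size,
        # Exact match on the correlation field shared by logs, traces and errors.
        "query": {"term": {"trace.id": trace_id}},
        # Chronological order makes the request easier to read top to bottom.
        "sort": [{"@timestamp": {"order": "asc"}}],
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into whatever query tool you use.
    print(json.dumps(logs_for_trace("<trace-id-here>"), indent=2))
```

Sorting ascending by `@timestamp` is a design choice for reading a single request as a timeline. Send the body to the `logs-*` pattern with whichever client your environment provides.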
Before querying, write down what a "confirmed" answer looks like. Examples:
This prevents misreading absence of evidence as evidence of absence.
Stop investigating and escalate to the team when:
| Symptom | Where to look first |
|---|---|
| Slow page loads | APM traces — sort by transaction.duration.us desc |
| 500 errors spiking | APM errors — group by error.grouping_key |
| One user affected | Logs — filter by user ID or session ID |
| Periodic issue | Logs — look for time pattern in @timestamp |
| After a deploy | APM errors — filter by @timestamp after deploy time |
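
Three rows of the table above can be sketched as query bodies. This is an illustrative Python sketch under stated assumptions: an Elasticsearch-style query DSL, the field names exactly as written in the table, and ISO 8601 timestamps. The function names, the `by_group` aggregation name, and the default sizes are hypothetical.

```python
from typing import Optional


def slowest_transactions(size: int = 20) -> dict:
    """Slow page loads: APM traces sorted by duration, slowest first."""
    return {
        "size": size,
        "sort": [{"transaction.duration.us": {"order": "desc"}}],
    }


def errors_by_group(since: Optional[str] = None, top: int = 10) -> dict:
    """500s spiking, or after a deploy: count APM errors per grouping key.

    `since` is an ISO 8601 timestamp (for example the deploy time); when
    given, only errors at or after it are counted.
    """
    body: dict = {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "aggs": {
            "by_group": {  # hypothetical aggregation name
                "terms": {"field": "error.grouping_key", "size": top}
            }
        },
    }
    if since is not None:
        body["query"] = {"range": {"@timestamp": {"gte": since}}}
    return body


if __name__ == "__main__":
    print(slowest_transactions())
    print(errors_by_group(since="2024-01-01T12:00:00Z"))
```

The first body would go to the `traces-apm*` pattern and the second to `logs-apm.error-*`. Comparing `errors_by_group()` with and without `since` gives a quick view of which error groups are new after a deploy.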