plugin/skills/analyze-prod/SKILL.md
Use this skill when investigating a production incident, when an alert fires (latency spike, error rate, pod crashloop), when a customer-reported issue needs prod telemetry, or as the diagnosis step of an incident-response or production-bugfix flow — including when the user describes a prod symptom without asking to "analyze" — to analyze the production environment by collecting Kubernetes pod status, managed database health, logs, metrics, and networking and diagnosing issues, supporting GCP, Azure, and AWS via the `cloud-platforms` skill and applying the SRE or DevOps role.
Install this skill globally with one command: `npx skillsauth add avav25/ai-assets analyze-prod`. Works with Claude Code, Cursor, and Windsurf.
Systematic production-environment analysis. Collects cluster status, pod health, database metrics, logs, and diagnoses issues. Standalone or as the entry point for production bugfixing.
Cloud platform detection: Read CLAUDE.md to identify the cloud platform, then load the matching reference module (GCP / Azure / AWS) from @cloud-platforms.
⚠️ SAFETY: READ-ONLY commands only. No mutations (scale, delete, restart, deploy) without explicit user approval at Step 6.
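One way to make the read-only rule mechanical is an allowlist wrapper around `kubectl`. The sketch below is illustrative only: `is_read_only_kubectl` and `safe_kubectl` are hypothetical helper names, and the allowlist itself is an assumption rather than something this skill defines.

```shell
# Hypothetical helper: succeeds (exit 0) only for kubectl subcommands on a
# read-only allowlist. The allowlist is an assumption, not part of the skill.
is_read_only_kubectl() {
  case "${1:-}" in
    get|describe|logs|top|events|version|cluster-info|api-resources|explain)
      return 0 ;;
    config)
      # Only the inspection subcommands of `kubectl config` are allowed.
      case "${2:-}" in
        current-context|get-contexts|view) return 0 ;;
      esac
      return 1 ;;
    *)
      return 1 ;;
  esac
}

# Hypothetical wrapper: runs kubectl only when the subcommand is allowlisted.
safe_kubectl() {
  if is_read_only_kubectl "$@"; then
    kubectl "$@"
  else
    echo "blocked: 'kubectl ${1:-}' is not read-only; explicit approval required" >&2
    return 1
  fi
}
```

The check is deliberately conservative: anything not listed, including invocations that put flags before the verb, is refused rather than guessed at.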
Ask the user:
If invoked as part of a bugfix / incident flow, extract context from the parent conversation instead.
Select role by problem type:
| Problem Type | Primary Role |
|---|---|
| Pod crashes, restarts, health-check failures | Agent(sre-engineer) |
| High latency, error-rate spikes, SLO burn | Agent(sre-engineer) |
| Managed DB issues (connections, replication, CPU) | Agent(sre-engineer) |
| Networking, ingress, DNS, connectivity | Agent(sre-engineer) + Agent(devops-engineer) |
| Deployment failures, rollback needed | Agent(devops-engineer) |
| Terraform / infra drift, resource config | Agent(devops-engineer) |
| Application errors in logs | Stack-specific (Agent(java-engineer) / Agent(python-engineer) / Agent(frontend-engineer)) |
| General / unclear | Agent(sre-engineer) |
Announce the applied role(s). For P1/P2 incidents, always apply Agent(sre-engineer).
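The table and the P1/P2 rule above can be sketched as a small lookup. The problem-type keywords (`pod-crash`, `latency`, and so on) and the function names are hypothetical; only the role names come from the table.

```shell
# Hypothetical mapping from a short problem-type keyword to the primary role(s).
select_role() {
  case "$1" in
    pod-crash|latency|slo-burn|database)
      echo "sre-engineer" ;;
    networking)
      echo "sre-engineer devops-engineer" ;;
    deployment|infra-drift)
      echo "devops-engineer" ;;
    app-error)
      echo "stack-specific" ;;
    *)
      # General / unclear problems default to the SRE role.
      echo "sre-engineer" ;;
  esac
}

# Adds sre-engineer for P1/P2 incidents when it is not already selected.
roles_for_incident() {
  roles=$(select_role "$1")
  case "$2" in
    P1|P2)
      case " $roles " in
        *" sre-engineer "*) ;;
        *) roles="sre-engineer $roles" ;;
      esac ;;
  esac
  echo "$roles"
}
```

For example, `roles_for_incident deployment P1` yields both roles, while the same problem type at P3 yields only `devops-engineer`.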
Detect platform from CLAUDE.md and verify authentication via @cloud-platforms:
- GCP: `gcloud config get-value project` + `gcloud auth list`
- Azure: `az account show` + `az aks list`
- AWS: `aws sts get-caller-identity` + `aws eks list-clusters`

Confirm the active project / subscription / account matches the user's target.
// turbo
kubectl config current-context
kubectl cluster-info
Run the platform-specific cluster list command from @cloud-platforms to verify cluster status.
Record: Cluster name, location, version, node count, current kubectl context.
Run the following read-only diagnostics. Present results as a structured summary.
// turbo
kubectl get nodes -o wide
kubectl top nodes
Flag: NotReady nodes, CPU/memory >80%, version skew.
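The node flags above can be applied with a pair of filters over the command output. This is a sketch under assumptions: the function names are hypothetical, the input is produced with `--no-headers`, and the column order is assumed to be `NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%` for `kubectl top nodes` and `NAME STATUS ...` for `kubectl get nodes`.

```shell
# Reads `kubectl top nodes --no-headers` rows on stdin and prints nodes whose
# CPU% or MEMORY% exceeds the threshold (first argument, default 80).
flag_hot_nodes() {
  awk -v limit="${1:-80}" '
    {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)
      if (cpu + 0 > limit + 0 || mem + 0 > limit + 0)
        printf "%s cpu=%s%% mem=%s%%\n", $1, cpu, mem
    }'
}

# Reads `kubectl get nodes --no-headers` rows on stdin and prints nodes whose
# STATUS is not exactly "Ready" (NotReady, or Ready,SchedulingDisabled).
flag_unready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Read-only usage:
#   kubectl top nodes --no-headers | flag_hot_nodes 80
#   kubectl get nodes --no-headers | flag_unready_nodes
```

Version skew is not covered here; it still needs a look at the VERSION column of `kubectl get nodes -o wide`.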
For the affected namespace (or all namespaces if unspecified):
// turbo
kubectl get pods -n <namespace> -o wide --sort-by='.status.startTime'
kubectl get pods -n <namespace> --field-selector=status.phase!=Running
Flag: CrashLoopBackOff / Error / Pending / ImagePullBackOff, restart count >3 in last hour, replica count mismatch (kubectl get deployments).
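A minimal filter for the pod flags above, assuming input from `kubectl get pods -n <namespace> --no-headers` with the column order `NAME READY STATUS RESTARTS AGE`. The function name is hypothetical. The RESTARTS column is a lifetime total, so this only approximates the "in last hour" condition; the restart timestamps in `kubectl describe pod` are needed to confirm recency.

```shell
# Prints pods in a flagged status, or with more restarts than the threshold
# (first argument, default 3). RESTARTS is a lifetime count, not per hour.
flag_unhealthy_pods() {
  awk -v max="${1:-3}" '
    ($3 ~ /^(CrashLoopBackOff|Error|Pending|ImagePullBackOff)$/) || ($4 + 0 > max + 0) {
      print $1, $3, "restarts=" $4
    }'
}

# Read-only usage:
#   kubectl get pods -n <namespace> --no-headers | flag_unhealthy_pods 3
```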
kubectl describe pod <pod-name> -n <namespace>
kubectl logs --tail=200 --timestamps <pod-name> -n <namespace>
kubectl logs --previous --tail=50 <pod-name> -n <namespace>
Look for: OOMKilled (exit 137), app exceptions, connection errors, startup failures, readiness probe failures.
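Exit code 137 is 128 + 9, meaning the container was killed with SIGKILL, which is what an OOM kill produces. The helper below (a hypothetical name, not part of the skill) sketches that convention for a few common codes; the wording of each interpretation is illustrative.

```shell
# Maps a container exit code to a rough interpretation. Codes above 128
# conventionally mean "terminated by signal (code - 128)".
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exit 0: clean exit"
  elif [ "$code" -eq 137 ]; then
    echo "exit 137: killed by signal 9 (SIGKILL), often OOMKilled"
  elif [ "$code" -eq 143 ]; then
    echo "exit 143: killed by signal 15 (SIGTERM)"
  elif [ "$code" -gt 128 ]; then
    echo "exit $code: killed by signal $((code - 128))"
  else
    echo "exit $code: application error"
  fi
}

# Read-only way to fetch the last exit code(s) of a pod's containers:
#   kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
```

A 137 alone does not prove an OOM kill; the `lastState.terminated.reason` field in the same status object should read `OOMKilled` to confirm it.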
// turbo
kubectl top pods -n <namespace> --sort-by=memory
kubectl get hpa -n <namespace>
Flag: Pods near memory limits, HPA at max replicas, CPU throttling.
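The "HPA at max replicas" flag can be checked without parsing the human-readable table, whose TARGETS column may contain spaces. The sketch below assumes tab-separated `name, currentReplicas, maxReplicas` rows produced by a jsonpath query; the function name is hypothetical.

```shell
# Reads tab-separated "name<TAB>currentReplicas<TAB>maxReplicas" rows on stdin
# and prints HPAs that have reached their replica ceiling.
flag_hpa_at_max() {
  awk -F '\t' '$2 != "" && ($2 + 0 >= $3 + 0) { print $1, $2 "/" $3 }'
}

# Read-only producer for those rows:
#   kubectl get hpa -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.currentReplicas}{"\t"}{.spec.maxReplicas}{"\n"}{end}' \
#     | flag_hpa_at_max
```

An HPA pinned at its maximum usually means demand exceeds what autoscaling can absorb, so it is worth reading alongside the `kubectl top pods` output.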
DB diagnostic commands per cloud — see @cloud-platforms. Key metrics: CPU / memory / disk utilization, active vs max connections, replication lag (HA / read replicas), failed connection count.
// turbo
kubectl get ingress -n <namespace>
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>
Flag: Services with 0 endpoints (no healthy backends), Ingress with no address, port mismatches.
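Services with no healthy backends show `<none>` in the ENDPOINTS column. A minimal filter, assuming `--no-headers` output with the column order `NAME ENDPOINTS AGE` (the function name is hypothetical):

```shell
# Reads `kubectl get endpoints -n <namespace> --no-headers` rows on stdin and
# prints the names of Endpoints objects with no addresses.
flag_empty_endpoints() {
  awk '$2 == "<none>" { print $1 }'
}

# Read-only usage:
#   kubectl get endpoints -n <namespace> --no-headers | flag_empty_endpoints
```

Each name printed matches a Service of the same name, so the next step is usually to compare that Service's selector with the labels on the pods it should target.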
// turbo
kubectl get events -n <namespace> --sort-by='.lastTimestamp' --field-selector type!=Normal
Record: Warning / Error events — especially FailedScheduling, OOMKilled, FailedMount, Unhealthy, BackOff.
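When there are many events, a count per reason makes the dominant failure mode easier to see. This sketch assumes one event reason per input line, produced by a jsonpath query; `count_event_reasons` is a hypothetical name.

```shell
# Reads one event reason per line on stdin and prints "count reason",
# most frequent first. Blank lines are ignored.
count_event_reasons() {
  awk 'NF { count[$1]++ } END { for (r in count) print count[r], r }' | sort -rn
}

# Read-only producer for the input:
#   kubectl get events -n <namespace> --field-selector type!=Normal \
#     -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}' | count_event_reasons
```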
Observability methodology (Golden Signals / RED / USE / Distributed Tracing) — see @observability-methods. Pick a named method per problem class, then cross-reference SLI metrics, error-budget burn, and active alerts against that method's signals.
Telemetry stack patterns + per-vendor queries — see @telemetry-stacks. Identify the stack from CLAUDE.md, helm charts, prometheus-operator CRDs, or the OTel collector config, then query directly using the vendor patterns documented there.
Using the applied role's expertise:
SLO impact (if Agent(sre-engineer) is active): error budget consumed, burn rate.
Structure the diagnosis:
## Production Environment Summary
- Cloud: [GCP/Azure/AWS] | Project/Sub/Account: [id] | Cluster: [name] ([location])
- Nodes: [count] ([healthy]/[total]) | K8s: [version] | Managed DB: [instance] ([state], [tier])
## SLO Impact (if applicable)
- SLO: [target] | Current: [actual] | Error budget: [remaining]% | Burn rate: [Nx] | Time to exhaustion: [duration]
## Findings
### [Issue 1: title]
- **Symptom / Evidence / Root cause / Severity (P1–P4) / Blast radius**
## Recommendations
1. [Immediate mitigation] — [command] ⚠️ REQUIRES APPROVAL
2. [Root cause fix] — [change description]
3. [Prevention] — [long-term improvement]
## Environment Health: [HEALTHY | DEGRADED | OUTAGE]
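The burn rate and time-to-exhaustion fields in the template can be filled with simple arithmetic. The sketch below assumes the common SRE definition: burn rate is the observed error rate divided by the error rate the SLO allows, and time to exhaustion is the remaining budget spread over the SLO window at that burn rate. The function names and the 30-day window in the example are assumptions.

```shell
# burn_rate <slo_percent> <observed_error_percent>
# Example: SLO 99.9 allows 0.1% errors; 0.5% observed burns budget 5x too fast.
burn_rate() {
  awk -v slo="$1" -v err="$2" 'BEGIN { printf "%.1f\n", err / (100 - slo) }'
}

# hours_to_exhaustion <remaining_budget_percent> <window_days> <burn_rate>
# Remaining budget expressed in window-hours, divided by the burn rate.
hours_to_exhaustion() {
  awk -v rem="$1" -v days="$2" -v burn="$3" \
    'BEGIN { printf "%.1f\n", (rem / 100) * days * 24 / burn }'
}

# Example: 60% of a 30-day budget left, burning at 5x.
#   burn_rate 99.9 0.5            # prints 5.0
#   hours_to_exhaustion 60 30 5   # prints 86.4
```

At a burn rate of 1.0 the budget lasts exactly the SLO window, so any sustained value above 1.0 means the budget runs out early.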
⚠️ All production mutations require explicit user approval.
/analyze-local), deploy via normal CI/CD.
Agent(devops-engineer), never direct apply.
After any fix, re-run relevant Step 4 commands to verify resolution.
/bugfix (production environment diagnostics)
Agent(sre-engineer), Agent(devops-engineer)
@cloud-platforms (platform-specific CLI commands, managed service diagnostics), @observability-methods (Golden Signals / RED / USE / Tracing), @telemetry-stacks (Prometheus / Datadog / Honeycomb / New Relic / Sentry / OTel queries)