.claude/skills/analyze-prod/SKILL.md
Analyze production environment — collect Kubernetes pod status, managed database health, logs, metrics, networking, and diagnose issues. Supports GCP, Azure, and AWS via the `cloud-platforms` skill. Applies SRE or DevOps roles. Use standalone or as part of production incident investigation.
npx skillsauth add avav25/ai-assets analyze-prod
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Systematic analysis of the production environment. Collects cluster status, pod health, database metrics, logs, and diagnoses issues. Works standalone or as an entry point for production bugfixing.
Cloud platform detection: Read CLAUDE.md to identify which cloud platform is used. Consult cloud-platforms skill and load the corresponding reference module (GCP, Azure, or AWS) for platform-specific CLI commands.
⚠️ SAFETY: This workflow runs READ-ONLY commands only. No mutations (scale, delete, restart, deploy) without explicit user approval at Step 6.
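One way to make the read-only rule mechanical is to route every kubectl call through a small allowlist check. This is a minimal sketch, not part of the skill itself: `is_read_only` and `safe_kubectl` are hypothetical helper names, and the allowlist is an assumption drawn from the commands this workflow uses.

```shell
# Hypothetical helper: succeeds only for kubectl subcommands that do not
# change cluster state. The allowlist is an assumption based on this workflow.
is_read_only() {
  case "$1" in
    get|describe|logs|top|cluster-info) return 0 ;;
    # "config" is allowed only for reading the active context.
    config) [ "${2:-}" = "current-context" ] ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: runs kubectl for allowlisted subcommands and
# refuses everything else (scale, delete, rollout restart, apply, ...).
safe_kubectl() {
  if is_read_only "$@"; then
    kubectl "$@"
  else
    echo "refusing 'kubectl $1': not read-only, needs explicit user approval" >&2
    return 1
  fi
}
```

For example, `safe_kubectl get pods -n <namespace>` runs as normal, while `safe_kubectl delete pod <pod-name>` is refused until the user approves the mutation.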
Ask the user:
If invoked as part of a bugfix/incident flow — extract context from the parent conversation instead of asking.
Select and apply the role based on the problem type:
| Problem Type | Primary Role | Rationale |
|---|---|---|
| Pod crashes, restarts, health check failures | Agent(sre-engineer) | Reliability, K8s troubleshooting, SLO impact |
| High latency, error rate spikes, SLO burn | Agent(sre-engineer) | Observability, SLI/SLO analysis |
| Managed DB issues (connections, replication, CPU) | Agent(sre-engineer) | Database reliability, capacity |
| Networking, ingress, DNS, connectivity | Agent(sre-engineer) + Agent(devops-engineer) | Network diagnostics + infra config |
| Deployment failures, rollback needed | Agent(devops-engineer) | CI/CD, Helm, K8s deployment |
| Terraform/infra drift, resource config | Agent(devops-engineer) | IaC, cloud resource management |
| Application errors visible in logs | Stack-specific role | Agent(java-engineer), Agent(python-engineer), Agent(frontend-engineer) |
| General / unclear | Agent(sre-engineer) | SRE debugging methodology as default for prod |
Announce the applied role(s) to the user. If this is a P1/P2 incident, always apply Agent(sre-engineer).
Before collecting data, establish the environment context:
Detect platform from CLAUDE.md and verify authentication. Consult cloud-platforms skill for platform-specific auth verification commands:
- GCP: `gcloud config get-value project` + `gcloud auth list`
- Azure: `az account show` + `az aks list`
- AWS: `aws sts get-caller-identity` + `aws eks list-clusters`
Confirm the active project/subscription/account matches the user's target.
// turbo
kubectl config current-context
kubectl cluster-info
Also run platform-specific cluster list command from cloud-platforms skill to verify cluster status.
Record: Cluster name, location, version, node count, current kubectl context.
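The context check can be sketched as a guard that stops the run when kubectl points somewhere other than the cluster the user named. `context_matches` and `EXPECTED_CONTEXT` are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: compare the context the user named with the one
# kubectl is actually using.
#   $1 = expected context name, $2 = context reported by kubectl
context_matches() {
  if [ "$1" = "$2" ]; then
    echo "OK: kubectl context is '$2'"
  else
    echo "MISMATCH: expected '$1' but kubectl points at '$2'" >&2
    return 1
  fi
}

# Usage (EXPECTED_CONTEXT is a placeholder you set yourself):
#   context_matches "$EXPECTED_CONTEXT" "$(kubectl config current-context)"
```

A non-zero return is a signal to stop and ask the user before collecting any data.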
Run the following read-only diagnostic commands. Present results as a structured summary.
// turbo
kubectl get nodes -o wide
kubectl top nodes
Flag: Nodes in NotReady state, high CPU/memory utilization (>80%), version skew between nodes.
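The node flags above can be extracted with two small filters over kubectl's plain-text output. This is a sketch under stated assumptions: the helper names are hypothetical, the column order is the default one printed by `kubectl get nodes` and `kubectl top nodes`, and `kubectl top` needs a metrics source such as metrics-server in the cluster.

```shell
# Reads `kubectl get nodes --no-headers` on stdin (NAME STATUS ROLES AGE VERSION).
# Prints nodes whose STATUS does not start with "Ready" (NotReady, Unknown, ...).
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Reads `kubectl top nodes --no-headers` on stdin
# (NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%).
# Prints nodes whose CPU% or MEMORY% exceeds the threshold (default 80).
hot_nodes() {
  awk -v t="${1:-80}" '($3 + 0) > t || ($5 + 0) > t { print $1, $3, $5 }'
}

# Usage:
#   kubectl get nodes --no-headers | not_ready_nodes
#   kubectl top nodes --no-headers | hot_nodes 80
```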
For the affected namespace (or all namespaces if unspecified):
// turbo
kubectl get pods -n <namespace> -o wide --sort-by='.status.startTime'
kubectl get pods -n <namespace> --field-selector=status.phase!=Running
Flag:
- CrashLoopBackOff, Error, Pending, ImagePullBackOff
- […] `kubectl get deployments`)
kubectl describe pod <pod-name> -n <namespace>
kubectl logs --tail=200 --timestamps <pod-name> -n <namespace>
kubectl logs --previous --tail=50 <pod-name> -n <namespace>
Look for: OOMKilled (exit code 137), application exceptions, connection errors, startup failures, readiness probe failures.
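Exit code 137 on its own only says the container received SIGKILL (128 + 9). The `reason` field of the terminated state is what confirms `OOMKilled`. A minimal sketch of decoding exit codes, with `describe_exit_code` as a hypothetical helper name:

```shell
# Decode a container exit code. Codes above 128 conventionally mean the
# process was terminated by signal number (code - 128).
describe_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "application error (exit $code)"
  fi
}

# Usage: read the last termination state of a pod's containers, then decode.
#   kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
#   kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```

For multi-container pods the jsonpath queries print one value per container, separated by spaces.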
// turbo
kubectl top pods -n <namespace> --sort-by=memory
kubectl get hpa -n <namespace>
Flag: Pods near memory limits, HPA at max replicas, CPU throttling.
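The "HPA at max replicas" flag can be checked without reading the table by eye. The sketch below assumes the three-column layout produced by the `custom-columns` query in the usage comment. `hpa_at_max` is a hypothetical helper name.

```shell
# Reads three whitespace-separated columns on stdin: NAME CURRENT MAX.
# Prints HPAs whose current replica count has reached the configured maximum.
hpa_at_max() {
  awk '($2 + 0) >= ($3 + 0) { print $1 " at max (" $2 "/" $3 ")" }'
}

# Usage:
#   kubectl get hpa -n <namespace> --no-headers \
#     -o custom-columns=NAME:.metadata.name,CURRENT:.status.currentReplicas,MAX:.spec.maxReplicas \
#     | hpa_at_max
```

An HPA pinned at its maximum cannot absorb further load, so it is worth correlating with the latency and CPU findings.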
Run platform-specific database diagnostic commands from cloud-platforms skill:
- GCP: `gcloud sql instances list` + `gcloud sql instances describe`
- Azure: `az postgres flexible-server show` or `az sql server list`
- AWS: `aws rds describe-db-instances`
Key metrics (from cloud monitoring or CLI):
// turbo
kubectl get ingress -n <namespace>
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace>
Flag: Services with 0 endpoints (no healthy backends), Ingress with no address, port mismatches.
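Services with no healthy backends show `<none>` in the ENDPOINTS column, so they can be filtered out directly. A minimal sketch, assuming the default column order of `kubectl get endpoints`. `services_without_endpoints` is a hypothetical helper name.

```shell
# Reads `kubectl get endpoints --no-headers` on stdin (NAME ENDPOINTS AGE).
# kubectl prints "<none>" in the ENDPOINTS column when no backend is ready.
services_without_endpoints() {
  awk '$2 == "<none>" { print $1 }'
}

# Usage:
#   kubectl get endpoints -n <namespace> --no-headers | services_without_endpoints
```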
// turbo
kubectl get events -n <namespace> --sort-by='.lastTimestamp' --field-selector type!=Normal
Record: Warning/Error events — especially FailedScheduling, OOMKilled, FailedMount, Unhealthy, BackOff.
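When a namespace is noisy, counting events by reason helps pick out the dominant failure mode. This is a sketch rather than a prescribed step. `count_event_reasons` is a hypothetical helper name.

```shell
# Reads one event reason per line on stdin and prints "count reason",
# most frequent first.
count_event_reasons() {
  sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Usage:
#   kubectl get events -n <namespace> --field-selector type!=Normal \
#     --no-headers -o custom-columns=REASON:.reason | count_event_reasons
```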
Run platform-specific monitoring commands from cloud-platforms skill to list dashboards and check metrics.
Guide the user to check relevant dashboards for:
Using the applied role's expertise:
- […] Agent(sre-engineer) active): error budget consumed, burn rate
<common_prod_issues>
Structure the diagnosis:
## Production Environment Summary
- Cloud: [GCP/Azure/AWS] | Project/Sub/Account: [id] | Cluster: [name] ([location])
- Nodes: [count] ([healthy]/[total]) | K8s version: [version]
- Managed DB: [instance] ([state], [tier])
## SLO Impact (if applicable)
- SLO: [target] | Current: [actual] | Error budget: [remaining]%
- Burn rate: [Nx] | Time to budget exhaustion: [duration]
## Findings
### [Issue 1: title]
- **Symptom**: what was observed
- **Evidence**: specific log lines, metrics, pod status
- **Root cause**: why it's happening
- **Severity**: P1 (outage) / P2 (degraded) / P3 (minor) / P4 (cosmetic)
- **Blast radius**: affected users, services, regions
## Recommendations
1. [Immediate mitigation] — [command] ⚠️ REQUIRES APPROVAL
2. [Root cause fix] — [change description]
3. [Prevention] — [long-term improvement]
## Environment Health: [HEALTHY | DEGRADED | OUTAGE]
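The burn rate and time-to-exhaustion fields in the SLO section of the template can be filled in with simple arithmetic. The sketch below uses the common definition: burn rate is the observed error rate divided by the error budget (100% minus the SLO target). The function names are hypothetical, and the code assumes an SLO below 100% and a non-zero burn rate.

```shell
# burn_rate SLO_PERCENT ERROR_PERCENT
# Burn rate = observed error rate / error budget, where budget = 100 - SLO.
burn_rate() {
  awk -v slo="$1" -v err="$2" 'BEGIN { printf "%.1f\n", err / (100 - slo) }'
}

# hours_to_exhaustion REMAINING_BUDGET_PERCENT WINDOW_HOURS BURN_RATE
# At burn rate 1 a full budget lasts exactly one SLO window, so the
# remaining share of the window is divided by the burn rate.
hours_to_exhaustion() {
  awk -v rem="$1" -v win="$2" -v br="$3" \
    'BEGIN { printf "%.1f\n", (rem / 100) * win / br }'
}

# Example: 99.9% SLO, 0.5% of requests failing, 60% of the budget left,
# and an assumed 30-day (720-hour) SLO window.
#   burn_rate 99.9 0.5             # prints 5.0
#   hours_to_exhaustion 60 720 5   # prints 86.4
```

A burn rate of 5 means the budget is being spent five times faster than the SLO allows, which in this example leaves about three and a half days before exhaustion.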
⚠️ All production mutations require explicit user approval.
Based on the diagnosis:
- […] `/analyze-local`), then deploy through normal CI/CD
- […] Agent(devops-engineer) patterns via PR, never direct apply
After any fix, re-run relevant diagnostic commands from Step 4 to verify resolution.
Present the completed analysis:
- /bugfix (production environment diagnostics)
- Agent(sre-engineer), Agent(devops-engineer)
- cloud-platforms skill (platform-specific CLI commands, managed service diagnostics)