ui/src/skills/cluster-resource-health/SKILL.md
Check Kubernetes cluster health including pod status, node conditions, resource utilization, and pending alerts across EKS clusters. Use when monitoring infrastructure health, investigating capacity issues, or performing cluster audits.
Install this skill globally with one command: `npx skillsauth add cnoe-io/ai-platform-engineering cluster-resource-health`. Works with Claude Code, Cursor, and Windsurf.
Query AWS EKS clusters for node health, pod status, resource utilization, and alerts to produce a cluster health dashboard.
## Cluster Resource Health Report
**Generated**: February 9, 2026
### Cluster Summary
| Cluster | Version | Nodes | Status | Overall Health |
|---------|---------|-------|--------|----------------|
| prod-us-west-2 | 1.29 | 12/12 Ready | Active | HEALTHY |
| staging-us-west-2 | 1.28 | 4/4 Ready | Active | WARNING |
### Resource Utilization (prod-us-west-2)
| Resource | Requested | Allocatable | Utilization |
|----------|-----------|-------------|-------------|
| CPU | 38 cores | 48 cores | 79% |
| Memory | 96 Gi | 128 Gi | 75% |
| Pods | 187 | 440 | 43% |
**Headroom**: Can schedule ~10 more standard pods (1 CPU, 2 Gi each); CPU is the limiting resource
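
The utilization and headroom figures above follow from simple arithmetic on the requested and allocatable values. The sketch below reproduces them under stated assumptions: the function names `utilization_pct` and `pod_headroom` are hypothetical, percentages are rounded half up to match the table, and a "standard pod" is the 1 CPU / 2 Gi size quoted in the report.

```python
def utilization_pct(requested: float, allocatable: float) -> int:
    """Requested share of allocatable capacity as a whole percent (rounds half up)."""
    return int(100 * requested / allocatable + 0.5)


def pod_headroom(free_cpu: float, free_mem_gi: float, free_pod_slots: int,
                 pod_cpu: float = 1.0, pod_mem_gi: float = 2.0) -> int:
    """Number of additional pods that fit, bounded by the scarcest resource."""
    by_cpu = free_cpu // pod_cpu
    by_mem = free_mem_gi // pod_mem_gi
    return int(min(by_cpu, by_mem, free_pod_slots))


# Values taken from the prod-us-west-2 table.
cpu_util = utilization_pct(38, 48)
mem_util = utilization_pct(96, 128)
pod_util = utilization_pct(187, 440)
headroom = pod_headroom(48 - 38, 128 - 96, 440 - 187)
print(cpu_util, mem_util, pod_util, headroom)
```

With these inputs CPU allows 10 more pods and memory allows 16, so CPU sets the limit.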
### Problematic Pods
| Pod | Namespace | Status | Restarts | Node |
|-----|-----------|--------|----------|------|
| payment-api-7d4b8c | production | CrashLoopBackOff | 23 | node-3 |
| data-pipeline-abc | batch | OOMKilled | 5 | node-7 |
| image-proc-xyz | processing | ImagePullBackOff | 0 | node-2 |
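
One way such a table might be assembled is to filter pod records by status and sort by restart count. This is a minimal sketch, not the skill's actual implementation: the record layout, the `PROBLEM_STATUSES` set, the `problematic_pods` helper, and the healthy `web-frontend-123` row are all assumptions made for illustration.

```python
# Statuses treated as problematic; limited to those shown in the report above.
PROBLEM_STATUSES = {"CrashLoopBackOff", "OOMKilled", "ImagePullBackOff"}


def problematic_pods(pods):
    """Return pods in a problem status, most-restarted first."""
    flagged = [p for p in pods if p["status"] in PROBLEM_STATUSES]
    return sorted(flagged, key=lambda p: p["restarts"], reverse=True)


# Sample records mirroring the table, plus one hypothetical healthy pod.
pods = [
    {"name": "image-proc-xyz", "namespace": "processing", "status": "ImagePullBackOff", "restarts": 0},
    {"name": "web-frontend-123", "namespace": "production", "status": "Running", "restarts": 0},
    {"name": "payment-api-7d4b8c", "namespace": "production", "status": "CrashLoopBackOff", "restarts": 23},
    {"name": "data-pipeline-abc", "namespace": "batch", "status": "OOMKilled", "restarts": 5},
]

flagged_names = [p["name"] for p in problematic_pods(pods)]
print(flagged_names)
```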
### Node Health
| Node | Status | CPU Req% | Mem Req% | Pods | Conditions |
|------|--------|----------|----------|------|------------|
| node-1 | Ready | 82% | 71% | 18 | OK |
| node-7 | Ready | 91% | 88% | 22 | MemoryPressure |
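
A node row like those above could be flagged for the capacity-risk list with a simple rule: any non-OK condition, or request percentages at or above a threshold. The sketch below assumes a hypothetical `NodeRow` record and an 85% threshold; the report does not state which threshold the skill uses.

```python
from dataclasses import dataclass


@dataclass
class NodeRow:
    name: str
    cpu_req_pct: int
    mem_req_pct: int
    conditions: str  # "OK" or the name of a reported condition


def at_risk(node: NodeRow, threshold_pct: int = 85) -> bool:
    """Flag a node with a non-OK condition or requests at/above the threshold.

    The 85% default is an illustrative assumption, not a value from the report.
    """
    over_threshold = max(node.cpu_req_pct, node.mem_req_pct) >= threshold_pct
    return node.conditions != "OK" or over_threshold


# Rows taken from the Node Health table.
nodes = [
    NodeRow("node-1", 82, 71, "OK"),
    NodeRow("node-7", 91, 88, "MemoryPressure"),
]

risky = [n.name for n in nodes if at_risk(n)]
print(risky)
```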
### Capacity Risks
1. **HIGH**: node-7 at 91% CPU / 88% memory requested, with MemoryPressure reported - consider scaling node group
2. **MEDIUM**: staging cluster on EKS 1.28 - EOL in 60 days, plan upgrade
3. **LOW**: 3 PVCs at >80% capacity in `data` namespace
### Recommendations
1. **Immediate**: Investigate payment-api CrashLoopBackOff (23 restarts)
2. **Short-term**: Scale prod node group from 12 to 14 nodes (CPU requests at 79% of allocatable)
3. **Planned**: Upgrade staging cluster from EKS 1.28 to 1.29
4. **Optimization**: Right-size data-pipeline pods (OOMKilled - increase memory limit)