skills/aws-resource-health-diagnose/SKILL.md
Analyze AWS resource health, diagnose issues from CloudWatch logs and metrics, and create a remediation plan for identified problems.
This workflow analyzes a specific AWS resource to assess its health status, diagnose potential issues using CloudWatch logs and metrics, and develop a comprehensive remediation plan for any problems discovered.
Fetch https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ for monitoring and troubleshooting guidance to inform the diagnostic approach.
Locate the target resource using the appropriate AWS CLI command for its type:
# EC2
aws ec2 describe-instances --filters "Name=tag:Name,Values=<name>"
# Lambda
aws lambda get-function --function-name <name>
# RDS
aws rds describe-db-instances --db-instance-identifier <name>
# ECS
aws ecs describe-services --cluster <cluster> --services <name>
# ALB
aws elbv2 describe-load-balancers --names <name>
# DynamoDB
aws dynamodb describe-table --table-name <name>
# SQS
aws sqs get-queue-attributes --queue-url <url> --attribute-names All
# API Gateway (HTTP and WebSocket APIs; for REST APIs use: aws apigateway get-rest-apis)
aws apigatewayv2 get-apis
If multiple matches are found, prompt the user to specify region/account.
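The multiple-match rule above can be checked mechanically before moving on. A minimal sketch, assuming the lookup is run with `--query ... --output text` so that matching IDs come back as whitespace-separated text; the function name `classify_matches` and its three labels are hypothetical, not part of the AWS CLI:

```shell
# classify_matches IDS
# IDS is the whitespace-separated text output of a lookup, for example:
#   aws ec2 describe-instances --filters "Name=tag:Name,Values=<name>" \
#     --query 'Reservations[].Instances[].InstanceId' --output text
# Prints "none", "unique", or "ambiguous" (hypothetical labels).
classify_matches() {
  # Intentionally unquoted: split the text output into one argument per ID.
  set -- $1
  case $# in
    0) echo "none" ;;
    1) echo "unique" ;;
    *) echo "ambiguous" ;;
  esac
}
```

On `ambiguous`, stop and ask for the region or account rather than picking the first match.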
Run service-specific health checks:
# EC2
aws ec2 describe-instance-status --instance-ids <id>
# RDS
aws rds describe-db-instances --db-instance-identifier <name> \
--query 'DBInstances[0].DBInstanceStatus'
# Lambda - error count over 24h (GNU date syntax; on macOS/BSD use: date -u -v-24H +%Y-%m-%dT%H:%M:%SZ)
aws cloudwatch get-metric-statistics --namespace AWS/Lambda \
--metric-name Errors --dimensions Name=FunctionName,Value=<name> \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 3600 --statistics Sum
# ECS
aws ecs describe-services --cluster <cluster> --services <name> \
--query 'services[0].[status,runningCount,desiredCount,pendingCount]'
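The Lambda check above returns hourly sums of the `Errors` metric, which is a count rather than a rate. A minimal sketch of the conversion, assuming a second `get-metric-statistics` call with `--metric-name Invocations` over the same window supplies the denominator; `error_rate` is a hypothetical helper name:

```shell
# error_rate ERRORS INVOCATIONS
# Prints the error rate as a percentage with two decimals.
# Both inputs are Sum statistics taken over the same time window.
error_rate() {
  awk -v e="$1" -v n="$2" 'BEGIN {
    if (n <= 0) { print "n/a"; exit }   # no traffic: the rate is undefined
    printf "%.2f\n", (e / n) * 100
  }'
}
```

For example, `error_rate 5 200` prints `2.50`. A window with zero invocations prints `n/a` instead of dividing by zero.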
Key health indicators by service type:
Find log groups and run CloudWatch Logs Insights queries:
# Find log groups
aws logs describe-log-groups --log-group-name-prefix /aws/<service>/<name>
# Start a query (last 24h errors)
aws logs start-query \
--log-group-name /aws/lambda/<name> \
--start-time $(date -u -d '24 hours ago' +%s) \
--end-time $(date -u +%s) \
--query-string 'filter @message like /ERROR/ | stats count(*) as errorCount by bin(1h)'
# Get results
aws logs get-query-results --query-id <id>
# Lambda cold starts
aws logs start-query \
--log-group-name /aws/lambda/<name> \
--start-time $(date -u -d '24 hours ago' +%s) \
--end-time $(date -u +%s) \
--query-string 'filter @type = "REPORT" | filter @initDuration > 0 | stats count() as coldStarts by bin(1h)'
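# RDS Performance Insights (if enabled)
# --identifier takes the instance's DbiResourceId (db-XXXXXXXX), not its name; look it up with:
# aws rds describe-db-instances --db-instance-identifier <name> --query 'DBInstances[0].DbiResourceId'
aws pi get-resource-metrics \
--service-type RDS --identifier <dbi-resource-id> \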
--metric-queries '[{"Metric":"db.load.avg"}]' \
--start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period-in-seconds 3600
Identify: recurring error patterns, correlation with deployments (CloudTrail), performance trends, dependency failures.
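`start-query` returns a query ID immediately and the query runs asynchronously, so `get-query-results` may need to be called more than once before results are ready. A minimal polling sketch; `wait_for_query` and its attempt and delay defaults are assumptions for illustration:

```shell
# wait_for_query QUERY_ID [MAX_ATTEMPTS] [SLEEP_SECONDS]
# Polls until the query completes, then prints the full result document.
wait_for_query() {
  query_id=$1
  max_attempts=${2:-30}
  delay=${3:-2}
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    # Ask only for the status field so it is easy to branch on.
    status=$(aws logs get-query-results --query-id "$query_id" \
      --query 'status' --output text) || return 1
    case "$status" in
      Complete)
        aws logs get-query-results --query-id "$query_id"
        return 0 ;;
      Failed|Cancelled|Timeout)
        echo "query ended with status: $status" >&2
        return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "query still running after $max_attempts attempts" >&2
  return 1
}
```

Pass it the `queryId` value returned by `aws logs start-query`.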
Severity:
Root Cause Categories:
Immediate Actions (Critical):
# Lambda throttling — increase reserved concurrency
aws lambda put-function-concurrency \
--function-name <name> --reserved-concurrent-executions 100
# RDS connection exhaustion - reboot to reset connections (causes a brief outage)
aws rds reboot-db-instance --db-instance-identifier <name>
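Both commands above change live resources, so it can help to preview them before running anything. A minimal sketch of a dry-run wrapper; the `DRY_RUN` variable and the `run_remediation` name are hypothetical and not part of the AWS CLI:

```shell
# Preview by default; set DRY_RUN=0 to actually execute.
DRY_RUN=${DRY_RUN:-1}

# run_remediation COMMAND [ARGS...]
# Prints the command while DRY_RUN=1, otherwise runs it unchanged.
run_remediation() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Example (prints the command instead of rebooting while DRY_RUN=1):
#   run_remediation aws rds reboot-db-instance --db-instance-identifier <name>
```

Defaulting to preview mode means a mutating command only runs after an explicit opt-in.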
Short-term Fixes (High/Medium): Configuration adjustments, right-sizing, CloudWatch alarm improvements, IAM corrections.
Long-term Improvements: Architectural changes for resilience, preventive monitoring, and AWS Health Dashboard notifications via EventBridge.
Present findings:
🏥 AWS Resource Health Assessment
📊 Resource Overview:
• Resource: [Name] ([Type])
• Status: [Healthy/Warning/Critical]
• Region: [Region] | Account: [Account ID]
🚨 Issues Identified:
• Critical: X | High: Y | Medium: Z | Low: N
🔍 Top Issues:
1. [Issue]: [Description] — Impact: [High/Medium/Low]
2. [Issue]: [Description] — Impact: [High/Medium/Low]
🛠️ Remediation: X immediate, Y short-term, Z long-term actions
❓ Proceed with detailed remediation plan? (y/n)
Then generate a full markdown report covering: health metrics, issues with root cause analysis, phased remediation steps with AWS CLI commands, CloudWatch alarm recommendations, and validation checklist.
Prerequisites: AWS CLI configured (aws configure) with IAM permissions for the services queried (logs:*, cloudwatch:*, pi:*).