aws-cost-optimization/SKILL.md
AWS cost optimization and FinOps workflows. Use for finding unused resources, analyzing Reserved Instance opportunities, detecting cost anomalies, rightsizing instances, evaluating Spot instances, migrating to newer generation instances, implementing FinOps best practices, optimizing storage/network/database costs, and managing cloud financial operations. Includes automated analysis scripts and comprehensive reference documentation.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add ahmedasmar/devops-claude-skills aws-cost-finops
Systematic workflows for AWS cost optimization and financial operations management.
Use this skill when you need to:
Follow this systematic approach for AWS cost optimization:
┌─────────────────────────────────────────────┐
│  1. DISCOVER                                │
│    What are we spending money on?           │
│    Run: find_unused_resources.py            │
│    Run: cost_anomaly_detector.py            │
└─────────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────┐
│  2. ANALYZE                                 │
│    Where are the optimization opportunities?│
│    Run: rightsizing_analyzer.py             │
│    Run: detect_old_generations.py           │
│    Run: spot_recommendations.py             │
│    Run: analyze_ri_recommendations.py       │
└─────────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────┐
│  3. PRIORITIZE                              │
│    What should we optimize first?           │
│    - Quick wins (low risk, high savings)    │
│    - Low-hanging fruit (easy to implement)  │
│    - Strategic improvements                 │
└─────────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────┐
│  4. IMPLEMENT                               │
│    Execute optimization actions             │
│    - Delete unused resources                │
│    - Rightsize instances                    │
│    - Purchase commitments                   │
│    - Migrate to new generations             │
└─────────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────┐
│  5. MONITOR                                 │
│    Verify savings and track metrics         │
│    - Monthly cost reviews                   │
│    - Tag compliance monitoring              │
│    - Budget variance tracking               │
└─────────────────────────────────────────────┘
Frequency: Run monthly (first week of each month)
Step 1: Find Unused Resources
# Scan for waste across all resources
python3 scripts/find_unused_resources.py
# Expected output:
# - Unattached EBS volumes
# - Old snapshots
# - Unused Elastic IPs
# - Idle NAT Gateways
# - Idle EC2 instances
# - Unused load balancers
# - Estimated monthly savings
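As a rough illustration of the first item in that list, the sketch below filters volume records for the `available` state (the EBS state of a volume that is not attached to any instance) and estimates what they cost per month. It is not the code in `find_unused_resources.py`. The record shape mirrors the `Volumes` entries returned by boto3's `describe_volumes`, and the per-GB prices are placeholder assumptions to replace with current pricing for your region.

```python
# Hypothetical per-GB-month prices; replace with current prices for your region.
ASSUMED_PRICE_PER_GB_MONTH = {"gp2": 0.10, "gp3": 0.08}
DEFAULT_PRICE_PER_GB_MONTH = 0.10  # assumed fallback for volume types not listed above


def unattached_volumes(volumes):
    """Return volumes in the 'available' state, i.e. not attached to any instance."""
    return [v for v in volumes if v.get("State") == "available"]


def estimated_monthly_cost(volumes):
    """Sum Size (GiB) x assumed price for each volume, rounded to cents."""
    total = 0.0
    for v in volumes:
        price = ASSUMED_PRICE_PER_GB_MONTH.get(
            v.get("VolumeType"), DEFAULT_PRICE_PER_GB_MONTH
        )
        total += v["Size"] * price
    return round(total, 2)


# Sample records shaped like the 'Volumes' entries from describe_volumes.
sample = [
    {"VolumeId": "vol-aaa", "State": "available", "Size": 100, "VolumeType": "gp2"},
    {"VolumeId": "vol-bbb", "State": "in-use", "Size": 500, "VolumeType": "gp3"},
    {"VolumeId": "vol-ccc", "State": "available", "Size": 50, "VolumeType": "gp3"},
]

idle = unattached_volumes(sample)
print([v["VolumeId"] for v in idle])
print(estimated_monthly_cost(idle))
```

With live data, `sample` would be replaced by the `Volumes` list from a `describe_volumes` call made with your own credentials.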
Step 2: Analyze Cost Anomalies
# Detect unusual spending patterns
python3 scripts/cost_anomaly_detector.py --days 30
# Expected output:
# - Cost spikes and anomalies
# - Top cost drivers
# - Period-over-period comparison
# - 30-day forecast
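To make "cost spikes" and "period-over-period comparison" concrete, here is a minimal sketch over a plain list of daily costs. The z-score rule and the threshold of 2.0 are assumptions chosen for illustration; the source does not say which method `cost_anomaly_detector.py` uses.

```python
from statistics import mean, stdev


def find_cost_spikes(daily_costs, threshold=2.0):
    """Return (day_index, cost) pairs more than `threshold` sample standard
    deviations above the mean of the window."""
    if len(daily_costs) < 2:
        return []
    avg = mean(daily_costs)
    sd = stdev(daily_costs)
    if sd == 0:
        return []  # perfectly flat spend: nothing to flag
    return [(i, c) for i, c in enumerate(daily_costs) if (c - avg) / sd > threshold]


def period_over_period_change(daily_costs):
    """Percent change of the second half of the window relative to the first half.

    Returns None when the first half sums to zero.
    """
    half = len(daily_costs) // 2
    previous = sum(daily_costs[:half])
    current = sum(daily_costs[half:half * 2])
    if previous == 0:
        return None
    return (current - previous) / previous * 100


# Nine ordinary days followed by one expensive day.
costs = [100, 100, 100, 100, 100, 100, 100, 100, 100, 300]
print(find_cost_spikes(costs))
print(period_over_period_change([100, 100, 200, 200]))
```

A single large spike also inflates the standard deviation, so a median-based rule may be more robust on short windows.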
Step 3: Identify Rightsizing Opportunities
# Find oversized instances
python3 scripts/rightsizing_analyzer.py --days 30
# Expected output:
# - EC2 instances with low utilization
# - RDS instances with low utilization
# - Recommended smaller instance types
# - Estimated savings
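The "low utilization" test can be sketched as a check on CPU datapoints. The thresholds below (average under 10%, peak under 40%) are assumptions for illustration, not the values used by `rightsizing_analyzer.py`. The datapoint shape follows the `Datapoints` entries returned by CloudWatch `get_metric_statistics` when the `Average` and `Maximum` statistics are requested.

```python
def summarize_cpu(datapoints):
    """Collapse CloudWatch-style datapoints into (mean of averages, peak maximum)."""
    if not datapoints:
        return None
    avg = sum(d["Average"] for d in datapoints) / len(datapoints)
    peak = max(d["Maximum"] for d in datapoints)
    return avg, peak


def is_rightsizing_candidate(datapoints, avg_limit=10.0, peak_limit=40.0):
    """True when both average and peak CPU stay under the assumed limits.

    Instances with no datapoints are not flagged, since missing metrics are
    not evidence of low utilization.
    """
    summary = summarize_cpu(datapoints)
    if summary is None:
        return False
    avg, peak = summary
    return avg < avg_limit and peak < peak_limit


quiet = [{"Average": 4.0, "Maximum": 12.0}, {"Average": 6.0, "Maximum": 20.0}]
bursty = [{"Average": 5.0, "Maximum": 95.0}]
print(is_rightsizing_candidate(quiet))
print(is_rightsizing_candidate(bursty))
```

Checking the peak as well as the average keeps bursty workloads from being downsized on the strength of a low mean. CPU alone says nothing about memory pressure, which CloudWatch does not report for EC2 unless the CloudWatch agent is installed.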
Step 4: Generate Monthly Report
# Use the template to compile findings
cp assets/templates/monthly_cost_report.md reports/$(date +%Y-%m)-cost-report.md
# Fill in:
# - Findings from scripts
# - Action items
# - Team cost breakdowns
# - Optimization wins
Step 5: Team Review Meeting
When: Quarterly or when usage patterns stabilize
Step 1: Analyze Current Usage
# Identify workloads suitable for commitments
python3 scripts/analyze_ri_recommendations.py --days 60
# Looks for:
# - EC2 instances running consistently for 60+ days
# - RDS instances with stable usage
# - Calculates ROI for 1yr vs 3yr commitments
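The ROI comparison comes down to simple arithmetic: On-Demand is billed only for hours used, while a commitment is billed for every hour of the term. The sketch below works that through with hypothetical hourly rates; it is not the calculation in `analyze_ri_recommendations.py`, and real rates depend on instance type, region, and payment option.

```python
HOURS_PER_YEAR = 8760


def commitment_savings(on_demand_hourly, committed_hourly, upfront=0.0,
                       term_years=1, utilization=1.0):
    """Return (savings, savings_pct) of a commitment versus On-Demand over the term.

    utilization is the fraction of the term the instance would actually run.
    """
    if utilization <= 0:
        raise ValueError("utilization must be greater than 0")
    hours = HOURS_PER_YEAR * term_years
    on_demand_cost = on_demand_hourly * hours * utilization
    committed_cost = upfront + committed_hourly * hours  # paid whether used or not
    savings = on_demand_cost - committed_cost
    return savings, savings / on_demand_cost * 100


def break_even_utilization(on_demand_hourly, committed_hourly, upfront=0.0,
                           term_years=1):
    """Fraction of the term the instance must run for the commitment to pay off."""
    hours = HOURS_PER_YEAR * term_years
    return (upfront + committed_hourly * hours) / (on_demand_hourly * hours)


# Hypothetical rates, for illustration only.
one_year = commitment_savings(0.10, 0.065, term_years=1)
three_year = commitment_savings(0.10, 0.045, term_years=3)
print(f"1yr: ${one_year[0]:.2f} saved ({one_year[1]:.0f}%)")
print(f"3yr: ${three_year[0]:.2f} saved ({three_year[1]:.0f}%)")
print(f"1yr break-even utilization: {break_even_utilization(0.10, 0.065):.0%}")
```

Passing `utilization=0.5` to the one-year case turns the saving negative, which is why the 60-day consistency check matters before committing.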
Step 2: Review Recommendations
Evaluate each recommendation:
✅ Good candidate if:
- Running 24/7 for 60+ days
- Workload is stable and predictable
- No plans to change architecture
- Savings > 30%
❌ Poor candidate if:
- Workload is variable or experimental
- Architecture changes planned
- Instance type may change
- Dev/test environment
Step 3: Choose Commitment Type
Reserved Instances:
Savings Plans:
Decision Matrix:
Known instance type, won't change → Standard RI
May need to change types → Convertible RI or Compute SP
Variable workloads → Compute Savings Plan
Maximum flexibility → Compute Savings Plan
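The decision matrix can be written as a small function, which is handy when screening many recommendations at once. The function name and its two boolean inputs are hypothetical; the returned labels are the ones used in the matrix.

```python
def choose_commitment(instance_type_fixed, workload_variable=False):
    """Map the decision matrix to a recommendation string.

    instance_type_fixed: the instance type is known and will not change.
    workload_variable: usage varies, or maximum flexibility is wanted.
    """
    if workload_variable:
        return "Compute Savings Plan"
    if instance_type_fixed:
        return "Standard RI"
    return "Convertible RI or Compute Savings Plan"


print(choose_commitment(instance_type_fixed=True))
print(choose_commitment(instance_type_fixed=False))
print(choose_commitment(instance_type_fixed=True, workload_variable=True))
```

Variability is checked first because a variable workload rules out a Standard RI even when the instance type itself is settled.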
Step 4: Purchase and Track
Reference: See references/best_practices.md for detailed commitment strategies
When: During architecture reviews or optimization sprints
Step 1: Detect Old Instances
# Find outdated instance generations
python3 scripts/detect_old_generations.py
# Identifies:
# - t2 → t3 migrations (10% savings)
# - m4 → m5 → m6i migrations
# - Intel → Graviton opportunities (20% savings)
Step 2: Prioritize Migrations
Quick Wins (Low Risk):
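t2 → t3: Near drop-in replacement (t3 is Nitro-based, so confirm the AMI has ENA and NVMe drivers), about 10% savings
m4 → m5: Better performance, about 5% savings (same Nitro driver check applies)
gp2 → gp3: No downtime, up to 20% savings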
Medium Effort (Test Required):
x86 → Graviton (ARM64): 20% savings
- Requires ARM64 compatibility testing
- Most modern frameworks support ARM64
- Test in staging first
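To turn the prioritization above into numbers, the sketch below applies the savings rates quoted in this step to a monthly spend per migration path and sorts quick wins ahead of migrations that need testing. The spend figures are made up, and the rates are rough planning figures rather than guaranteed discounts.

```python
# Savings rates quoted in this step; treat them as rough planning figures.
SAVINGS_RATE = {
    "t2 -> t3": 0.10,
    "m4 -> m5": 0.05,
    "gp2 -> gp3": 0.20,
    "x86 -> Graviton": 0.20,
}
NEEDS_TESTING = {"x86 -> Graviton"}  # medium effort: ARM64 compatibility testing


def rank_migrations(monthly_spend):
    """Return (migration, estimated_monthly_saving, effort) tuples.

    Quick wins come first; within each group, larger savings come first.
    """
    rows = []
    for migration, spend in monthly_spend.items():
        saving = round(spend * SAVINGS_RATE[migration], 2)
        effort = "test required" if migration in NEEDS_TESTING else "quick win"
        rows.append((migration, saving, effort))
    return sorted(rows, key=lambda r: (r[2] != "quick win", -r[1]))


# Hypothetical current monthly spend on the resources each migration would touch.
spend = {
    "t2 -> t3": 1000,
    "x86 -> Graviton": 5000,
    "gp2 -> gp3": 400,
    "m4 -> m5": 3000,
}
for migration, saving, effort in rank_migrations(spend):
    print(f"{migration:<16} ${saving:>8.2f}/month  ({effort})")
```

Graviton ends up last despite the largest saving because the ordering puts risk before size; sorting by saving alone is a one-line change to the sort key.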
Step 3: Execute Migration
For EC2 (x86 to x86):
For Graviton Migration:
Step 4: Validate Savings
Reference: See references/best_practices.md → Compute Optimization
When: For fault-tolerant workloads or Auto Scaling Groups
Step 1: Identify Candidates
# Analyze workloads for Spot suitability
python3 scripts/spot_recommendations.py
# Evaluates:
# - Instances in Auto Scaling Groups (good candidates)
# - Dev/test/staging environments
# - Batch processing workloads
# - CI/CD and build servers
Step 2: Assess Suitability
Excellent for Spot:
NOT suitable for Spot:
Step 3: Implementation Strategy
Option 1: Fargate Spot (Easiest)
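# ECS: the task definition declares Fargate compatibility
requiresCompatibilities:
  - FARGATE
# The capacity provider strategy is set on the service (or as the cluster default), not in the task definition
capacityProviderStrategy:
  - capacityProvider: FARGATE_SPOT
    weight: 70 # ~70% of tasks on Spot
  - capacityProvider: FARGATE
    weight: 30 # ~30% of tasks On-Demand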
Option 2: EC2 Auto Scaling with Spot
# Mixed instances policy
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 2
    OnDemandPercentageAboveBaseCapacity: 30
    SpotAllocationStrategy: capacity-optimized
  LaunchTemplate:
    Overrides:
      - InstanceType: m5.large
      - InstanceType: m5a.large
      - InstanceType: m5n.large
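The two On-Demand settings interact: the base capacity is filled with On-Demand first, and the percentage applies only to capacity above that base. The sketch below works through the arithmetic for the values shown. Rounding the On-Demand share up is an assumption made here; check the Auto Scaling documentation for the exact rounding behavior.

```python
import math


def split_capacity(desired, on_demand_base=2, on_demand_pct_above_base=30):
    """Return (on_demand, spot) instance counts for a desired capacity.

    Assumption: the On-Demand share above the base is rounded up.
    """
    base = min(desired, on_demand_base)
    above_base = desired - base
    on_demand_above = math.ceil(above_base * on_demand_pct_above_base / 100)
    on_demand = base + on_demand_above
    return on_demand, desired - on_demand


for desired in (1, 2, 5, 12):
    on_demand, spot = split_capacity(desired)
    print(f"desired={desired:>2}  on-demand={on_demand}  spot={spot}")
```

At a desired capacity of 12, the group holds 2 base On-Demand instances plus 30% of the remaining 10, so 5 On-Demand and 7 Spot. The overall Spot share is lower than the 70% the percentage alone suggests.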
Option 3: EC2 Spot Fleet
# Create Spot Fleet with diverse instance types
aws ec2 request-spot-fleet --spot-fleet-request-config file://spot-fleet.json
Step 4: Implement Interruption Handling
# Handle the 2-minute interruption notice
# Instance metadata: http://169.254.169.254/latest/meta-data/spot/instance-action
# (returns HTTP 404 until an interruption is scheduled)
# In the application:
1. Poll for the interruption notice
2. Shut down gracefully (save state)
3. Drain connections
4. Exit
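The polling step can be sketched with the standard library alone. The example requests an IMDSv2 session token, then reads `spot/instance-action`, which returns HTTP 404 until an interruption is scheduled and a small JSON document with `action` and `time` fields afterwards. The function names, the 5-second interval, and the `on_notice` callback are illustrative choices, and the network call only works on an EC2 instance.

```python
import json
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254"
ACTION_PATH = "/latest/meta-data/spot/instance-action"


def parse_instance_action(body):
    """Return (action, time) from the instance-action JSON, or None if there is no body."""
    if not body:
        return None
    doc = json.loads(body)
    return doc["action"], doc["time"]


def fetch_instance_action(timeout=2):
    """Fetch the notice via IMDSv2. Returns the raw body, or None on HTTP 404."""
    token_req = urllib.request.Request(
        IMDS + "/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
    req = urllib.request.Request(
        IMDS + ACTION_PATH, headers={"X-aws-ec2-metadata-token": token}
    )
    try:
        return urllib.request.urlopen(req, timeout=timeout).read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def wait_for_interruption(on_notice, fetch=fetch_instance_action,
                          poll_seconds=5, max_polls=None):
    """Poll until a notice appears, call on_notice(action, time), and return the notice.

    Returns None if max_polls is reached without a notice.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        notice = parse_instance_action(fetch())
        if notice is not None:
            on_notice(*notice)
            return notice
        polls += 1
        time.sleep(poll_seconds)
    return None
```

In an application, `on_notice` would save state, drain connections, and exit, and `wait_for_interruption` would typically run in a background thread. The `fetch` parameter exists so the loop can be exercised off-instance with a stub.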
Reference: See references/best_practices.md → Compute Optimization → Spot Instances
ls scripts/
# find_unused_resources.py
# analyze_ri_recommendations.py
# detect_old_generations.py
# spot_recommendations.py
# rightsizing_analyzer.py
# cost_anomaly_detector.py
Monthly Review (Run all):
python3 scripts/find_unused_resources.py
python3 scripts/cost_anomaly_detector.py --days 30
python3 scripts/rightsizing_analyzer.py --days 30
Quarterly Optimization:
python3 scripts/analyze_ri_recommendations.py --days 60
python3 scripts/detect_old_generations.py
python3 scripts/spot_recommendations.py
Specific Region Only:
python3 scripts/find_unused_resources.py --region us-east-1
python3 scripts/rightsizing_analyzer.py --region us-west-2
Named AWS Profile:
python3 scripts/find_unused_resources.py --profile production
python3 scripts/cost_anomaly_detector.py --profile production --days 60
# Install dependencies
pip install boto3 tabulate
# AWS credentials required
# Configure via: aws configure
# Or use: --profile PROFILE_NAME
Key Actions:
Reference: references/best_practices.md → Compute Optimization
Key Actions:
Reference: references/best_practices.md → Storage Optimization
Key Actions:
Reference: references/best_practices.md → Network Optimization
Key Actions:
Reference: references/best_practices.md → Database Optimization
Need help choosing between services?
Question: "Should I use EC2, Lambda, or Fargate?"
Answer: See references/service_alternatives.md → Compute Alternatives
Question: "Which S3 storage class should I use?"
Answer: See references/service_alternatives.md → Storage Alternatives
Question: "Should I use RDS or Aurora?"
Answer: See references/service_alternatives.md → Database Alternatives
Question: "NAT Gateway vs VPC Endpoint vs NAT Instance?"
Answer: See references/service_alternatives.md → Networking Alternatives
Phase 1: Foundation (Month 1)
Phase 2: Visibility (Months 2-3)
Phase 3: Culture (Ongoing)
Full Guide: See references/finops_governance.md
Week 1: Data Collection
Week 2: Analysis
Week 3: Team Reviews
Week 4: Executive Reporting
Template: See assets/templates/monthly_cost_report.md
Detailed Process: See references/finops_governance.md → Monthly Review Process
Run cost anomaly detection:
python3 scripts/cost_anomaly_detector.py --days 30
Check Cost Explorer for service breakdown
Review CloudTrail for resource creation events
Check for AutoScaling events
Verify no Reserved Instances expired
Follow the optimization workflow:
Run RI analysis:
python3 scripts/analyze_ri_recommendations.py --days 60
Look for:
Run unused resource finder:
python3 scripts/find_unused_resources.py
Safe to delete (usually):
Always verify with resource owner before deletion!
Detailed References:
references/best_practices.md: Comprehensive optimization strategies
references/service_alternatives.md: Cost-effective service selection
references/finops_governance.md: Organizational FinOps practices
Templates:
assets/templates/monthly_cost_report.md: Monthly reporting template
Scripts:
scripts/ directory with --help for usage
AWS Documentation: