skills/capacity-planning/SKILL.md
Produce a capacity planning document for a service covering traffic forecasts, resource requirements, and scaling strategy. Use when asked to plan infrastructure capacity, forecast resource needs, model traffic growth, define scaling strategy, or produce a capacity review for a service. Produces a structured capacity plan covering current baseline metrics, growth projections, resource requirements per tier, scaling strategy, cost projections, capacity triggers, and an infrastructure action roadmap.
Install this skill globally with one command: `npx skillsauth add mohitagw15856/pm-claude-skills capacity-planning`. Works with Claude Code, Cursor, and Windsurf.
Produce a complete capacity planning document for a service. Capacity planning is not about predicting the future exactly — it is about understanding current headroom, modelling growth, and ensuring the team takes infrastructure action before a constraint becomes an incident.
A good capacity plan answers: what is running out first, how long before it runs out, what does it cost to fix it, and who decides when to act.
Ask for these if not already provided:
Service: [Name] | Team: [Team name]
Author: [Name] | Last updated: [Date]
Planning horizon: [12 months — [Month Year] to [Month Year]]
Review cadence: [Quarterly]
[3–5 sentences covering: current state, the most critical capacity constraint, the timeline before it becomes a risk, the recommended action, and the cost implication. Written for an engineering manager or VP who needs the key facts without reading the full document.]
Critical finding: [e.g. "The database connection pool will reach 90% utilisation within 6 weeks at current growth. Without action, this will cause request queueing and latency spikes under normal traffic."]
Recommended immediate action: [e.g. "Increase connection pool limit and add a read replica within the next 2 weeks."]
Estimated cost impact: [e.g. "Recommended changes add ~$[X]/month to infrastructure spend."]
All metrics are 30-day averages unless noted. Date captured: [Date]
| Metric | Value | Peak (7-day) | Notes |
|---|---|---|---|
| Requests per second (avg) | [X req/s] | [X req/s] | [Peak time / day of week] |
| Requests per day | [X M/day] | [X M/day] | — |
| Active users (DAU/MAU) | [X] / [X] | — | — |
| [Service-specific metric — e.g. jobs processed/hour] | [X] | [X] | — |
| [Service-specific metric — e.g. GB ingested/day] | [X GB] | [X GB] | — |

| Resource | Current utilisation | Instance type | Count | Notes |
|---|---|---|---|---|
| CPU (avg) | [X%] | [e.g. c5.2xlarge] | [X] | Peak: [X%] |
| Memory (avg) | [X%] | — | — | Peak: [X%] |
| Network egress | [X Mbps] | — | — | — |
| Container / pod count | [X] | [e.g. 2 vCPU / 4 GB] | — | Auto-scaling range: [X–Y] |

| Resource | Current utilisation | Spec | Notes |
|---|---|---|---|
| CPU | [X%] | [e.g. db.r5.2xlarge] | Peak: [X%] |
| Memory | [X%] | [X GB RAM] | — |
| Storage used | [X GB] of [Y GB] ([Z%]) | [X GB provisioned] | Growth: [~X GB/month] |
| IOPS (avg) | [X] of [Y provisioned] | [Y IOPS] | Peak: [X IOPS] |
| Connection pool | [X] of [Y max] ([Z%]) | Max connections: [Y] | [ORM pool size: X] |
| Query P99 latency | [X ms] | — | [Slowest query: X] |
| Read/write ratio | [X%] reads / [Y%] writes | — | — |

| Resource | Current utilisation | Spec | Notes |
|---|---|---|---|
| Memory used | [X GB] of [Y GB] ([Z%]) | [e.g. cache.r6g.large] | Eviction rate: [X%] |
| Hit rate | [X%] | — | Miss rate: [Y%] |
| Connections | [X] | Max: [Y] | — |

| Resource | Current usage | Growth rate | Notes |
|---|---|---|---|
| [S3 / GCS / Blob] | [X GB / TB] | [~X GB/month] | [Lifecycle policies in place? Y/N] |
| Disk (if applicable) | [X GB] of [Y GB] | [~X GB/month] | [RAID / EBS type] |

| Component | Current monthly cost | % of total |
|---|---|---|
| Compute (app servers) | $[X] | [X%] |
| Database | $[X] | [X%] |
| Cache | $[X] | [X%] |
| Storage | $[X] | [X%] |
| CDN / bandwidth | $[X] | [X%] |
| Other ([describe]) | $[X] | [X%] |
| Total | $[X] | 100% |
Unit economics: $[X] per [1,000 requests / 1,000 users / GB processed]
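The unit-economics figure is the total monthly cost divided by the monthly volume of the chosen unit. A minimal sketch, assuming the monthly totals are already known; the function name and the sample figures are hypothetical and not part of the template:

```python
def cost_per_unit(monthly_cost: float, monthly_units: float, per: int = 1000) -> float:
    """Return the cost per `per` units, e.g. per 1,000 requests."""
    if monthly_units <= 0:
        raise ValueError("monthly_units must be positive")
    return monthly_cost / monthly_units * per


if __name__ == "__main__":
    # Hypothetical figures: $12,000/month serving 300 million requests/month.
    print(f"${cost_per_unit(12_000, 300_000_000):.4f} per 1,000 requests")
```

The same function covers the other units in the template (per 1,000 users, per GB processed) by changing `monthly_units` and `per`.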
| Assumption | Value | Source | Confidence |
|---|---|---|---|
| Monthly traffic growth rate | [X%] | [Historical trend / product forecast] | [High / Medium / Low] |
| Seasonal peak factor | [+X% in [month(s)]] | [Last year's data / expected launch] | [High / Medium] |
| Upcoming events | [e.g. Marketing campaign — [Month], expected +[X]% traffic spike] | [Marketing plan] | [Medium] |
| User growth | [X new users/month] | [Sales pipeline / growth model] | [Medium] |
| Data growth | [X GB/month] | [Current trend] | [High] |

| Timeframe | Req/s (avg) | Req/s (peak) | DAU | Data volume (cumulative) |
|---|---|---|---|---|
| Now (baseline) | [X] | [X] | [X] | [X GB/TB] |
| +3 months | [X] | [X] | [X] | [X GB/TB] |
| +6 months | [X] | [X] | [X] | [X GB/TB] |
| +12 months | [X] | [X] | [X] | [X GB/TB] |
Growth formula: [Baseline] × (1 + [monthly rate])^[months] + seasonal adjustment
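The growth formula can be applied directly to fill in the projection rows. A minimal sketch, assuming the seasonal adjustment is an absolute amount in the same unit as the baseline (a percentage peak factor would first need converting); the function name and sample numbers are illustrative only:

```python
def project(baseline: float, monthly_rate: float, months: int,
            seasonal_adjustment: float = 0.0) -> float:
    """Baseline x (1 + monthly rate)^months + seasonal adjustment."""
    return baseline * (1 + monthly_rate) ** months + seasonal_adjustment


if __name__ == "__main__":
    # Hypothetical: 1,000 req/s today, growing 5% per month.
    for months in (0, 3, 6, 12):
        print(f"+{months:>2} months: {project(1000, 0.05, months):,.0f} req/s")
```

Because growth compounds, small changes in the monthly rate move the 12-month figure substantially, which is why the assumptions table asks for a source and a confidence level.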
When does each resource run out at current utilisation and projected growth?
| Resource | Current utilisation | Safe ceiling | Headroom remaining | Months to ceiling |
|---|---|---|---|---|
| App CPU | [X%] | 70% | [X%] | [X months] |
| App memory | [X%] | 80% | [X%] | [X months] |
| DB CPU | [X%] | 70% | [X%] | [X months] |
| DB storage | [X GB] of [Y GB] | 80% = [Z GB] | [X GB] | [X months] |
| DB IOPS | [X] of [Y] | 80% = [Z] | [X IOPS] | [X months] |
| DB connections | [X] of [Y] | 80% = [Z] | [X] | [X months] |
| Cache memory | [X GB] of [Y GB] | 75% = [Z GB] | [X GB] | [X months] |
| Storage (object) | [X TB] | No hard limit — cost trigger | — | [Cost trigger: $X/month] |
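One way to estimate the "Months to ceiling" column is sketched below. It assumes that utilisation of load-driven resources (CPU, connections, IOPS) scales in proportion to traffic and therefore compounds at the monthly growth rate, while storage grows by a roughly fixed amount per month. Both models are simplifying assumptions, and the function names are hypothetical:

```python
import math


def months_to_ceiling_compound(current: float, ceiling: float,
                               monthly_rate: float) -> float:
    """Months until `current` reaches `ceiling` under compound growth."""
    if current >= ceiling:
        return 0.0  # already at or past the ceiling
    if monthly_rate <= 0:
        return math.inf  # no growth: the ceiling is never reached
    return math.log(ceiling / current) / math.log(1 + monthly_rate)


def months_to_ceiling_linear(current: float, ceiling: float,
                             growth_per_month: float) -> float:
    """Months until `current` reaches `ceiling` under fixed monthly growth."""
    if current >= ceiling:
        return 0.0
    if growth_per_month <= 0:
        return math.inf
    return (ceiling - current) / growth_per_month


if __name__ == "__main__":
    # Hypothetical: app CPU at 50%, safe ceiling 70%, traffic growing 5%/month.
    print(f"App CPU: {months_to_ceiling_compound(50, 70, 0.05):.1f} months")
    # Hypothetical: 400 GB used, ceiling 800 GB (80% of 1 TB), +50 GB/month.
    print(f"DB storage: {months_to_ceiling_linear(400, 800, 50):.1f} months")
```

Any resource whose result is under 3 months belongs in the red flags list.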
Red flags (resources hitting ceiling within 3 months):
| Timeframe | Required instances | Recommended instance type | Auto-scaling range | Notes |
|---|---|---|---|---|
| Now | [X] | [type] | [min: X, max: Y] | Current configuration |
| +3 months | [X] | [type] | [min: X, max: Y] | [Any instance type change needed?] |
| +6 months | [X] | [type or upgrade] | [min: X, max: Y] | [Consider [larger type / horizontal scale]] |
| +12 months | [X] | [type or upgrade] | [min: X, max: Y] | [State of horizontal vs vertical decision] |

Memory headroom target: Maintain ≥30% available memory at average load; ≥20% at peak.
CPU headroom target: Maintain ≥30% available CPU at average load; ≥15% at peak.
| Timeframe | Instance type | Storage | IOPS | Read replica | Notes |
|---|---|---|---|---|---|
| Now | [type] | [X GB] | [X] | [Y/N] | Current |
| +3 months | [type] | [X GB] | [X] | [Y/N] | [Upgrade storage / IOPS] |
| +6 months | [type or upgrade] | [X GB] | [X] | Yes | [Read replica recommended by this point] |
| +12 months | [type] | [X GB] | [X] | [X replicas] | [Consider sharding / partitioning at this scale] |
Storage growth management:
| Timeframe | Node type | Nodes | Memory | Notes |
|---|---|---|---|---|
| Now | [type] | [X] | [X GB] | Current |
| +6 months | [type] | [X] | [X GB] | [Scale out or upgrade] |
| +12 months | [type] | [X] | [X GB] | [Cluster mode if >Y GB required] |
Decision: [Horizontal / Vertical / Both]
[State the scaling strategy and the reasoning. E.g. "The application is stateless and CPU-bound; horizontal scaling is preferred. Vertical scaling is a short-term fallback only."]
Auto-scaling configuration:
Scale-out trigger: CPU > [X%] for [Y minutes] OR memory > [X%] for [Y minutes]
Scale-in trigger: CPU < [X%] for [Y minutes] AND memory < [X%] for [Y minutes]
Min instances: [X] (ensures HA across [X] AZs)
Max instances: [Y] (cost ceiling)
Cooldown period: [X seconds]
Warmup time: [X seconds] (time for new instance to be healthy)
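The trigger logic above (OR for scale-out, AND for scale-in, clamped to the min and max instance counts) can be sketched as follows. This is an illustration only: it assumes the CPU and memory figures passed in are already sustained averages over the evaluation window, it steps by one instance at a time, and it does not model cooldown or warmup. The class and function names are hypothetical and do not correspond to any cloud provider's API:

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    scale_out_cpu: float   # scale out when CPU % exceeds this
    scale_out_mem: float   # ... OR memory % exceeds this
    scale_in_cpu: float    # scale in when CPU % is below this
    scale_in_mem: float    # ... AND memory % is below this
    min_instances: int     # floor, e.g. to keep HA across AZs
    max_instances: int     # cost ceiling


def desired_count(policy: ScalingPolicy, current: int,
                  cpu: float, mem: float) -> int:
    """Return the instance count after one scaling evaluation."""
    if cpu > policy.scale_out_cpu or mem > policy.scale_out_mem:
        return min(current + 1, policy.max_instances)
    if cpu < policy.scale_in_cpu and mem < policy.scale_in_mem:
        return max(current - 1, policy.min_instances)
    return current


if __name__ == "__main__":
    # Hypothetical thresholds, not recommendations.
    policy = ScalingPolicy(70, 80, 30, 40, min_instances=3, max_instances=10)
    print(desired_count(policy, current=3, cpu=75, mem=50))  # CPU over threshold
    print(desired_count(policy, current=4, cpu=20, mem=50))  # memory blocks scale-in
```

Requiring both metrics to be low before scaling in, but only one to be high before scaling out, biases the policy towards keeping capacity, which is usually the safer failure mode.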
Limits of horizontal scaling:
Strategy: [Read replica / Connection pooling via PgBouncer / Query caching / None needed yet]
When to add a read replica:
Connection pooling:
Cache policy: [Cache-aside / Write-through / Write-behind]

TTL strategy:

| Data type | TTL | Invalidation method |
|---|---|---|
| [e.g. User profile] | [5 minutes] | [Explicit invalidation on update] |
| [e.g. Product catalog] | [1 hour] | [TTL expiry — eventual consistency acceptable] |
| [e.g. Session data] | [24 hours] | [Explicit invalidation on logout] |
Cache miss handling: [Describe what happens on a cache miss — does it fall through gracefully or cause a thundering herd risk?]
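To make the thundering herd risk concrete, the sketch below shows a cache-aside read path in which only one caller per key loads from the backing store on a miss while the others wait and then reuse the result. It is a single-process illustration under stated assumptions: an in-memory dict stands in for the real cache, and `CacheAside` and `loader` are hypothetical names. A distributed cache would need a different coordination mechanism:

```python
import threading
import time
from typing import Any, Callable, Dict, Optional, Tuple


class CacheAside:
    """In-process cache-aside with one loader call per key per miss."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._data: Dict[str, Tuple[float, Any]] = {}  # key -> (expiry, value)
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the _locks dict

    def _fresh(self, key: str) -> Optional[Tuple[float, Any]]:
        entry = self._data.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry
        return None

    def get(self, key: str, loader: Callable[[str], Any]) -> Any:
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]  # cache hit
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have loaded while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = loader(key)  # only one thread per key reaches this
            self._data[key] = (time.monotonic() + self._ttl, value)
            return value


if __name__ == "__main__":
    cache = CacheAside(ttl_seconds=300)
    print(cache.get("user:1", lambda k: f"profile for {k}"))
```

Without the per-key lock and re-check, every concurrent request that misses would hit the database at once, which is the failure mode this section of the plan should describe.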
| Component | Now (monthly) | +3 months | +6 months | +12 months |
|---|---|---|---|---|
| Compute | $[X] | $[X] | $[X] | $[X] |
| Database | $[X] | $[X] | $[X] | $[X] |
| Cache | $[X] | $[X] | $[X] | $[X] |
| Storage | $[X] | $[X] | $[X] | $[X] |
| CDN / bandwidth | $[X] | $[X] | $[X] | $[X] |
| Total | $[X] | $[X] | $[X] | $[X] |
| MoM growth % | — | [X%] | [X%] | [X%] |
Unit economics trend:
| Timeframe | Cost per 1k requests | Cost per user/month | Notes |
|---|---|---|---|
| Now | $[X] | $[X] | Baseline |
| +6 months | $[X] | $[X] | [Improving / worsening — why] |
| +12 months | $[X] | $[X] | [Target: $X per 1k requests] |
Cost optimisation opportunities:
| Opportunity | Estimated saving | Effort | Timeline |
|---|---|---|---|
| [e.g. Reserved instances for baseline compute] | $[X/month] | Low | Immediate |
| [e.g. S3 lifecycle policy — move objects >90 days to Glacier] | $[X/month] | Low | This sprint |
| [e.g. Right-size [instance] — current is overprovisioned] | $[X/month] | Low | This sprint |
| [e.g. Optimise top-5 slow queries — reduce DB compute need] | $[X/month] | Medium | Next quarter |
Define the thresholds that require explicit action — not retrospective fixes after an incident.
| Resource | Watch (amber) | Act (red — schedule work) | Emergency (incident risk) |
|---|---|---|---|
| App CPU (sustained avg) | >60% | >70% | >85% |
| App memory | >70% | >80% | >90% |
| DB CPU | >55% | >65% | >80% |
| DB storage | >65% | >75% | >85% |
| DB connections | >60% | >70% | >85% |
| Cache memory / eviction | Hit rate <90% | Hit rate <85% | Hit rate <75% |
| Error rate | >0.5% | >1% | >2% |
| P99 latency | >2× baseline | >3× baseline | >5× baseline |
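The utilisation thresholds in this table translate directly into an alert classifier. A minimal sketch covering only the "greater than" percentage rows; the dictionary keys and function name are hypothetical, and the inverted cache hit-rate row and the baseline-relative latency row would need their own comparisons:

```python
# (watch, act, emergency) thresholds in percent, taken from the table above.
THRESHOLDS = {
    "app_cpu": (60, 70, 85),
    "app_memory": (70, 80, 90),
    "db_cpu": (55, 65, 80),
    "db_storage": (65, 75, 85),
    "db_connections": (60, 70, 85),
}


def classify(resource: str, utilisation_pct: float) -> str:
    """Return 'ok', 'watch', 'act', or 'emergency' for a utilisation value."""
    watch, act, emergency = THRESHOLDS[resource]
    if utilisation_pct > emergency:
        return "emergency"
    if utilisation_pct > act:
        return "act"
    if utilisation_pct > watch:
        return "watch"
    return "ok"


if __name__ == "__main__":
    print(classify("app_cpu", 72))
    print(classify("db_cpu", 56))
```

The comparisons are strict, matching the ">" in the table, so a value exactly on a threshold stays in the lower band.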
When a Watch threshold is crossed:
When an Act threshold is crossed:
When an Emergency threshold is crossed:
Emergency scaling runbook: [Link to oncall-runbook for capacity incidents]
| Action | Owner | Effort | Justification |
|---|---|---|---|
| [e.g. Increase DB connection pool limit to X] | [Name] | [2 hours] | [DB connections at X% — hitting ceiling in X weeks] |
| [e.g. Enable storage auto-scaling on RDS] | [Name] | [30 min] | [Storage at X% — prevents emergency at X months] |
| [e.g. Add S3 lifecycle policy for [bucket]] | [Name] | [1 hour] | [Storage growing at $X/month unnecessarily] |

| Action | Owner | Effort | Justification |
|---|---|---|---|
| [e.g. Add read replica to production DB] | [Name] | [1 day] | [DB CPU projected to hit 65% in 2 months] |
| [e.g. Increase max auto-scaling limit from X to Y] | [Name] | [2 hours] | [Current max is too close to expected peak] |
| [e.g. Configure PgBouncer for connection pooling] | [Name] | [3 days] | [Reduce per-connection overhead; headroom for growth] |

| Action | Owner | Effort | Justification |
|---|---|---|---|
| [e.g. Upgrade DB instance class — [current] → [next]] | [Name] | [2 hours — blue/green] | [DB CPU projected to hit 70% by Q[X]] |
| [e.g. Implement caching for [high-read endpoint]] | [Name] | [1 week] | [Reduce DB read load by estimated [X%]] |
| [e.g. Evaluate horizontal DB sharding] | [Name] | [2 weeks (spike)] | [At 12-month projections, single DB hits limits] |

| Action | Description | Trigger condition |
|---|---|---|
| [e.g. Multi-region deployment] | [Active-passive setup in eu-west-2] | [DAU exceeds X or SLA requires 99.99%] |
| [e.g. Database sharding or migration to distributed DB] | [Evaluate CockroachDB / Vitess] | [Single-node DB projected to hit ceiling] |
| [e.g. CDN expansion] | [Add PoPs in [region]] | [Latency SLO breached for [geography]] |