plugins/axiom-devops-engineering/skills/cicd-pipeline-architecture/SKILL.md
Use when setting up CI/CD pipelines, experiencing deployment failures, slow feedback loops, or production incidents after deployment - provides deployment strategies, test gates, rollback mechanisms, and environment promotion patterns to prevent downtime and enable safe continuous delivery
Design CI/CD pipelines with deployment verification, rollback capabilities, and zero-downtime strategies from day one.
Core principle: "Deploy to production" is not a single step - it's a sequence of gates, health checks, gradual rollouts, and automated rollback triggers. Skipping these "for speed" causes production incidents.
Use this skill when:
Do NOT skip this for:
Every production pipeline MUST include:
1. Build → 2. Test → 3. Deploy to Staging → 4. Verify Staging → 5. Deploy to Production → 6. Verify Production → 7. Monitor
Missing any stage = production incidents waiting to happen.
Purpose: Compile, package, create artifacts
build:
- Compile code (if applicable)
- Run linters and formatters
- Build container image
- Tag with commit SHA (NOT "latest")
- Push to registry
- Create immutable artifact
Key principle: Build once, deploy everywhere. Same artifact to staging and production.
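A minimal sketch of the tagging rule, assuming a hypothetical registry host and image name (`registry.example.com`, `myapp`) and a helper named `image_ref` that is not part of any real tool. It rejects anything that is not a commit SHA, so a mutable tag such as "latest" cannot slip through:

```python
import re

# 7 to 40 lowercase hex characters: a short or full git commit SHA.
_SHA_RE = re.compile(r"^[0-9a-f]{7,40}$")


def image_ref(registry: str, name: str, commit_sha: str) -> str:
    """Return an image reference tagged with the commit SHA (hypothetical helper)."""
    sha = commit_sha.strip().lower()
    if not _SHA_RE.match(sha):
        raise ValueError(f"not a commit SHA: {commit_sha!r}")
    return f"{registry}/{name}:{sha}"


# Build once: compute the reference a single time and hand the same value
# to the staging and production stages.
ARTIFACT = image_ref("registry.example.com", "myapp", "3f2a9c1d")
```

Passing the computed reference between stages, rather than rebuilding per environment, is what keeps the staging and production artifacts identical.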
Test Pyramid in CI:
      /\
     /E2\         ← Few, critical paths only (5-10 tests)
    /----\
   / Intg \       ← API contracts, DB integration (50-100 tests)
  /--------\
 /   Unit   \     ← Fast, isolated, thorough (100s-1000s)
/____________\
Optimization strategies:
Anti-pattern: "Tests are slow, let's skip some" → Optimize execution, don't remove coverage
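One way to optimize execution without removing coverage is to split the suite across parallel runners. The sketch below assumes each runner knows its own index and the total runner count; `shard_for` and `select_tests` are hypothetical names, and the test IDs are illustrative:

```python
import zlib


def shard_for(test_id: str, total_shards: int) -> int:
    """Map a test to a shard deterministically (same result on every machine)."""
    return zlib.crc32(test_id.encode("utf-8")) % total_shards


def select_tests(test_ids: list[str], shard_index: int, total_shards: int) -> list[str]:
    """Return the subset of tests this runner is responsible for."""
    return [t for t in test_ids if shard_for(t, total_shards) == shard_index]


ALL_TESTS = ["test_login", "test_checkout", "test_search", "test_profile"]
# Every test lands in exactly one shard, so nothing is skipped.
SHARDS = [select_tests(ALL_TESTS, i, 3) for i in range(3)]
```

A stable hash is used instead of the built-in `hash()` because the latter is randomized per process for strings, which would make runners disagree about who owns a test.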
Staging MUST match production:
Deployment process:
1. Run database migrations (with rollback tested)
2. Deploy new version alongside old (blue-green)
3. Run smoke tests
4. Cutover traffic
5. Keep old version running for quick rollback
Automated verification (not manual testing):
verify_staging:
- Health check endpoint returns 200
- Critical API endpoints respond correctly
- Database migrations applied successfully
- Background jobs processing
- External integrations functional
Failure = stop pipeline, do NOT proceed to production.
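The gate above can be sketched as a small runner that turns a set of checks into a process exit code. This assumes each check is a zero-argument callable returning True or False; the check names and the `run_gate` / `gate_exit_code` helpers are hypothetical:

```python
from typing import Callable

Check = Callable[[], bool]


def run_gate(checks: dict[str, Check]) -> list[str]:
    """Run every check and return the names of the ones that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failed.append(name)
    return failed


def gate_exit_code(checks: dict[str, Check]) -> int:
    """Return 0 when all checks pass, 1 otherwise."""
    failed = run_gate(checks)
    for name in failed:
        print(f"FAILED: {name}")
    return 1 if failed else 0
```

In a pipeline step, calling `sys.exit(gate_exit_code(checks))` makes any failed check produce a non-zero exit, which most CI systems treat as "stop here".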
Deployment Strategies (choose one):
Blue-green deployment:
Old (Blue) ← 100% traffic
New (Green) ← deployed, health checked, 0% traffic
→ Switch traffic to Green
→ Keep Blue running for 1 hour for rollback
→ Terminate Blue after monitoring shows Green is stable
Pros: Instant rollback, zero downtime
Cons: Double infrastructure cost during deployment
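A minimal sketch of the Blue/Green switch as a state change, assuming traffic routing can be modeled as "which environment is live". The `Router` class is hypothetical and stands in for whatever load balancer or DNS mechanism actually moves traffic:

```python
from dataclasses import dataclass


@dataclass
class Router:
    live: str = "blue"      # environment receiving 100% of traffic
    standby: str = "green"  # freshly deployed environment at 0% traffic

    def cut_over(self, standby_healthy: bool) -> bool:
        """Switch traffic only if the standby passed its health checks."""
        if not standby_healthy:
            return False
        self.live, self.standby = self.standby, self.live
        return True

    def roll_back(self) -> None:
        """Swap back; this is instant because the old version is still running."""
        self.live, self.standby = self.standby, self.live
```

Rollback is a second swap, not a redeploy, which is why the old environment must stay up until the new one has proven stable.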
Canary deployment:
Old ← 95% traffic
New ← 5% traffic (canary)
→ Monitor error rates, latency for 15 min
→ If healthy: 50% traffic
→ If healthy: 100% traffic
→ If unhealthy: immediate rollback to 100% old
Pros: Gradual risk, early warning
Cons: More complex monitoring
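The canary progression can be sketched as a loop over traffic steps. This assumes two caller-supplied callables, both hypothetical: `set_weight` sets the percentage of traffic on the new version, and `healthy` observes the monitoring window (for example 15 minutes of error rate and latency) and reports the verdict:

```python
from typing import Callable

STEPS = (5, 50, 100)  # percent of traffic on the new version at each stage


def run_canary(set_weight: Callable[[int], None], healthy: Callable[[int], bool]) -> bool:
    """Advance through the traffic steps; roll back on the first unhealthy stage."""
    for percent in STEPS:
        set_weight(percent)
        if not healthy(percent):
            set_weight(0)  # immediate rollback: 100% of traffic on the old version
            return False
    return True
```

Keeping the health decision in a separate callable means the rollout logic stays the same whether the signal comes from a metrics query, synthetic tests, or both.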
Rolling deployment:
Instances: [A, B, C, D, E]
→ Deploy to A, health check
→ Deploy to B, health check
→ Deploy to C, D, E sequentially
If any fails → stop, rollback deployed instances
Pros: No extra infrastructure
Cons: Mixed versions during deployment
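A sketch of the instance-by-instance procedure, assuming three caller-supplied callables (`deploy`, `health_check`, `rollback`) that are hypothetical stand-ins for real infrastructure calls:

```python
from typing import Callable, Iterable


def rolling_deploy(
    instances: Iterable[str],
    deploy: Callable[[str], None],
    health_check: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy one instance at a time; on failure, stop and undo what was deployed."""
    deployed = []
    for inst in instances:
        deploy(inst)
        deployed.append(inst)
        if not health_check(inst):
            # Undo in reverse order, including the instance that just failed.
            for done in reversed(deployed):
                rollback(done)
            return False
    return True
```

Instances that were never reached keep running the old version, so a failed rollout leaves the fleet entirely on the previous release once the rollbacks finish.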
Choose based on:
NEVER: Direct deployment with restart (causes downtime)
Automated post-deployment verification:
verify_production:
- HTTP 200 from health endpoint
- Response time < baseline + 20%
- Error rate < 1%
- Critical user flows functional (synthetic tests)
- Database connections healthy
- Cache hit rates normal
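The numeric thresholds in this list can be evaluated mechanically. The sketch below assumes metrics arrive as a simple record; the `Metrics` fields are hypothetical, and using p95 latency as "response time" is an assumption, not something the checklist specifies:

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    health_status: int     # HTTP status from the health endpoint
    p95_latency_ms: float  # assumption: p95 is the response-time measure
    error_rate: float      # fraction of requests, so 0.01 means 1%


def violations(current: Metrics, baseline_latency_ms: float) -> list[str]:
    """Return a description of every threshold the deployment violates."""
    problems = []
    if current.health_status != 200:
        problems.append("health endpoint did not return 200")
    if current.p95_latency_ms >= baseline_latency_ms * 1.20:
        problems.append("response time is not below baseline + 20%")
    if current.error_rate >= 0.01:
        problems.append("error rate is not below 1%")
    return problems
```

A non-empty result is a natural candidate for an automated rollback trigger, since it needs no human judgment to interpret.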
Auto-rollback triggers:
Observe for 1 hour post-deployment:
Dashboard must show:
1. Write backward-compatible migrations
- Add columns as nullable first
- Create new tables before dropping old
- Add indexes with CONCURRENTLY (Postgres)
2. Deploy application code that works with old AND new schema
3. Run migration
4. Deploy code that uses new schema exclusively
5. Clean up old schema (separate deployment)
This takes 3 deployments, not 1. That's correct.
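To see why "nullable first" keeps old and new code compatible, here is a small self-contained sketch. It uses an in-memory SQLite database purely as a stand-in for the real database, and the `users` table and `email` column are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('ada')")

# Migration: add the new column as nullable, so existing rows stay valid.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# Old application code does not know about the column and still works.
conn.execute("INSERT INTO users (name) VALUES ('grace')")

# New application code writes the column.
conn.execute("INSERT INTO users (name, email) VALUES ('linus', 'linus@example.com')")

rows = conn.execute("SELECT name, email FROM users ORDER BY id").fetchall()
# rows == [('ada', None), ('grace', None), ('linus', 'linus@example.com')]
```

Had the column been added as NOT NULL without a default, the old code's insert would fail the moment the migration ran, which is exactly the coupling the multi-step sequence avoids.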
test_migrations:
- Apply migration to test DB
- Run application tests against migrated schema
- Test rollback (down migration)
- Verify data integrity
Never skip migration rollback testing. You'll need it in production.
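A minimal sketch of a migration round-trip test, again using in-memory SQLite as a stand-in for a real test database. The `UP` and `DOWN` statements and the table names are hypothetical; a real project would load them from its migration tool:

```python
import sqlite3

UP = "CREATE TABLE audit_log (id INTEGER PRIMARY KEY, message TEXT)"
DOWN = "DROP TABLE audit_log"


def table_names(conn: sqlite3.Connection) -> set[str]:
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    return {r[0] for r in rows}


def check_migration_round_trip() -> bool:
    """Apply the migration, roll it back, and confirm schema and data are unchanged."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("INSERT INTO users (name) VALUES ('ada')")
    before_tables = table_names(conn)
    before_rows = conn.execute("SELECT * FROM users").fetchall()

    conn.execute(UP)
    applied = "audit_log" in table_names(conn)
    conn.execute(DOWN)  # the rollback (down migration)

    schema_restored = table_names(conn) == before_tables
    data_intact = conn.execute("SELECT * FROM users").fetchall() == before_rows
    return applied and schema_restored and data_intact
```

Running the application test suite between `UP` and `DOWN` would cover the "tests against migrated schema" step; it is omitted here to keep the sketch short.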
Anti-patterns from baseline:
❌ Hardcoded in workflow:
env:
  DATABASE_URL: postgresql://user:pass@localhost/db
✅ Correct:
env:
  DATABASE_URL: ${{ secrets.DATABASE_URL }}
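On the application side, the injected value can be read from the environment and treated as required. A minimal sketch, assuming a hypothetical helper named `require_env`; the optional `env` argument exists only so the function can be exercised without touching the real environment:

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return a required secret from the environment, failing fast if it is absent."""
    source = os.environ if env is None else env
    value = source.get(name, "")
    if not value:
        # Report the variable name only; never log the secret value itself.
        raise RuntimeError(f"missing required secret: {name}")
    return value
```

Failing at startup when the secret is missing surfaces a misconfigured environment during deployment verification rather than at the first database call.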
Secrets checklist:
Progression:
Developer → CI Tests → Staging → Production
Gates between environments:
Before deploying to production, verify:
| Mistake | Why It's Wrong | Fix |
|---------|---------------|-----|
| "Just restart the service" | Causes downtime, no rollback | Use blue-green or canary deployment |
| "Tests are slow, skip some" | Removes safety net | Parallel execution, smart caching |
| "We'll add staging later" | Production becomes your staging | Create staging first, before production pipeline |
| "Migrations in deployment script" | Can't roll back safely | Backward-compatible migrations, 3-step deployment |
| "Manual verification after deploy" | Slow, error-prone, doesn't scale | Automated health checks and smoke tests |
| "Deploy on main merge" | No gate, broken main can deploy | Require staging verification first |
| Hardcoded database credentials | Security risk, can't rotate | Use secret manager |
| "Single server is fine for now" | Downtime during deployment | Use multiple instances from day one |
| Excuse | Reality |
|--------|---------|
| "This is just an MVP/demo" | MVP pipelines become production pipelines. Build it right once. |
| "Staging is expensive" | Production incidents are more expensive. Staging prevents them. |
| "Blue-green doubles our costs" | Downtime and incidents cost more than temporary double infrastructure. |
| "We'll add rollback later" | You need rollback when a deployment fails. Later = too late. |
| "Health checks are overkill" | Silent failures in production are worse than no deployment. |
| "Migrations always work" | They don't. Test rollbacks before you need them. |
| "Our app is too simple for this" | Deployment complexity isn't about code complexity. |
If you catch yourself thinking:
All of these mean: Your pipeline will cause production incidents.
Related skills:
"Deploy to production" is not one step. It's:
Skipping steps to "move fast" causes incidents. Running every step IS moving fast.