CI/CD Playbook Skill

Produce a complete, actionable CI/CD playbook for a service or team — covering everything a new engineer needs to understand, contribute to, and operate the pipeline safely.

A good playbook is not a diagram. It is a document that answers: what runs, when, why, who owns it, and what to do when it breaks.

Required Inputs

Ask for these if not already provided:

Service name and brief description
Tech stack — language, framework, containerisation (Docker, etc.)
Source control — GitHub / GitLab / Bitbucket, branching strategy
CI platform — GitHub Actions / CircleCI / Jenkins / BuildKite / other
CD platform / deployment target — Kubernetes, ECS, Lambda, Heroku, VMs, etc.
Environments — e.g. dev, staging, production (and any canary / feature environments)
Deployment frequency — how often does the team ship?
Any existing gates — manual approvals, smoke tests, feature flags
On-call setup — who's responsible during deploys?

Output Format

CI/CD Playbook: [Service Name]

Service: [Name] | Team: [Team name]
Last updated: [Date] | Owner: [Name / role]
Pipeline platform: [CI tool] → [CD tool / platform]

Overview

[2–3 sentences describing what this service does and why the CI/CD pipeline is structured the way it is. Include the deployment target and how frequently the team ships.]

Deployment frequency: [Multiple times per day / Daily / Weekly / On-demand]
Average pipeline duration: [X minutes]
Rollback time (p95): [X minutes]

Pipeline Stages

[Branch push]
    │
    ▼
[1. Build & Lint] ──fail──▶ ❌ Block PR
    │
    ▼
[2. Unit Tests] ──fail──▶ ❌ Block PR
    │
    ▼
[3. Integration Tests] ──fail──▶ ❌ Block PR
    │
    ▼
[4. Security Scan] ──fail──▶ ⚠️ [Block / Warn — specify]
    │
    ▼
[5. Build Artefact / Container Image]
    │
    ▼
[6. Deploy to Staging] ──fail──▶ ❌ Block promotion
    │
    ▼
[7. Smoke Tests (Staging)]
    │
    ▼
[8. Manual Approval Gate] ──(if required)
    │
    ▼
[9. Deploy to Production] ──fail──▶ 🔁 Auto-rollback (if configured)
    │
    ▼
[10. Post-deploy checks]
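
The gating behaviour in the diagram above can be sketched as a short runner: stages execute in order, a failed blocking stage stops everything after it, and a failed non-blocking stage only records a warning. This is a minimal illustration rather than how any particular CI platform works; `Stage` and `run_pipeline` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True when the stage succeeds
    blocking: bool = True    # a failed blocking stage stops the pipeline


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first blocking failure."""
    log: List[str] = []
    for stage in stages:
        if stage.run():
            log.append(f"{stage.name}: passed")
        elif stage.blocking:
            log.append(f"{stage.name}: failed (blocking)")
            return False, log
        else:
            log.append(f"{stage.name}: failed (warning only)")
    return True, log
```

Marking Stage 4 as `blocking=False` corresponds to the "Warn" option in the diagram; leaving the default corresponds to "Block".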

Stage Definitions

Stage 1 — Build & Lint

What runs: [Build command] + [Linter — e.g. ESLint, golangci-lint, flake8]
Trigger: Every commit to any branch
Blocking: Yes — PR cannot be merged if this fails
Typical duration: [X minutes]
Owner if it fails: PR author

Common failure causes:

[e.g. Missing dependency — run npm install locally before pushing]
[e.g. Lint rule violation — run npm run lint -- --fix to auto-fix most issues]

Stage 2 — Unit Tests

What runs: [Test command — e.g. npm test, go test ./..., pytest]
Coverage gate: [X]% minimum — pipeline fails below this threshold
Trigger: Every commit
Blocking: Yes
Typical duration: [X minutes]

Coverage report: [Where to find it — e.g. uploaded to Codecov, available in CI artifacts]
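
As a rough illustration of the coverage gate, the check reduces to comparing a measured percentage with the configured minimum. Most coverage tools ship their own threshold option, so treat this as a sketch of the logic only; `coverage_gate` and its parameters are hypothetical.

```python
def coverage_gate(covered_lines: int, total_lines: int, minimum_percent: float) -> bool:
    """Return True when measured coverage meets the minimum threshold."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    percent = 100.0 * covered_lines / total_lines
    return percent >= minimum_percent
```

With a minimum of 80%, 799 covered lines out of 1000 (79.9%) fails the gate, while 80 out of 100 passes.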

Stage 3 — Integration Tests

What runs: [Test suite description — e.g. "API integration tests against a test database using Docker Compose"]
Environment: [Ephemeral test environment / shared test DB / etc.]
Trigger: Every commit to main and feature branches targeting main
Blocking: Yes
Typical duration: [X minutes]

If slow: [e.g. "Integration tests can be skipped locally with SKIP_INTEGRATION=true — never skip in CI"]

Stage 4 — Security Scan

Tools: [e.g. Snyk, Trivy, OWASP Dependency Check, Semgrep]
What it checks: [Dependency vulnerabilities / SAST / secrets detection — list what applies]
Blocking on: Critical and High severity findings
Non-blocking on: Medium and Low (flagged, not blocking)
Trigger: Every commit to main

How to handle a flagged vulnerability:

Check if a fix is available — upgrade the dependency
If no fix available, open a security ticket and add a suppression with justification
Never suppress without a ticket and owner
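
The severity policy and the suppression rule above can be combined into one gate. The sketch below assumes findings have already been parsed from the scanner's report into simple records; the `Finding` fields and `blocking_findings` are hypothetical and do not correspond to any scanner's real output format.

```python
from dataclasses import dataclass
from typing import List, Optional

BLOCKING_SEVERITIES = {"critical", "high"}


@dataclass
class Finding:
    id: str
    severity: str                 # e.g. "Critical", "High", "Medium", "Low"
    suppressed: bool = False
    ticket: Optional[str] = None  # security ticket justifying the suppression
    owner: Optional[str] = None   # named owner of the suppression


def blocking_findings(findings: List[Finding]) -> List[Finding]:
    """Return the findings that should fail the pipeline."""
    blocking = []
    for finding in findings:
        accountable = bool(finding.ticket) and bool(finding.owner)
        if finding.suppressed and not accountable:
            # A suppression without a ticket and owner fails at any severity.
            blocking.append(finding)
        elif finding.severity.lower() in BLOCKING_SEVERITIES and not finding.suppressed:
            blocking.append(finding)
    return blocking
```

Treating an unaccountable suppression as a failure in its own right is a design choice made for this sketch; it enforces the "never suppress without a ticket and owner" rule mechanically instead of relying on review.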

Stage 5 — Build Artefact

What is produced: [Docker image / binary / zip — be specific]
Registry: [ECR / GCR / Docker Hub / Artifactory — URL]
Tagging convention: [service-name]:[git-sha] (also tagged :latest on main)
Trigger: Commits to main only (not feature branches)
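
The tagging convention and trigger above might be expressed as a small helper that returns the tags to push for a given commit. `image_tags` is a hypothetical name, and the service name and SHA in the usage note are made up for illustration.

```python
from typing import List


def image_tags(service_name: str, git_sha: str, branch: str) -> List[str]:
    """Return image tags for a commit; artefacts are built from main only."""
    if branch != "main":
        return []  # feature branches run CI but produce no artefact
    return [f"{service_name}:{git_sha}", f"{service_name}:latest"]
```

For example, `image_tags("orders-api", "3f2a9c1", "main")` yields `["orders-api:3f2a9c1", "orders-api:latest"]`. Deploying by the SHA tag rather than `:latest` keeps rollbacks unambiguous.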

Stage 6 — Deploy to Staging

Deployment method: [e.g. Helm upgrade / kubectl apply / ecs deploy / Terraform apply]
Staging URL: [URL]
Trigger: Automatic on successful artefact build from main
Who can deploy to staging: Any engineer (automatic)

Environment variables: Managed in [Vault / AWS SSM / GitHub Secrets / etc.]
Staging is not production: [Any differences in config, scale, or data — state them here]

Stage 7 — Smoke Tests (Staging)

What runs: [Description — e.g. "10 critical path tests covering login, core API endpoints, and payment flow"]
Tool: [e.g. Playwright / Postman / custom script]
Pass criteria: All smoke tests pass within [X seconds] timeout
Blocking: Yes — production deploy will not proceed if smoke tests fail

Smoke test suite location: [Link to test files or folder]

Stage 8 — Manual Approval Gate

Required for: [Production deploys / deploys affecting >X% of traffic / deploys to specific regions]
Who can approve: [e.g. Any engineer on the team / Lead engineer / On-call engineer]
Approval timeout: [e.g. 24 hours — auto-cancelled if no approval]
How to approve: [GitHub Actions approve step / Slack command / other — with link]

When to withhold approval:

Active incident in production
Deploy is outside the deployment window (see below)
On-call engineer has not been notified
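
The three conditions above can be turned into a pre-approval checklist that an approver, or a bot posting to the deploy channel, evaluates before approving. A minimal sketch, assuming the three inputs are already known; `reasons_to_withhold` is a hypothetical helper.

```python
from typing import List


def reasons_to_withhold(
    active_incident: bool,
    inside_deployment_window: bool,
    on_call_notified: bool,
) -> List[str]:
    """Return every reason approval should be withheld (empty means approve)."""
    reasons = []
    if active_incident:
        reasons.append("active incident in production")
    if not inside_deployment_window:
        reasons.append("outside the deployment window")
    if not on_call_notified:
        reasons.append("on-call engineer has not been notified")
    return reasons
```

Returning all reasons, not just the first, lets the requester fix everything in one pass.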

Stage 9 — Deploy to Production

Deployment method: [Same as staging or different — specify]
Deployment window: [e.g. Monday–Thursday 09:00–16:00 UTC — no deploys on Fridays or before bank holidays]
Canary / progressive rollout: [Yes — X% initial traffic, full rollout after Y minutes / No — full deploy]
Deployment notifications: [Slack channel — #deployments]

Who is on-call during deploy: Deploying engineer is responsible until post-deploy checks pass.
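
A deployment window is easier to enforce when it is checked in code. The sketch below hard-codes the example window from this stage (Monday to Thursday, 09:00 to 16:00 UTC) and, as a stated simplification, does not model bank holidays; `in_deployment_window` is a hypothetical helper.

```python
from datetime import datetime, timezone


def in_deployment_window(now: datetime) -> bool:
    """Return True if `now` falls within Monday-Thursday, 09:00-16:00 UTC."""
    if now.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    now_utc = now.astimezone(timezone.utc)
    is_monday_to_thursday = now_utc.weekday() <= 3  # Monday is 0
    return is_monday_to_thursday and 9 <= now_utc.hour < 16
```

A pipeline step could call this with `datetime.now(timezone.utc)` and fail fast before the approval gate is even requested.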

Stage 10 — Post-Deploy Checks

Automated checks (run for [X minutes] after deploy):

[ ] Error rate: <[X]% (baseline: [Y]%)
[ ] P99 latency: <[X]ms (baseline: [Y]ms)
[ ] [Key business metric]: within [X]% of baseline

Where to watch: [Datadog / Grafana / CloudWatch dashboard — link]

If a check fails: See Rollback Procedure below.
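
The three automated checks above could be evaluated along these lines once the metrics have been pulled from the monitoring system. Fetching the metrics is out of scope here, and `Thresholds` and `failed_checks` are hypothetical names; the business-metric baseline is assumed to be non-zero.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Thresholds:
    max_error_rate_percent: float      # check passes strictly below this
    max_p99_latency_ms: float          # check passes strictly below this
    business_baseline: float           # assumed non-zero
    business_tolerance_percent: float  # allowed deviation from baseline


def failed_checks(
    error_rate_percent: float,
    p99_latency_ms: float,
    business_metric: float,
    thresholds: Thresholds,
) -> List[str]:
    """Return the names of the post-deploy checks that failed."""
    failures = []
    if error_rate_percent >= thresholds.max_error_rate_percent:
        failures.append("error rate")
    if p99_latency_ms >= thresholds.max_p99_latency_ms:
        failures.append("p99 latency")
    deviation = abs(business_metric - thresholds.business_baseline)
    deviation_percent = deviation / thresholds.business_baseline * 100.0
    if deviation_percent > thresholds.business_tolerance_percent:
        failures.append("business metric")
    return failures
```

A non-empty result during the observation window is the signal to start the rollback procedure.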

Environments

| Environment | Purpose | Deploy trigger | URL | Data |
|---|---|---|---|---|
| Dev | Local development | Manual | localhost | Seeded test data |
| Staging | Pre-production validation | Automatic (main) | [URL] | Anonymised prod copy |
| Production | Live traffic | Manual approval | [URL] | Live data |

Branching Strategy

Model: [Trunk-based / GitFlow / GitHub Flow — describe briefly]

| Branch | Purpose | Who merges | Deploy target |
|---|---|---|---|
| main | Production-ready code | PR + review | Staging → Production |
| feature/* | Feature development | Author | None (CI only) |
| hotfix/* | Critical production fixes | Lead engineer | Can bypass staging gate with approval |

Hotfix process: [Describe when and how to use a hotfix branch — what level of incident justifies bypassing the standard process]
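
The branch rules above amount to a pattern-to-target lookup. The sketch below uses shell-style patterns from Python's standard `fnmatch` module; the target strings and the choice to treat unknown branches as CI-only are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# (branch pattern, deploy target), checked in order
BRANCH_RULES = [
    ("main", "staging, then production"),
    ("feature/*", "none (CI only)"),
    ("hotfix/*", "production (may bypass staging gate with approval)"),
]


def deploy_target(branch: str) -> str:
    """Return the deploy target for a branch name."""
    for pattern, target in BRANCH_RULES:
        if fnmatchcase(branch, pattern):
            return target
    return "none (CI only)"  # assumed conservative default for unknown branches
```

Defaulting unmatched branches to CI-only means a mistyped branch name can never reach an environment by accident.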

Rollback Procedure

Automated rollback: [Yes — triggered if post-deploy error rate exceeds [X]% / No — manual only]

Manual rollback steps:

# 1. Identify the last known good image tag
[command to list recent deployments]

# 2. Deploy the previous version
[deployment command with previous tag]

# 3. Confirm rollback is live
[smoke test command or health check URL]

# 4. Notify the team
[Slack command or template]

Rollback decision authority: Any on-call engineer can initiate a rollback without waiting for approval.
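
Step 1 of the manual rollback, identifying the last known good image tag, can be sketched as a scan over deployment history. This assumes the history is available as (tag, healthy) pairs ordered oldest first, with the current failing deploy last; real deployment tools expose history in their own formats, and `last_known_good` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def last_known_good(history: List[Tuple[str, bool]]) -> Optional[str]:
    """Return the most recent healthy tag before the current deploy, if any."""
    previous_deploys = history[:-1]  # exclude the current (failing) deploy
    for tag, healthy in reversed(previous_deploys):
        if healthy:
            return tag
    return None
```

Skipping unhealthy entries matters when two bad deploys land back to back: rolling back one version would only restore the other broken release.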

After a rollback:

Create a post-deploy incident report (see [incident-postmortem skill])
Do not re-deploy the same commit without fixing the root cause
Notify [stakeholder / support team] of the rollback and expected fix timeline

Secrets and Configuration Management

Secret store: [Vault / AWS SSM / GitHub Secrets / Doppler — specify]
How to add a new secret:

[Step 1]
[Step 2]

Who has access: [Role or team]
Rotation policy: [How often secrets are rotated and who owns it]

Never do: Commit secrets to source control, even in .env files. The pipeline includes secret scanning (Stage 4) which will flag this.
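
To make the secret-scanning idea concrete, here is a deliberately small sketch that flags two well-known patterns: AWS access key IDs (the `AKIA` prefix followed by 16 uppercase letters or digits) and PEM private key headers. Dedicated scanners use far larger rule sets plus entropy checks, so this illustrates the technique and is not a substitute for the Stage 4 tooling; `RULES` and `scan_text` are hypothetical names.

```python
import re
from typing import List, Tuple

RULES = [
    ("aws-access-key-id", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("private-key-header", re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----")),
]


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line number, rule name) for every line that matches a rule."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in RULES:
            if pattern.search(line):
                hits.append((line_number, rule_name))
    return hits
```

Running the same check as a local pre-commit hook catches a secret before it enters history, which is cheaper than rotating it afterwards.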

Common Failures and Fixes

| Failure | Likely cause | Fix |
|---|---|---|
| Build fails with "module not found" | Dependency not installed | Run [install command] and commit the lock file |
| Integration tests time out | Test DB not seeded / external service down | Check [service] status; re-run pipeline |
| Smoke tests fail after staging deploy | Environment variable missing | Check [config location]; compare staging and prod env vars |
| Production deploy stuck at approval | Approver not notified | Tag @[on-call handle] in #deployments |
| Post-deploy error rate spike | Bad deploy / upstream dependency | Check [dashboard]; initiate rollback if the spike lasts more than 5 minutes |

On-Call Responsibilities During Deploy

The deploying engineer is responsible for monitoring post-deploy checks for [X minutes] after a production deploy
If you cannot monitor after deploying, hand off explicitly to another engineer in #deployments
Outside business hours, deploy only hotfixes, and always page the on-call engineer before deploying

Anti-Patterns

[ ] Do not describe a rollback procedure that has never been tested — a theoretical rollback is not a rollback plan; test it in staging before production
[ ] Do not allow deploys on Fridays or before holidays without an explicit on-call engineer who will monitor through the weekend
[ ] Do not commit secrets to source control even in non-production branches — secret scanning in the pipeline catches this, but prevention is the standard
[ ] Do not skip post-deploy monitoring after a production deploy — the deploying engineer must watch error rates and latency for the specified observation window
[ ] Do not suppress a security scan finding without a linked ticket and a named owner — suppressions without accountability accumulate into unmanaged risk

Quality Checks

[ ] Every stage has a clear owner when it fails
[ ] Rollback procedure is tested — not theoretical
[ ] Secrets management section names the actual tool used (not "use secrets management")
[ ] Deployment window is specific — not "during business hours"
[ ] Post-deploy check thresholds are calibrated to actual baseline metrics
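
Several of these checks fail simply because a template placeholder such as [X minutes] was never filled in. A finished playbook can be screened for leftovers with a short script; the regex below skips checkbox markers and Markdown links, and `unfilled_placeholders` is a hypothetical helper, not part of the skill itself.

```python
import re
from typing import List

# A bracketed span that is not a checkbox ("[ ]", "[x]") and is not
# immediately followed by "(" as in a Markdown link.
PLACEHOLDER = re.compile(r"\[(?![ xX]\])[^\[\]\n]+\](?!\()")


def unfilled_placeholders(playbook_text: str) -> List[str]:
    """Return bracketed template placeholders still present in the text."""
    return PLACEHOLDER.findall(playbook_text)
```

An empty result does not prove the playbook is good, only that no obvious template text remains; the checks above still need a human read.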

mohitagw15856/cicd-playbook