qa-aman/etl-runbook

Name: etl-runbook
Author: qa-aman

skills/by-role/data-engineer/etl-runbook/SKILL.md

npx skillsauth add qa-aman/claude-skills etl-runbook


Overview

Based on Data Pipelines Pocket Reference (Densmore) and Fundamentals of Data Engineering (Reis & Housley). An ETL runbook is the operational contract for a data job: how to monitor it, how to diagnose it when it breaks, and how to recover data integrity. Densmore's standard: data pipelines fail in subtle ways, not always with an error but sometimes with silent data loss or incorrect rows. A good runbook covers job failures AND data quality failures.

The test: can an on-call engineer who didn't build this pipeline detect a problem, isolate the cause, and restore data integrity at 3am?

Workflow

Step 1: Write the job overview

Job: [job name]
Pipeline: [parent pipeline name]
Owner: [team]
Purpose: [what this job does and what breaks downstream if it doesn't run]
Schedule: [cron expression or trigger condition]
SLA: [data must be available by X time or downstream is impacted]
Typical runtime: [expected duration, e.g. 15-25 minutes]
Max runtime before alert: [e.g. 45 minutes]
Orchestrator: [Airflow / Prefect / dbt Cloud / etc.]
Job link: [direct link to job in orchestration UI]
Dashboard: [link to data quality/freshness monitoring]

Step 2: Define what "healthy" looks like

Data quality thresholds, not just job status:

Healthy signals:
- Job status: SUCCESS
- Row count: [expected range, e.g. 50,000–200,000 rows]
- Null rate on critical fields: < 0.1%
- Duplicate rate: 0% (on primary key)
- Freshness: data timestamp within [N hours] of run time
- Downstream table last_updated: matches run date

Alert on these metrics: [link to monitoring tool]
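
The healthy signals above can be evaluated mechanically after each run. The following is a minimal sketch, assuming the run's metrics have already been collected into a plain dictionary. The function name `check_health`, the metric keys, and the default thresholds are hypothetical and only mirror the example values in the list.

```python
from datetime import datetime, timedelta


def check_health(metrics, min_rows=50_000, max_rows=200_000,
                 max_null_rate=0.001, max_age=timedelta(hours=6)):
    """Return the names of failed health signals; an empty list means healthy.

    The keys of `metrics` and every default threshold are illustrative
    assumptions, not values from a real pipeline.
    """
    failures = []
    if metrics["job_status"] != "SUCCESS":
        failures.append("job_status")
    if not (min_rows <= metrics["row_count"] <= max_rows):
        failures.append("row_count")
    if metrics["null_rate"] >= max_null_rate:  # healthy means < 0.1%
        failures.append("null_rate")
    if metrics["duplicate_count"] > 0:
        failures.append("duplicates")
    if metrics["run_time"] - metrics["max_data_timestamp"] > max_age:
        failures.append("freshness")
    return failures


run = datetime(2026, 4, 23, 3, 0)
silent_failure = {
    "job_status": "SUCCESS",   # the job itself reported success
    "row_count": 1_200,        # but far fewer rows than expected arrived
    "null_rate": 0.0,
    "duplicate_count": 0,
    "max_data_timestamp": run - timedelta(hours=1),
    "run_time": run,
}
print(check_health(silent_failure))  # ['row_count']
```

The sample run shows why job status alone is not enough: the status is SUCCESS, yet the row count check fails.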

Step 3: Diagnose a failed job

Use this decision tree when a job fails or an alert fires:

1. Check job status:

[Orchestration UI link] → find the failed run → check error message
Common errors:
  - "Connection refused" → source database is unavailable (see Step 4a)
  - "Schema mismatch" → source schema changed (see Step 4b)
  - "Out of memory" → data volume spike (see Step 4c)
  - "Timeout" → source query is slow (see Step 4d)
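
The error-to-step mapping above could be encoded so that an alert message links straight to the right part of the runbook. This is a sketch under the assumption that simple, case-insensitive substring matching is good enough; `ERROR_TO_STEP` and `route_error` are hypothetical names, and real error strings from a given orchestrator or driver may be worded differently.

```python
# Error-message fragments mapped to runbook steps, mirroring the list above.
ERROR_TO_STEP = {
    "connection refused": "4a",
    "schema mismatch": "4b",
    "out of memory": "4c",
    "timeout": "4d",
}


def route_error(message):
    """Return the runbook step for an error message, or None if unrecognized."""
    lowered = message.lower()
    for fragment, step in ERROR_TO_STEP.items():
        if fragment in lowered:
            return step
    return None


print(route_error("OperationalError: Connection refused"))  # 4a
print(route_error("Unexpected token at line 3"))            # None
```

An unrecognized message returns None, which is the cue to continue with the data checks that follow rather than guess at a cause.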

2. Check if data arrived:

-- Did the source data land in the raw layer?
SELECT COUNT(*), MAX(extracted_at)
FROM raw.[source_table]
WHERE DATE(extracted_at) = CURRENT_DATE;
-- Expected: > 0 rows, extracted_at within last [N] hours

3. Check data quality:

-- Are there unexpected nulls?
SELECT COUNT(*) as nulls
FROM [target_table]
WHERE [critical_field] IS NULL AND DATE(created_at) = CURRENT_DATE;
-- Expected: 0 (or a null rate below the Step 2 threshold)

-- Are there duplicates?
SELECT [primary_key], COUNT(*) as cnt
FROM [target_table]
WHERE DATE(created_at) = CURRENT_DATE
GROUP BY [primary_key] HAVING COUNT(*) > 1;
-- Expected: 0 rows
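
To show what the null and duplicate checks return on bad data, here is a self-contained sketch that runs them against an in-memory SQLite database. The table `orders`, its columns, and the sample rows are invented for illustration, and a fixed run date is used instead of CURRENT_DATE so the result is reproducible. SQL dialects differ, so the queries may need adjusting for a particular warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10, "2026-04-23 01:00:00"),
        (2, None, "2026-04-23 01:05:00"),  # null in a critical field
        (3, 12, "2026-04-23 01:10:00"),
        (3, 12, "2026-04-23 01:10:00"),    # duplicate primary key
    ],
)

run_date = "2026-04-23"  # hypothetical run date, stands in for CURRENT_DATE

# Count rows loaded on the run date with a null critical field.
null_count = conn.execute(
    "SELECT COUNT(*) FROM orders "
    "WHERE customer_id IS NULL AND DATE(created_at) = ?",
    (run_date,),
).fetchone()[0]

# List primary keys that appear more than once on the run date.
duplicate_keys = conn.execute(
    "SELECT order_id, COUNT(*) FROM orders WHERE DATE(created_at) = ? "
    "GROUP BY order_id HAVING COUNT(*) > 1",
    (run_date,),
).fetchall()

print(null_count)      # 1
print(duplicate_keys)  # [(3, 2)]
```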

Step 4: Resolve common failures

4a. Source database unavailable

Check: ping source system or check their status page
If transient (< 30 min): wait and trigger a manual rerun after source recovers
If prolonged: escalate to source team via [contact method]
Rerun command: [airflow dags trigger / dbt run / etc.]

4b. Source schema changed

Symptom: column not found / type mismatch error in job logs
Diagnose: compare current source schema against schema spec [link]
Fix:
  1. Identify which column changed
  2. Update schema spec
  3. Update pipeline transformation logic
  4. Backfill affected date range (see Step 5)
Escalate: notify [data platform team] if schema change was unannounced

4c. Out of memory / volume spike

Symptom: job killed due to OOM, or runtime > [max runtime threshold]
Diagnose:
  SELECT COUNT(*) FROM [source_table] WHERE DATE(created_at) = CURRENT_DATE;
  Compare to normal range in Step 2.
Fix:
  - If volume is legitimately large: rerun with increased memory allocation
    [specific command or config change]
  - If volume is abnormal (data duplication upstream): do NOT load. Escalate.
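
The load-or-escalate decision above can be sketched as a small function. This assumes that upstream duplication shows up as repeated primary keys in the source, which will not hold for every pipeline; `assess_volume` and its parameters are hypothetical.

```python
def assess_volume(total_rows, distinct_keys, expected_max):
    """Classify a source volume reading against the Step 2 row count range.

    Returns "normal", "large" (legitimate growth: rerun with more memory),
    or "duplicated" (repeated keys upstream: do NOT load, escalate).
    """
    if total_rows <= expected_max:
        return "normal"
    if distinct_keys < total_rows:
        return "duplicated"
    return "large"


print(assess_volume(150_000, 150_000, 200_000))  # normal
print(assess_volume(600_000, 200_000, 200_000))  # duplicated
print(assess_volume(600_000, 600_000, 200_000))  # large
```

The two inputs would come from `COUNT(*)` and `COUNT(DISTINCT [primary_key])` on the source table for the run date.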

4d. Source query timeout

Diagnose: check source query runtime vs. normal
Fix:
  1. Check if source DB has an active incident
  2. If query plan degraded: run the source query with EXPLAIN to look for a missing index or a full table scan
  3. Short-term: increase job timeout in [config file]
  4. Long-term: file ticket to optimize source query or add index

Step 5: Backfill procedure

When data needs to be reloaded for a historical date range:

When to backfill: source was late, schema changed mid-run, data quality failure detected

Backfill command:
[exact command with date range parameters]

Idempotency: [yes/no]
  If yes: safe to rerun, uses MERGE/UPSERT on [primary key]
  If no: must truncate target partition first:
    DELETE FROM [table] WHERE DATE(partition_date) BETWEEN [start] AND [end];

Backfill duration estimate: [N rows/minute, use to estimate runtime]

After backfill, verify:
  SELECT COUNT(*), MIN(created_at), MAX(created_at)
  FROM [target_table]
  WHERE DATE(created_at) BETWEEN [start] AND [end];
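
One way to make a non-idempotent load safe to rerun is to delete the target date range and re-insert it inside a single transaction. The sketch below illustrates that pattern with an in-memory SQLite database; the `target` table, the `backfill` function, and the sample rows are hypothetical, and a production warehouse would more likely use MERGE or a partition overwrite.

```python
import sqlite3


def backfill(conn, rows, start, end):
    """Reload [start, end] by deleting the range and re-inserting it atomically."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "DELETE FROM target WHERE partition_date BETWEEN ? AND ?",
            (start, end),
        )
        conn.executemany(
            "INSERT INTO target (id, partition_date) VALUES (?, ?)",
            [r for r in rows if start <= r[1] <= end],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, partition_date TEXT)")
source_rows = [(1, "2026-04-20"), (2, "2026-04-21"), (3, "2026-04-22")]

backfill(conn, source_rows, "2026-04-20", "2026-04-21")
backfill(conn, source_rows, "2026-04-20", "2026-04-21")  # rerun adds no duplicates

# Verification query, same shape as the one in the procedure above.
count, first, last = conn.execute(
    "SELECT COUNT(*), MIN(partition_date), MAX(partition_date) FROM target "
    "WHERE partition_date BETWEEN ? AND ?",
    ("2026-04-20", "2026-04-21"),
).fetchone()
print(count, first, last)  # 2 2026-04-20 2026-04-21
```

Running the backfill twice leaves the same two rows in place, which is the property the Idempotency field asks the runbook author to state.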

Step 6: Escalation path

| Situation | Contact | Channel |
|-----------|---------|---------|
| Source system down > 30 min | [Source team on-call] | [PagerDuty / Slack] |
| Unannounced schema change | [Source team lead] | [channel] |
| Data quality failure affecting downstream reports | [Data platform lead] | [channel] |
| Cannot determine root cause within 30 min | [Senior DE on-call] | [PagerDuty] |

Anti-Patterns

1. Job success ≠ data success
   Bad: "If the job shows SUCCESS, we're done."
   Good: Check row counts, null rates, and freshness even on successful runs. Silent data quality failures are more dangerous than job failures because they go undetected longer.

2. No backfill procedure
   Bad: "Rerun the job."
   Good: Specify whether reruns are idempotent, what to truncate if not, and the exact command with date range parameters.

3. Diagnose without data context
   Bad: "Check the logs."
   Good: Provide specific SQL queries to confirm whether data arrived, check for nulls, and detect duplicates.

4. Escalate immediately
   Bad: On-call escalates after every failure without attempting diagnosis.
   Good: Runbook gives the on-call engineer 30 minutes of self-service steps before escalation is appropriate.

Quality Checklist

[ ] Job overview includes schedule, SLA, expected runtime, and direct links
[ ] "Healthy" state defined with row count ranges, null rates, and freshness thresholds
[ ] Diagnosis flow covers job failure AND data quality failure (they are different)
[ ] Top 4 failure modes documented with exact diagnostic steps and SQL queries
[ ] Backfill procedure specifies idempotency, truncation if needed, and verification query
[ ] Escalation path names specific people or teams and contact channels
[ ] All commands and queries are copy-pasteable (no "run the relevant command")

Write an ETL operational runbook. Use when the user says "ETL runbook", "pipeline runbook", "data job runbook", "how to operate this ETL", "on-call guide for data pipelines", "what to do when the data job fails", "pipeline troubleshooting guide", "data ops runbook", or needs to document how to operate, monitor, and recover a data pipeline or ETL job - even if they don't explicitly say "runbook".

13 stars

testing

Updated Apr 23, 2026

npx skillsauth add qa-aman/claude-skills etl-runbook

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | Percent |
|--------|---------|-------------|---------|
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | | 50% |

Last scanned: Apr 23, 2026, 1:49 PM (51.8s, 1 file scanned)

Install manually

# Clone the repo
git clone https://github.com/qa-aman/claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r claude-skills/skills/by-role/data-engineer/etl-runbook ~/.claude/skills/

Repository: qa-aman/claude-skills (13 stars)
Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT