skills/data-pipeline-engineer/SKILL.md
Expert data engineer for ETL/ELT pipelines, streaming, data warehousing. Activate on: data pipeline, ETL, ELT, data warehouse, Spark, Kafka, Airflow, dbt, data modeling, star schema, streaming data, batch processing, data quality. NOT for: API design (use api-architect), ML training (use ML skills), dashboards (use design skills).
Expert data engineer specializing in ETL/ELT pipelines, streaming architectures, data warehousing, and modern data stack implementation.
| Capability | Technologies | Key Patterns |
|------------|--------------|--------------|
| Batch Processing | Spark, dbt, Databricks | Incremental, partitioning, Delta/Iceberg |
| Stream Processing | Kafka, Flink, Spark Streaming | Watermarks, exactly-once, windowing |
| Orchestration | Airflow, Dagster, Prefect | DAG design, sensors, task groups |
| Data Modeling | dbt, SQL | Kimball, Data Vault, SCD |
| Data Quality | Great Expectations, dbt tests | Validation suites, freshness |
BRONZE (Raw) → Exact source copy, schema-on-read, partitioned by ingestion
↓ Cleaning, Deduplication
SILVER (Cleansed) → Validated, standardized, business logic applied
↓ Aggregation, Enrichment
GOLD (Business) → Dimensional models, aggregates, ready for BI/ML
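The three layers above can be sketched in plain Python. This is only an illustration of the flow, not the implementation in ./references/; the sample records and the `to_silver` / `to_gold` helpers are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw records, kept exactly as received (Bronze).
bronze = [
    {"order_id": "1", "amount": " 10.50", "country": "us"},
    {"order_id": "1", "amount": " 10.50", "country": "us"},        # duplicate
    {"order_id": "2", "amount": "7.25", "country": "DE"},
    {"order_id": "3", "amount": "not-a-number", "country": "de"},  # invalid
]

def to_silver(rows):
    """Validate, standardize, and deduplicate Bronze rows by order_id."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # drop rows that fail validation
        if row["order_id"] in seen:
            continue  # drop duplicates
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "amount": amount,
            "country": row["country"].upper(),
        })
    return silver

def to_gold(rows):
    """Aggregate Silver rows into revenue per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
```

Because `bronze` is never mutated, Silver and Gold can be rebuilt from it at any time.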
Full implementation examples in ./references/:
| File | Description |
|------|-------------|
| dbt-project-structure.md | Complete dbt layout with staging, intermediate, marts |
| airflow-dag.py | Production DAG with sensors, task groups, quality checks |
| spark-streaming.py | Kafka-to-Delta processor with windowing |
| great-expectations-suite.json | Comprehensive data quality expectation suite |
Symptom: Truncate and rebuild entire tables every run
Fix: Use incremental models with is_incremental(), partition by date
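dbt expresses this in SQL and Jinja; the sketch below mirrors the same idea in Python with `sqlite3`, assuming hypothetical `source_events` and `target_events` tables. Only rows newer than the latest date already loaded are copied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_events (id INTEGER PRIMARY KEY, event_date TEXT, value REAL);
    CREATE TABLE target_events (id INTEGER PRIMARY KEY, event_date TEXT, value REAL);
""")

def load_incremental(conn):
    """Copy only rows newer than the latest event_date already in the target."""
    (high_water,) = conn.execute(
        "SELECT COALESCE(MAX(event_date), '') FROM target_events"
    ).fetchone()
    before = conn.total_changes
    conn.execute(
        "INSERT INTO target_events "
        "SELECT id, event_date, value FROM source_events WHERE event_date > ?",
        (high_water,),
    )
    conn.commit()
    return conn.total_changes - before  # number of rows inserted by this run

conn.executemany(
    "INSERT INTO source_events VALUES (?, ?, ?)",
    [(1, "2024-01-01", 1.0), (2, "2024-01-02", 2.0)],
)
first_run = load_incremental(conn)   # empty target, so everything is loaded

conn.execute("INSERT INTO source_events VALUES (3, '2024-01-03', 3.0)")
second_run = load_incremental(conn)  # only the new row is loaded
```

A strict `>` comparison misses late rows that share the latest date; a real model would likely pair a lookback window with a unique key.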
Symptom: Pipeline breaks when upstream adds/removes columns
Fix: Explicit source contracts, select only needed columns in staging
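A minimal sketch of a source contract in a staging step, assuming a hypothetical two-column contract. Extra upstream columns are ignored, and a missing column fails loudly instead of silently propagating.

```python
# Hypothetical contract: the columns staging needs and how to cast them.
REQUIRED_COLUMNS = {"order_id": int, "amount": float}

def apply_contract(row):
    """Keep only contracted columns; raise if the source dropped one."""
    missing = REQUIRED_COLUMNS.keys() - row.keys()
    if missing:
        raise ValueError(
            f"source contract violated, missing columns: {sorted(missing)}"
        )
    return {name: cast(row[name]) for name, cast in REQUIRED_COLUMNS.items()}

# An added upstream column does not leak into staging.
staged = apply_contract({"order_id": "7", "amount": "3.5", "new_col": "x"})
```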
Symptom: One 200-task DAG running 8 hours
Fix: Domain-specific DAGs, ExternalTaskSensor for dependencies
Symptom: Bad data reaches production before detection
Fix: Great Expectations or dbt tests at each layer, block on failures
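"Block on failures" can be sketched as a gate that raises before data moves to the next layer. This is a plain-Python stand-in, not the Great Expectations or dbt API; `quality_gate`, the check helpers, and `DataQualityError` are hypothetical names.

```python
from functools import partial

class DataQualityError(Exception):
    """Raised when a layer's checks fail, so downstream steps never run."""

def check_not_null(rows, column):
    bad = sum(1 for r in rows if r.get(column) is None)
    return bad == 0, f"{column}: {bad} null value(s)"

def check_unique(rows, column):
    values = [r.get(column) for r in rows]
    dupes = len(values) - len(set(values))
    return dupes == 0, f"{column}: {dupes} duplicate value(s)"

def quality_gate(rows, checks):
    """Run every check; raise with all failure messages, else pass rows through."""
    results = [check(rows) for check in checks]
    failures = [msg for ok, msg in results if not ok]
    if failures:
        raise DataQualityError("; ".join(failures))
    return rows

checks = [
    partial(check_not_null, column="order_id"),
    partial(check_unique, column="order_id"),
]
good_rows = quality_gate([{"order_id": 1}, {"order_id": 2}], checks)
```

Placing one such gate between each pair of layers keeps a bad batch from being promoted.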
Symptom: Raw data transformed without preserving original
Fix: Always land raw in Bronze first, make transformations reproducible
Symptom: Manual updates needed for date filters
Fix: Use Airflow templating (e.g., ds variable) or dynamic date functions
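A sketch of the idea outside Airflow: the run date arrives as a parameter instead of being hard-coded. In a DAG the value would come from the templated `ds` variable (a `YYYY-MM-DD` string); the `events` table and both helper functions here are hypothetical.

```python
from datetime import date, timedelta

def daily_window(ds):
    """Return the half-open [start, end) date window for one logical date."""
    start = date.fromisoformat(ds)
    end = start + timedelta(days=1)
    return start.isoformat(), end.isoformat()

def build_query(ds):
    """Build a parameterized query for the run's date window."""
    sql = "SELECT * FROM events WHERE event_date >= ? AND event_date < ?"
    return sql, daily_window(ds)

sql, params = build_query("2024-02-29")
```

Re-running the task for an older date is then a matter of passing a different `ds`, which also makes backfills repeatable.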
Symptom: Unbounded state growth, OOM in long-running jobs
Fix: Add withWatermark() on the event-time column so late-arriving data is bounded and state for old windows can be dropped
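Why a watermark bounds state can be shown with a small plain-Python model. This is not the Spark API and simplifies its semantics (Spark advances the watermark between micro-batches); the `WindowedCounter` class and its parameters are hypothetical.

```python
from collections import defaultdict

class WindowedCounter:
    """Tumbling-window event counter with a watermark (illustrative model)."""

    def __init__(self, window_seconds, allowed_lateness_seconds):
        self.window = window_seconds
        self.lateness = allowed_lateness_seconds
        self.max_event_time = 0
        self.open_windows = defaultdict(int)  # window start -> running count
        self.emitted = {}                     # finalized results (a sink in practice)

    def add(self, event_time):
        """Count one event; return False if it arrived behind the watermark."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.lateness
        start = event_time - event_time % self.window
        if start + self.window <= watermark:
            return False  # too late: its window was already finalized
        self.open_windows[start] += 1
        # Finalize and evict every window that ends at or before the watermark.
        closed = [s for s in self.open_windows if s + self.window <= watermark]
        for s in closed:
            self.emitted[s] = self.open_windows.pop(s)
        return True

counter = WindowedCounter(window_seconds=10, allowed_lateness_seconds=5)
accepted = [counter.add(t) for t in (1, 12, 8, 27, 3)]
```

Without the eviction step, `open_windows` would keep one entry per window forever, which is the unbounded growth described in the symptom.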
Symptom: Transient failures cause DAG failures
Fix: retries=3, retry_exponential_backoff=True, and max_retry_delay to cap the backoff
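These settings are usually passed to tasks through a DAG's `default_args`. The dictionary below uses real operator parameter names, but the `retry_delay` and `max_retry_delay` values are assumptions, and `approximate_delays` is a hypothetical helper that only illustrates the shape of the backoff, not Airflow's exact formula (which also adds jitter).

```python
from datetime import timedelta

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=1),        # assumed base delay
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(minutes=30),   # assumed cap
}

def approximate_delays(args):
    """Rough illustration: delays double per attempt, capped at max_retry_delay."""
    base, cap = args["retry_delay"], args["max_retry_delay"]
    return [min(base * 2 ** attempt, cap) for attempt in range(args["retries"])]

delays = approximate_delays(default_args)
```

The cap matters mostly for higher retry counts, where doubling would otherwise push the wait past a useful bound.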
Symptom: No one knows where data comes from or who uses it
Fix: dbt docs, data catalog integration, column-level lineage
Symptom: Bugs discovered by stakeholders, not engineers
Fix: Run models against a dev target (e.g., dbt run --target dev), test on sample datasets, add CI/CD for models
Pipeline Design:
Data Quality:
Orchestration:
Operations:
Run ./scripts/validate-pipeline.sh to check: