data-engineer-skill/SKILL.md
Use when user needs scalable data pipeline development, ETL/ELT implementation, or data infrastructure design.
npx skillsauth add 404kidwiz/claude-supercode-skills data-engineer
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Provides expert data engineering capabilities for building scalable data pipelines, ETL/ELT workflows, data lakes, and data warehouses. Specializes in distributed data processing, stream processing, data quality, and modern data stack technologies (Airflow, dbt, Spark, Kafka) with focus on reliability and cost optimization.
Invoke this skill when:
Do NOT invoke when:
├─ Batch Processing?
│  ├─ Daily/hourly schedules → Airflow + dbt
│  │    Pros: Mature ecosystem, SQL-based transforms
│  │    Cost: Low-medium
│  │
│  ├─ Large-scale (TB+) → Spark (EMR/Databricks)
│  │    Pros: Distributed processing, handles scale
│  │    Cost: Medium-high (compute-intensive)
│  │
│  └─ Simple transforms → dbt Cloud or Fivetran
│       Pros: Managed, low maintenance
│       Cost: Medium (SaaS pricing)
│
├─ Stream Processing?
│  ├─ Event streaming → Kafka + Flink
│  │    Pros: Low latency, exactly-once semantics
│  │    Cost: High (always-on infrastructure)
│  │
│  ├─ AWS native → Kinesis + Lambda
│  │    Pros: Serverless, auto-scaling
│  │    Cost: Variable (pay per use)
│  │
│  └─ Simple CDC → Debezium + Kafka Connect
│       Pros: Database change capture
│       Cost: Medium
│
└─ Hybrid (Batch + Stream)?
   └─ Lambda Architecture or Kappa Architecture
        Lambda: Separate batch/speed layers
        Kappa: Single stream-first approach
| Use Case | Technology | Pros | Cons |
|----------|------------|------|------|
| Structured analytics | Snowflake/BigQuery | SQL, fast queries | Cost at scale |
| Semi-structured | Delta Lake/Iceberg | ACID, schema evolution | Complexity |
| Raw storage | S3/GCS | Cheap, durable | No query engine |
| Real-time | Redis/DynamoDB | Low latency | Limited analytics |
| Time-series | TimescaleDB/InfluxDB | Optimized for time data | Specific use case |

| Factor | ETL (Transform First) | ELT (Load First) |
|--------|----------------------|------------------|
| Data volume | Small-medium | Large (TB+) |
| Transformation | Complex, pre-load | SQL-based, in-warehouse |
| Latency | Higher | Lower |
| Cost | Compute before load | Warehouse compute |
| Best for | Legacy systems | Modern cloud DW |
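
To make the ETL/ELT distinction concrete, here is a minimal sketch that runs both orderings against an in-memory SQLite database standing in for a warehouse. The `RAW_ORDERS` data, the table names, and the `run_etl`/`run_elt` helpers are all hypothetical and chosen only for illustration; a real pipeline would target a warehouse such as Snowflake or BigQuery.

```python
import sqlite3

# Hypothetical raw records; amounts arrive as strings, as they might from a CSV export.
RAW_ORDERS = [("o1", "10.50"), ("o2", "4.25"), ("o3", "7.00")]


def run_etl(conn):
    """ETL: transform in application code, then load only the cleaned rows."""
    cleaned = [(order_id, float(amount)) for order_id, amount in RAW_ORDERS]
    conn.execute("CREATE TABLE orders_etl (order_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", cleaned)


def run_elt(conn):
    """ELT: load raw rows untouched, then transform with SQL inside the database."""
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", RAW_ORDERS)
    conn.execute(
        "CREATE TABLE orders_elt AS "
        "SELECT order_id, CAST(amount AS REAL) AS amount FROM raw_orders"
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_total = conn.execute("SELECT SUM(amount) FROM orders_etl").fetchone()[0]
elt_total = conn.execute("SELECT SUM(amount) FROM orders_elt").fetchone()[0]
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Both paths end with the same cleaned table. The difference is where the compute runs and whether the raw data is retained: the ELT path keeps `raw_orders` available for later re-transformation.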
Use case: Safely re-run batch jobs without creating duplicates.
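# PySpark example: overwrite only the partition for the execution date
from pyspark.sql import functions as F

def write_daily_partition(df, target_table, execution_date):
    # Keep only rows for this run's date so a re-run rewrites exactly one partition.
    daily = df.filter(F.col("process_date") == F.lit(execution_date))
    # Caveat: on some Spark versions saveAsTable in overwrite mode replaces the
    # whole table; verify this on your setup before relying on dynamic overwrite.
    (daily
        .write
        .mode("overwrite")
        .partitionBy("process_date")
        .option("partitionOverwriteMode", "dynamic")
        .format("parquet")
        .saveAsTable(target_table))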
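
The idempotency property itself can be checked without a Spark cluster. The sketch below is a simplified model, assuming a plain dictionary keyed by partition date in place of a real table; `overwrite_partition` and `append_rows` are hypothetical helpers, not Spark APIs. It runs the same job twice and contrasts overwrite with append.

```python
def overwrite_partition(table, partition_key, rows):
    """Replace one partition wholesale; other partitions are left untouched."""
    table[partition_key] = list(rows)
    return table


def append_rows(table, partition_key, rows):
    """Non-idempotent alternative, shown for contrast: re-runs duplicate rows."""
    table.setdefault(partition_key, []).extend(rows)
    return table


daily_rows = [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 5}]

# The overwritten table starts with an older partition that must survive re-runs.
overwritten = {"2024-01-01": [{"user_id": 9, "amount": 1}]}
appended = {}

for _ in range(2):  # simulate the same job running twice
    overwrite_partition(overwritten, "2024-01-02", daily_rows)
    append_rows(appended, "2024-01-02", daily_rows)
```

After two runs the overwritten partition still holds two rows, while the appended one holds four. That duplication is what partition overwrite is meant to prevent.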
Use case: Track history of changes without losing past states.
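-- dbt implementation of SCD2
-- unique_key includes updated_at so each version of a user keeps its own row;
-- keying on user_id alone would merge the history down to one row per user.
-- (A list-valued unique_key requires a recent dbt version.)
{{ config(materialized='incremental', unique_key=['user_id', 'updated_at']) }}
SELECT
    user_id, address, email, status, updated_at,
    LEAD(updated_at, 1, '9999-12-31') OVER (
        PARTITION BY user_id ORDER BY updated_at
    ) AS valid_to
FROM {{ source('raw', 'users') }}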
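
For readers less familiar with window functions, the following sketch mirrors what the `LEAD(...) OVER (PARTITION BY user_id ORDER BY updated_at)` expression computes, using plain Python. The `add_valid_to` helper and the sample rows are hypothetical; dates are ISO-format strings so that string ordering matches chronological ordering.

```python
from collections import defaultdict

OPEN_ENDED = "9999-12-31"  # same sentinel the query uses for the current row


def add_valid_to(rows):
    """Give each row a valid_to equal to the next updated_at for the same user."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row["user_id"]].append(row)

    result = []
    for user_id in sorted(by_user):
        versions = sorted(by_user[user_id], key=lambda r: r["updated_at"])
        # Pair each version with its successor; the last version has none.
        for current, following in zip(versions, versions[1:] + [None]):
            valid_to = following["updated_at"] if following else OPEN_ENDED
            result.append({**current, "valid_to": valid_to})
    return result


history = add_valid_to([
    {"user_id": 1, "status": "trial", "updated_at": "2024-01-01"},
    {"user_id": 1, "status": "active", "updated_at": "2024-03-15"},
    {"user_id": 2, "status": "active", "updated_at": "2024-02-01"},
])
```

Each superseded row ends where its successor begins, and the latest row per user carries the open-ended sentinel, so a point-in-time lookup can filter on `updated_at <= t < valid_to`.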
Use case: Handle malformed messages without stopping the pipeline.
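
One way to sketch this dead-letter pattern, assuming messages arrive as JSON strings and using an in-memory list where a production system would publish to a dedicated dead-letter topic or queue. The `process_batch` function and the `user_id` validation rule are hypothetical, for illustration only.

```python
import json


def process_batch(raw_messages):
    """Parse each message; route failures to a dead-letter list instead of raising."""
    processed, dead_letters = [], []
    for offset, raw in enumerate(raw_messages):
        try:
            event = json.loads(raw)
            # Hypothetical validation rule: every event must carry a user_id.
            if not isinstance(event, dict) or "user_id" not in event:
                raise ValueError("missing user_id")
            processed.append(event)
        except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
            # Keep the original payload and the reason so the message can be replayed.
            dead_letters.append({"offset": offset, "payload": raw, "error": str(exc)})
    return processed, dead_letters


processed, dead_letters = process_batch([
    '{"user_id": 1, "action": "login"}',
    '{not valid json',
    '{"action": "logout"}',
    '{"user_id": 2, "action": "login"}',
])
```

The two bad messages are set aside with their offsets and error reasons while the two valid ones flow through, so a single malformed record does not block the rest of the batch.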
Use case: Stop pipeline execution if data quality drops below threshold.
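
A minimal sketch of such a quality gate, assuming the metric is the null rate of a single column and the batch is a list of dictionaries. `DataQualityError`, `null_rate`, and `enforce_quality` are hypothetical names; in practice this check is often delegated to a tool such as dbt tests or Great Expectations.

```python
class DataQualityError(Exception):
    """Raised to halt the pipeline when a batch fails its quality check."""


def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 1.0  # assumption: treat an empty batch as fully failing
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def enforce_quality(rows, column, max_null_rate):
    """Return the rows unchanged, or raise if too many values are null."""
    rate = null_rate(rows, column)
    if rate > max_null_rate:
        raise DataQualityError(
            f"{column}: null rate {rate:.0%} exceeds limit {max_null_rate:.0%}"
        )
    return rows


good_batch = [{"email": "a@example.com"}, {"email": "b@example.com"}]
bad_batch = [{"email": None}, {"email": "c@example.com"}, {}, {"email": None}]

enforce_quality(good_batch, "email", max_null_rate=0.1)  # passes silently
try:
    enforce_quality(bad_batch, "email", max_null_rate=0.1)
    halted = False
except DataQualityError:
    halted = True
```

Raising an exception, rather than logging a warning, is the design choice that makes this a gate: an orchestrator such as Airflow marks the task failed and downstream tasks do not run on bad data.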