
404kidwiz/data-engineer

Name: data-engineer
Author: 404kidwiz

data-engineer-skill/SKILL.md

npx skillsauth add 404kidwiz/claude-supercode-skills data-engineer


Data Engineer

Purpose

Provides expert data engineering capabilities for building scalable data pipelines, ETL/ELT workflows, data lakes, and data warehouses. Specializes in distributed data processing, stream processing, data quality, and modern data stack technologies (Airflow, dbt, Spark, Kafka), with a focus on reliability and cost optimization.

When to Use

Designing end-to-end data pipelines from source to consumption layer
Implementing ETL/ELT workflows with error handling and data quality checks
Building data lakes or data warehouses with optimal storage and querying
Setting up real-time stream processing (Kafka, Flink, Kinesis)
Optimizing data infrastructure costs (storage tiering, compute efficiency)
Implementing data governance and compliance (GDPR, data lineage)
Migrating legacy data systems to modern data platforms

Quick Start

Invoke this skill for any of the tasks listed under When to Use.

Do NOT invoke for:

SQL query optimization only (use database-optimizer instead)
Machine learning model development (use ml-engineer or data-scientist)
Simple data analysis or visualization (use data-analyst)
Database administration tasks (use database-administrator)
API integration without data transformation (use backend-developer)

Decision Framework

Pipeline Architecture Selection

├─ Batch Processing?
│   ├─ Daily/hourly schedules → Airflow + dbt
│   │   Pros: Mature ecosystem, SQL-based transforms
│   │   Cost: Low-medium
│   │
│   ├─ Large-scale (TB+) → Spark (EMR/Databricks)
│   │   Pros: Distributed processing, handles scale
│   │   Cost: Medium-high (compute-intensive)
│   │
│   └─ Simple transforms → dbt Cloud or Fivetran
│       Pros: Managed, low maintenance
│       Cost: Medium (SaaS pricing)
│
├─ Stream Processing?
│   ├─ Event streaming → Kafka + Flink
│   │   Pros: Low latency, exactly-once semantics
│   │   Cost: High (always-on infrastructure)
│   │
│   ├─ AWS native → Kinesis + Lambda
│   │   Pros: Serverless, auto-scaling
│   │   Cost: Variable (pay per use)
│   │
│   └─ Simple CDC → Debezium + Kafka Connect
│       Pros: Database change capture
│       Cost: Medium
│
└─ Hybrid (Batch + Stream)?
    └─ Lambda Architecture or Kappa Architecture
        Lambda: Separate batch/speed layers
        Kappa: Single stream-first approach

Data Storage Selection

| Use Case | Technology | Pros | Cons |
|----------|------------|------|------|
| Structured analytics | Snowflake/BigQuery | SQL, fast queries | Cost at scale |
| Semi-structured | Delta Lake/Iceberg | ACID, schema evolution | Complexity |
| Raw storage | S3/GCS | Cheap, durable | No query engine |
| Real-time | Redis/DynamoDB | Low latency | Limited analytics |
| Time-series | TimescaleDB/InfluxDB | Optimized for time data | Specific use case |

ETL vs ELT Decision

| Factor | ETL (Transform First) | ELT (Load First) |
|--------|----------------------|------------------|
| Data volume | Small-medium | Large (TB+) |
| Transformation | Complex, pre-load | SQL-based, in-warehouse |
| Latency | Higher | Lower |
| Cost | Compute before load | Warehouse compute |
| Best for | Legacy systems | Modern cloud DW |

Core Patterns

Pattern 1: Idempotent Partition Overwrite

Use case: Safely re-run batch jobs without creating duplicates.

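# PySpark example: overwrite only the partition for the execution date
from pyspark.sql import functions as F

def write_daily_partition(df, target_table, execution_date):
    # Tag every row with the run date so a re-run lands in the same partition
    # (assumes process_date is derived from execution_date, not already set).
    df = df.withColumn("process_date", F.lit(execution_date))
    # Dynamic mode replaces only the partitions present in df. saveAsTable in
    # overwrite mode would replace the whole table, so insertInto is used.
    df.sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")  # df.sparkSession: PySpark 3.3+
    # insertInto assumes target_table already exists, is partitioned by
    # process_date, and has process_date as its last column (matched by position).
    df.write.mode("overwrite").insertInto(target_table)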

Pattern 2: Slowly Changing Dimension Type 2 (SCD2)

Use case: Track history of changes without losing past states.

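-- dbt sketch of SCD2: one row per (user_id, updated_at) version.
-- The key includes updated_at so earlier versions of a user are kept; with
-- unique_key='user_id' alone, the incremental merge treats all versions of a
-- user as one row, which can fail or overwrite history depending on the adapter.
-- dbt snapshots are the built-in alternative for SCD2.
{{ config(materialized='incremental', unique_key=['user_id', 'updated_at']) }}

SELECT
    user_id, address, email, status,
    updated_at,  -- doubles as valid_from
    LEAD(updated_at, 1, '9999-12-31') OVER (
        PARTITION BY user_id ORDER BY updated_at
    ) AS valid_to
FROM {{ source('raw', 'users') }}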
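
To make the windowing concrete, here is a small plain-Python sketch of what the LEAD expression computes: each version's valid_to is the next updated_at for the same user, or the 9999-12-31 sentinel for the current version. The function name `add_valid_to` and the sample rows are hypothetical, and dates are assumed to be ISO strings so that they sort correctly as text.

```python
from itertools import groupby
from operator import itemgetter

OPEN_END = "9999-12-31"  # sentinel for the current version, as in the SQL above


def add_valid_to(rows):
    """Attach valid_to to each version: the next updated_at for the same user."""
    ordered = sorted(rows, key=itemgetter("user_id", "updated_at"))
    result = []
    for _, versions in groupby(ordered, key=itemgetter("user_id")):
        versions = list(versions)
        # Each version ends where the next one starts; the last stays open.
        next_starts = [v["updated_at"] for v in versions[1:]] + [OPEN_END]
        for version, valid_to in zip(versions, next_starts):
            result.append({**version, "valid_to": valid_to})
    return result


if __name__ == "__main__":
    history = [
        {"user_id": 1, "status": "active", "updated_at": "2024-03-01"},
        {"user_id": 1, "status": "trial", "updated_at": "2024-01-01"},
    ]
    for row in add_valid_to(history):
        print(row)
```

A point-in-time lookup then reduces to filtering on `updated_at <= t < valid_to`.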

Pattern 3: Dead Letter Queue (DLQ) for Streaming

Use case: Handle malformed messages without stopping the pipeline.
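
A minimal sketch of the dead letter queue idea, assuming JSON string messages and in-memory lists standing in for real topics. The names `process_batch` and `REQUIRED_FIELDS`, and the field names themselves, are hypothetical and not part of this skill; in a Kafka deployment the `dead_letters` list would typically be a separate topic.

```python
import json

# Hypothetical required fields for a valid event; adjust to the real schema.
REQUIRED_FIELDS = ("user_id", "event_type")


def process_batch(raw_messages):
    """Split raw messages into parsed events and dead letters.

    Malformed messages are captured with the failure reason instead of
    raising, so one bad record does not stop the rest of the batch.
    """
    processed, dead_letters = [], []
    for raw in raw_messages:
        try:
            event = json.loads(raw)
            if not isinstance(event, dict):
                raise ValueError("payload is not a JSON object")
            missing = [f for f in REQUIRED_FIELDS if f not in event]
            if missing:
                raise ValueError(f"missing fields: {missing}")
            processed.append(event)
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            dead_letters.append({"payload": raw, "error": str(exc)})
    return processed, dead_letters
```

Keeping the original payload next to the error message lets the dead letters be inspected and replayed once the upstream problem is fixed.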

Pattern 4: Data Quality Circuit Breaker

Use case: Stop pipeline execution if data quality drops below threshold.
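
One way to sketch the circuit breaker, assuming rows arrive as dictionaries and that "quality" means the share of rows with all required fields populated. `DataQualityError`, `check_quality`, and the 0.95 default threshold are illustrative assumptions, not values taken from this skill.

```python
class DataQualityError(Exception):
    """Raised to halt the pipeline when quality falls below the threshold."""


def check_quality(rows, required_fields, min_valid_ratio=0.95):
    """Return the valid-row ratio, or raise if it is below min_valid_ratio.

    A row counts as valid when every required field is present and not None.
    """
    if not rows:
        raise DataQualityError("no rows received")
    valid = sum(
        1 for row in rows
        if all(row.get(field) is not None for field in required_fields)
    )
    ratio = valid / len(rows)
    if ratio < min_valid_ratio:
        raise DataQualityError(
            f"valid ratio {ratio:.2%} is below threshold {min_valid_ratio:.2%}"
        )
    return ratio
```

Calling `check_quality` between the extract and load steps means an exception stops the run before bad data reaches downstream tables; an orchestrator would normally mark the task as failed and skip its dependents.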

Quality Checklist

Data Pipeline

[ ] Idempotent (safe to retry)
[ ] Schema validation enforced
[ ] Error handling with retries
[ ] Data quality checks automated
[ ] Monitoring and alerting configured
[ ] Lineage documented
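
The "Error handling with retries" item can be sketched as a small wrapper with exponential backoff. `run_with_retries` and its parameters are hypothetical names; the `sleep` argument is injectable only so the behavior can be exercised without waiting. Retrying is only safe when the wrapped task is idempotent, which is why that item comes first in the checklist.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**(attempt - 1) seconds between attempts and
    re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

In practice the `except` clause would usually be narrowed to errors known to be transient (timeouts, throttling) so that genuine bugs fail fast.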

Performance

[ ] Pipeline completes within SLA (e.g., <1 hour)
[ ] Incremental loading where applicable
[ ] Partitioning strategy optimized
[ ] Query performance <30 seconds (P95)
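
For the P95 item, a rough sketch of how the check could be computed from recorded query durations, using the nearest-rank definition of a percentile. The helper names `p95` and `meets_query_slo` are hypothetical, and the 30-second default simply mirrors the checklist figure.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * 95 / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def meets_query_slo(durations_seconds, limit_seconds=30.0):
    """True when the P95 query duration is under the limit."""
    return p95(durations_seconds) < limit_seconds
```

Other percentile definitions (for example, linear interpolation) give slightly different values on small samples, so the definition should match whatever the monitoring system reports.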

Cost Optimization

[ ] Storage tiering implemented (hot/warm/cold)
[ ] Compute auto-scaling configured
[ ] Query cost monitoring active
[ ] Compression enabled (Parquet/ORC)

Additional Resources

Detailed Technical Reference: See REFERENCE.md
Code Examples & Patterns: See EXAMPLES.md

Description: Use when user needs scalable data pipeline development, ETL/ELT implementation, or data infrastructure design.
Stars: 63
Category: development
Updated: Mar 25, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add 404kidwiz/claude-supercode-skills data-engineer

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Reported value |
|---------|-------------|--------|----------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| VirusTotal | Multi-engine malware detection | Error | 70% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Mar 25, 2026, 5:02 PM (44.7s, 4 files scanned)




Install manually


# Clone the repo
git clone https://github.com/404kidwiz/claude-supercode-skills.git

# Make sure the global Claude Code skills folder exists
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r claude-supercode-skills/data-engineer-skill ~/.claude/skills/


Repository: 404kidwiz/claude-supercode-skills

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT