Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

htlin222/data-engineer

Name: data-engineer
Author: htlin222

claude.symlink/skills/data-engineer/SKILL.md

npx skillsauth add htlin222/dotfiles data-engineer

Data Engineering

Build scalable data pipelines and analytics infrastructure.

When to use

ETL/ELT pipeline design
Data warehouse modeling
Streaming data processing
Data quality monitoring

Airflow DAG

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


# Placeholder callables so the module imports cleanly; replace with real logic.
def extract_data():
    pass


def transform_data():
    pass


def load_data():
    pass


def validate_data():
    pass


default_args = {
    'owner': 'data-team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,  # needs an 'email' recipient list to take effect
}

with DAG(
    'etl_pipeline',
    default_args=default_args,
    schedule='0 2 * * *',  # Daily at 2 AM; use schedule_interval on Airflow < 2.4
    start_date=datetime(2024, 1, 1),  # Fixed start date; days_ago() is deprecated
    catchup=False,
) as dag:

    extract = PythonOperator(
        task_id='extract',
        python_callable=extract_data,
    )

    transform = PythonOperator(
        task_id='transform',
        python_callable=transform_data,
    )

    load = PythonOperator(
        task_id='load',
        python_callable=load_data,
    )

    validate = PythonOperator(
        task_id='validate',
        python_callable=validate_data,
    )

    # Run the tasks strictly in sequence
    extract >> transform >> load >> validate
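The DAG above leaves the bodies of its four callables open. A minimal sketch of what they could do follows, assuming a hypothetical layout in which every run reads and writes JSON files named after its logical date (`ds`). The function bodies, the file names, the sample rows, and the `base_dir` parameter are all illustrative assumptions, not part of the original skill.

```python
import json
from pathlib import Path


def extract_data(ds: str, base_dir: Path) -> Path:
    """Write the raw records for logical date `ds` (hard-coded sample rows here)."""
    rows = [
        {"sale_id": 1, "amount": "19.99", "date": ds},
        {"sale_id": 2, "amount": "5.00", "date": ds},
    ]
    out = base_dir / f"raw_{ds}.json"
    out.write_text(json.dumps(rows))  # overwriting the file keeps a re-run safe
    return out


def transform_data(ds: str, base_dir: Path) -> Path:
    """Cast amounts to float and write the cleaned file for `ds`."""
    rows = json.loads((base_dir / f"raw_{ds}.json").read_text())
    cleaned = [{**r, "amount": float(r["amount"])} for r in rows]
    out = base_dir / f"clean_{ds}.json"
    out.write_text(json.dumps(cleaned))
    return out


def load_data(ds: str, base_dir: Path) -> Path:
    """Replace the target file for `ds` (a stand-in for a warehouse write)."""
    rows = json.loads((base_dir / f"clean_{ds}.json").read_text())
    out = base_dir / f"warehouse_{ds}.json"
    out.write_text(json.dumps(rows))
    return out


def validate_data(ds: str, base_dir: Path) -> int:
    """Raise if the loaded partition is empty or contains a negative amount."""
    rows = json.loads((base_dir / f"warehouse_{ds}.json").read_text())
    if not rows:
        raise ValueError(f"no rows loaded for {ds}")
    if any(r["amount"] < 0 for r in rows):
        raise ValueError(f"negative amount found for {ds}")
    return len(rows)


def run_pipeline(ds: str, base_dir: Path) -> int:
    """Run the four steps in the same order as the DAG's dependency chain."""
    extract_data(ds, base_dir)
    transform_data(ds, base_dir)
    load_data(ds, base_dir)
    return validate_data(ds, base_dir)
```

Because each step overwrites the file for its own date, running the same date twice leaves the same result. In Airflow 2, `PythonOperator` passes context values such as `ds` to callables whose signatures name them, and `op_kwargs` could supply `base_dir`; treat that wiring as an assumption to check against your Airflow version.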

Data warehouse schema

Star schema

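-- Dimension tables (created first so the fact table's foreign keys resolve)
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,
    date DATE,
    year INT,
    quarter INT,
    month INT,
    week INT,
    day_of_week INT
);

CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,  -- surrogate key, one per product version
    product_id VARCHAR(50),       -- natural (business) key
    name VARCHAR(255),
    category VARCHAR(100),
    -- SCD Type 2 fields
    valid_from DATE,
    valid_to DATE,
    is_current BOOLEAN
);

-- Minimal placeholder: fact_sales references this table, but the original
-- does not define it; the columns here are illustrative.
CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    customer_id VARCHAR(50),
    name VARCHAR(255)
);

-- Fact table
CREATE TABLE fact_sales (
    sale_id BIGINT PRIMARY KEY,
    date_key INT REFERENCES dim_date(date_key),
    product_key INT REFERENCES dim_product(product_key),
    customer_key INT REFERENCES dim_customer(customer_key),
    quantity INT,
    amount DECIMAL(10,2),
    created_at TIMESTAMP DEFAULT NOW()
);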
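The `valid_from`, `valid_to`, and `is_current` columns on `dim_product` only pay off if changes are applied in a particular way: the current row is closed and a new row is opened. The sketch below shows one way to do that, using Python's built-in `sqlite3` as a stand-in for a real warehouse. The helper name `apply_scd2_change`, the `9999-12-31` open-ended sentinel, and the sample product are assumptions made for illustration.

```python
import sqlite3

OPEN_END = "9999-12-31"  # assumed sentinel meaning "still current"


def apply_scd2_change(conn, product_id, name, category, change_date):
    """Close the current row for product_id (if any) and insert a new current row."""
    row = conn.execute(
        "SELECT product_key, name, category FROM dim_product "
        "WHERE product_id = ? AND is_current = 1",
        (product_id,),
    ).fetchone()
    if row is not None and (row[1], row[2]) == (name, category):
        return row[0]  # attributes unchanged; keep the existing version
    if row is not None:
        # Expire the old version on the date the change takes effect
        conn.execute(
            "UPDATE dim_product SET valid_to = ?, is_current = 0 "
            "WHERE product_key = ?",
            (change_date, row[0]),
        )
    cur = conn.execute(
        "INSERT INTO dim_product "
        "(product_id, name, category, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, ?, ?, 1)",
        (product_id, name, category, change_date, OPEN_END),
    )
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,  -- surrogate key, auto-assigned by SQLite
        product_id TEXT,
        name TEXT,
        category TEXT,
        valid_from TEXT,
        valid_to TEXT,
        is_current INTEGER
    )
    """
)

first_key = apply_scd2_change(conn, "P-1", "Widget", "Tools", "2024-01-01")
second_key = apply_scd2_change(conn, "P-1", "Widget", "Hardware", "2024-06-01")
history = conn.execute(
    "SELECT product_key, category, valid_from, valid_to, is_current "
    "FROM dim_product ORDER BY product_key"
).fetchall()
```

After the two calls, `history` holds an expired `Tools` row and a current `Hardware` row. Fact rows keep pointing at whichever `product_key` was current when the sale happened, which is what preserves history in reports.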

Spark job

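from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # avoids shadowing Python's built-in sum

# partitionOverwriteMode=dynamic makes "overwrite" replace only the partitions
# present in the result instead of wiping the whole output path.
spark = SparkSession.builder \
    .appName("ETL Job") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic") \
    .getOrCreate()

# Read Parquet; the schema comes from the file metadata, so inferSchema is not needed.
# If the source is partitioned by date, this filter prunes partitions at read time.
df = spark.read \
    .parquet("s3://bucket/data/") \
    .filter(F.col("date") >= "2024-01-01")

# Transform
result = df \
    .groupBy("category", "date") \
    .agg(
        F.sum("amount").alias("total_amount"),
        F.avg("quantity").alias("avg_quantity")
    ) \
    .repartition(10, "date")  # Co-locate each date's rows to limit small output files

# Write partitioned
result.write \
    .mode("overwrite") \
    .partitionBy("date") \
    .parquet("s3://bucket/output/")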

Data quality

import great_expectations as gx
import great_expectations.expectations as gxe

# Written against the Great Expectations 1.x API; earlier releases differ.
suite = gx.ExpectationSuite(name="sales_data")

# Define expectations
suite.add_expectation(
    gxe.ExpectColumnValuesToNotBeNull(column="sale_id")
)
suite.add_expectation(
    gxe.ExpectColumnValuesToBeBetween(
        column="amount", min_value=0, max_value=1000000
    )
)
suite.add_expectation(
    gxe.ExpectColumnValuesToBeUnique(column="sale_id")
)
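To make the three expectations above concrete, here is a dependency-free sketch of what each one checks, written in plain Python rather than with Great Expectations. The function names, the report keys, and the sample rows are hypothetical; a real suite would run against a DataFrame or a database table.

```python
from collections import Counter


def check_not_null(rows, column):
    """Return indexes of rows where `column` is missing or None."""
    return [i for i, r in enumerate(rows) if r.get(column) is None]


def check_between(rows, column, min_value, max_value):
    """Return indexes of rows whose non-null value is outside [min_value, max_value]."""
    return [
        i
        for i, r in enumerate(rows)
        if r.get(column) is not None and not (min_value <= r[column] <= max_value)
    ]


def check_unique(rows, column):
    """Return the non-null values of `column` that occur more than once."""
    counts = Counter(r.get(column) for r in rows if r.get(column) is not None)
    return sorted(v for v, n in counts.items() if n > 1)


# Sample rows with one problem of each kind
sales = [
    {"sale_id": 1, "amount": 19.99},
    {"sale_id": 2, "amount": -5.00},     # amount below the allowed range
    {"sale_id": 2, "amount": 42.00},     # duplicate sale_id
    {"sale_id": None, "amount": 10.00},  # missing sale_id
]

report = {
    "null_sale_id": check_not_null(sales, "sale_id"),
    "amount_out_of_range": check_between(sales, "amount", 0, 1_000_000),
    "duplicate_sale_id": check_unique(sales, "sale_id"),
}
```

Each check returns the offending rows or values instead of a bare boolean, so a failed validation task can log exactly what to investigate.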

Best practices

Idempotent operations (re-runnable)
Incremental processing over full refresh
Data lineage tracking
Schema evolution handling
Cost monitoring for cloud services
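The first two practices tend to work together: an incremental load picks up only rows at or after a high-water mark, and an idempotent write lets the same batch be applied twice without creating duplicates. The sketch below illustrates the idea with Python's built-in `sqlite3`. The `events` table, its columns, and the `incremental_load` helper are hypothetical, and timestamps are assumed to be ISO-8601 strings so that string comparison matches time order.

```python
import sqlite3


def incremental_load(conn, source_rows):
    """Apply rows at or after the target's high-water mark; return how many were applied."""
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM events"
    ).fetchone()
    # ">=" re-applies the boundary row; the keyed write below makes that harmless
    batch = [r for r in source_rows if r["updated_at"] >= watermark]
    conn.executemany(
        "INSERT OR REPLACE INTO events (event_id, payload, updated_at) "
        "VALUES (:event_id, :payload, :updated_at)",
        batch,
    )
    conn.commit()
    return len(batch)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)"
)

source = [
    {"event_id": 1, "payload": "signup", "updated_at": "2024-01-01T00:00:00"},
    {"event_id": 2, "payload": "login", "updated_at": "2024-01-02T00:00:00"},
]
first = incremental_load(conn, source)   # empty target: both rows are applied
second = incremental_load(conn, source)  # re-run: only the boundary row is re-applied

source.append(
    {"event_id": 3, "payload": "purchase", "updated_at": "2024-01-03T00:00:00"}
)
third = incremental_load(conn, source)   # boundary row plus the new row

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Keying the write on `event_id` is what makes the re-run safe: the table ends with three rows no matter how many times the load is repeated.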

Examples

Input: "Design ETL for user events"
Action: Create Airflow DAG with extract/transform/load, add quality checks

Input: "Optimize slow Spark job"
Action: Check partitioning, reduce shuffles, tune memory settings

Build ETL pipelines, data warehouses, and streaming architectures. Use for data pipeline design or analytics infrastructure.

76 stars

development

Updated Apr 5, 2026

npx skillsauth add htlin222/dotfiles data-engineer

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy, container and dependency vulnerability scanner: Clean (95%)
Semgrep, static code analysis for vulnerabilities: Clean (95%)
mcp-scan (Snyk), Model Context Protocol security validation: Clean (95%)
Snyk (dep), open source security scanning: Skipped (50%)
Socket.dev, supply chain security analysis: Skipped (50%)
VirusTotal, multi-engine malware detection: Skipped (50%)
CrowdStrike, advanced threat intelligence: Skipped (50%)
OSV-Scanner, Open Source Vulnerability database check: Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 5, 2026, 8:05 PM · 8.0s · 1 file scanned

Related Skills

htlin222/zettel

testing

Verified, Trusted, Community

Converts narrative medical text into Pocket Medicine bullet-style notes with proper abbreviations, then modularizes sections exceeding 20 lines into linked standalone files.

76 stars · SKILL.md · Updated Apr 5, 2026

htlin222/vm-docker

devops

Verified, Trusted, Community

Use when deploying Docker services on the local VM (hostname: vm, Pop!_OS) with Traefik reverse proxy and Homepage dashboard. Covers crane image workflow, Traefik file-provider registration, Homepage services.yaml entries, and compose templates on the traefik-proxy network.

76 stars · SKILL.md · Updated Apr 5, 2026

htlin222/visualization-check

development

Verified, Trusted, Community

Use when reviewing a data visualization or figure for clarity, checking if a graph communicates its message without additional context, or iterating on R/Python plot scripts until a naive reader can fully understand the figure.

76 stars · SKILL.md · Updated Apr 5, 2026

htlin222/vale

development

Verified, Trusted, Community

Runs Vale prose linter on markdown/text files and auto-fixes issues. Use when the user asks to lint, proofread, or improve writing quality of markdown or text files.

76 stars · SKILL.md · Updated Apr 5, 2026

Download

For Claude Desktop: download once, then upload the file in the app. No terminal is needed.

Install manually

# Clone the repo
git clone https://github.com/htlin222/dotfiles.git

# Copy into Claude Code skills folder (global)
cp -r dotfiles/claude.symlink/skills/data-engineer ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository

htlin222/dotfiles

76 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT