dtsong/python-data-engineering

Name: python-data-engineering
Author: dtsong

skills/data-engineering-skills/python-data-engineering/SKILL.md

npx skillsauth add dtsong/my-claude-setup python-data-engineering

Python Data Engineering Skill

Expert guidance for Python data engineering: DataFrame libraries (Polars, Pandas, PySpark), dbt Python models, API extraction, and data validation. Assumes Python proficiency.

Scope Constraints

SQL transforms in dbt: hand off to dbt-transforms
DLT pipeline config: hand off to data-integration
Kafka/Flink streaming: hand off to event-streaming
Dagster/Airflow orchestration: hand off to data-pipelines
General Python or web development: out of scope

When to Use

Activate when: choosing between DataFrame libraries, writing Polars/Pandas/PySpark transforms, building dbt Python models, building API extraction scripts, implementing data validation (Pydantic/Pandera/GX), optimizing DataFrame memory, or converting between DataFrame formats.

Model Routing

| reasoning_demand | preferred | acceptable | minimum |
|------------------|-----------|------------|---------|
| medium | Sonnet | Sonnet, Opus | Sonnet |

Core Principles

1. Type Safety First

Annotate all functions. Data pipelines process untrusted data — types catch errors before production.

from datetime import date
from decimal import Decimal

import polars as pl
from pydantic import BaseModel


class Order(BaseModel):
    """Row-level contract for a single order record."""
    order_id: str
    customer_id: str
    amount: Decimal
    order_date: date


def transform_orders(raw: pl.LazyFrame) -> pl.LazyFrame:
    """Keep positive-amount orders, parse column types, and deduplicate by order_id."""
    return (
        raw.filter(pl.col("amount") > 0)
        .with_columns(
            pl.col("order_date").str.to_date("%Y-%m-%d"),  # string -> Date
            pl.col("amount").cast(pl.Decimal(10, 2)),  # fixed-point money
        )
        .unique(subset=["order_id"])
    )

2. Immutable Transforms

Never mutate DataFrames in place. Return new DataFrames from every transformation for reproducibility and testability.
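A minimal sketch of this principle in Pandas, where in-place mutation is easiest to do by accident. The function name and the discount column are hypothetical, chosen only for illustration.

```python
import pandas as pd


def add_net_amount(orders: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame with a net_amount column; the input is left untouched."""
    # assign() builds a new DataFrame instead of writing into `orders`.
    return orders.assign(net_amount=orders["amount"] - orders["discount"])


raw = pd.DataFrame({"amount": [100.0, 40.0], "discount": [10.0, 0.0]})
enriched = add_net_amount(raw)
```

Because raw is unchanged after the call, the same fixture can be reused across tests and the function can be re-run safely.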

3. Lazy Evaluation When Possible

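Prefer lazy evaluation (Polars LazyFrame, Spark DataFrame). The engine sees the whole query before it runs, so it can push filters and column selections down to the scan and skip work the result does not need.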

4. Memory Efficiency

Use appropriate dtypes (Int32 not Int64 when range allows). Stream/scan instead of loading entire files. Prefer columnar formats (Parquet, Arrow) over CSV/JSON.
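One way to see the effect of dtype choice is to compare memory before and after downcasting. This is a sketch on synthetic Pandas data (column names are illustrative); actual savings depend on the value ranges and cardinality of the real columns.

```python
import pandas as pd

df = pd.DataFrame(
    {"quantity": range(1000), "status": ["shipped", "pending"] * 500}
)
before = df.memory_usage(deep=True).sum()

compact = df.assign(
    # Pick the smallest integer type that can hold the observed values.
    quantity=pd.to_numeric(df["quantity"], downcast="integer"),
    # Low-cardinality strings are stored once each as categories.
    status=df["status"].astype("category"),
)
after = compact.memory_usage(deep=True).sum()
```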

5. Test Data Pipelines

Test transforms with small, representative fixtures. Use polars.testing.assert_frame_equal or pandas.testing.assert_frame_equal.

DataFrame Library Decision Matrix

| Factor | Polars | Pandas | PySpark | DuckDB (Python) |
|--------|--------|--------|---------|-----------------|
| Data size | Single machine (GB-TB via streaming) | Single machine (MB-GB) | Distributed cluster (TB-PB) | Single machine (GB-TB) |
| Speed | Very fast (Rust, multi-threaded) | Moderate (single-threaded) | Fast at scale (distributed) | Very fast (vectorized) |
| Memory | Efficient (Arrow-native, streaming) | Inefficient (copies, object dtype) | Efficient (distributed) | Efficient (out-of-core) |
| API style | Expression-based, method chaining | Index-based, mixed paradigms | SQL-like DataFrame API | SQL-first, DataFrame bridge |
| Lazy eval | Yes (LazyFrame) | No (eager only) | Yes (execution plan) | Yes (query plan) |
| dbt support | Via DataFrame return | Native (dbt-core) | Via dbt-spark | Via dbt-duckdb |
| Best for | New projects, performance-critical | Legacy code, ML integration | Big data, Databricks | Analytics, local dev |

Polars (Primary)

import polars as pl

orders = (
    pl.scan_parquet("raw/orders/*.parquet")  # lazy scan; nothing is read yet
    .filter(
        (pl.col("status").is_in(["completed", "shipped"]))
        & (pl.col("amount") > 0)
    )
    .with_columns(
        # An expression without alias() replaces the column of the same name,
        # so order_date is parsed in place.
        pl.col("order_date").str.to_date("%Y-%m-%d"),
        pl.col("amount").cast(pl.Decimal(10, 2)),
        (pl.col("amount") * pl.col("tax_rate")).round(2).alias("tax_amount"),
    )
    .unique(subset=["order_id"])
    .sort("order_date")
    .collect()  # execute the optimized plan
)

For aggregations, joins, window functions, streaming, Arrow interop, DuckDB bridge, and performance tuning, see Polars Patterns Reference.

Pandas (Legacy/Compatibility)

Use when integrating with ML libraries or maintaining existing codebases. Prefer method chaining and vectorized operations.

For method chaining, Arrow backend, memory optimization, chunked processing, and anti-patterns, see Pandas Patterns Reference.
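A short sketch of the method-chaining style recommended here, using made-up order data and an assumed flat 10% tax rate. Each step returns a new frame and every operation is vectorized (no row loops or apply).

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "order_id": ["a", "b", "b", "c"],
        "status": ["completed", "shipped", "shipped", "cancelled"],
        "amount": [10.0, 25.0, 25.0, 5.0],
    }
)

clean = (
    raw.loc[lambda d: d["status"].isin(["completed", "shipped"])]
    .drop_duplicates(subset="order_id")
    # Vectorized arithmetic on the whole column; 1.1 is an illustrative rate.
    .assign(amount_with_tax=lambda d: (d["amount"] * 1.1).round(2))
    .reset_index(drop=True)
)
```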

PySpark (Distributed)

Use when data exceeds single-machine memory or when running on Databricks/Spark infrastructure.

For DataFrame API, Pandas UDFs, Spark Connect, Delta Lake, partitioning, and caching, see PySpark Patterns Reference.

dbt Python Models

Use Python models for transforms that are difficult to express in SQL: complex statistics, ML scoring, API calls, complex string parsing, or external library integration. Use SQL for joins, filters, aggregations, window functions, CTEs, and standard ELT.

# models/intermediate/int_customer_rfm.py
def model(dbt, session):
    """RFM scoring: requires Pandas groupby + qcut."""
    dbt.config(materialized="table", packages=["scikit-learn==1.4.0"])
    orders = dbt.ref("stg_orders").to_pandas()
    # ... transform with Pandas/sklearn and bind the output to `result`
    result = orders  # placeholder so the skeleton runs end to end
    return result  # dbt writes the returned DataFrame to the warehouse

On Snowflake, session is a Snowpark Session and dbt.ref() returns a Snowpark DataFrame (not Pandas).
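The model above leaves the transform elided. As one possible shape for it, here is a plain Pandas function for the groupby + qcut scoring the docstring mentions. The column names (customer_id, order_id, amount, order_date), the quartile count, and the as_of parameter are assumptions for illustration, not part of the original model.

```python
import pandas as pd


def score_rfm(orders: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Aggregate orders per customer and bucket each RFM metric into quartiles."""
    rfm = orders.groupby("customer_id", as_index=False).agg(
        last_order=("order_date", "max"),
        frequency=("order_id", "count"),
        monetary=("amount", "sum"),
    )
    rfm["recency_days"] = (as_of - rfm["last_order"]).dt.days

    for metric in ("recency_days", "frequency", "monetary"):
        # Fewer days since the last order is better, so recency ranks descending.
        ascending = metric != "recency_days"
        # rank(method="first") breaks ties so qcut always gets unique bin edges.
        ranks = rfm[metric].rank(method="first", ascending=ascending)
        rfm[f"{metric}_score"] = pd.qcut(ranks, 4, labels=False) + 1
    return rfm
```

Keeping the scoring in a standalone function means it can be unit-tested with a small fixture and then called from model() on the frame returned by dbt.ref().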

API Extraction

Build typed clients with Pydantic models, pagination, retry (tenacity), and rate limiting. Use httpx for sync/async HTTP.

For typed client patterns, pagination (cursor/offset/link-header), rate limiting, async extraction, and complete pipeline examples, see Extraction Patterns Reference.
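To make the pagination and retry ideas concrete, here is a standard-library-only sketch. It swaps in a hand-rolled exponential backoff where the skill recommends tenacity, and it takes the page-fetching function as a parameter instead of calling httpx directly. The response shape (data and next_cursor keys) is an assumption about a hypothetical API.

```python
import time
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]
Fetch = Callable[[Optional[str]], Page]


def with_retry(fetch: Fetch, attempts: int = 3, base_delay: float = 0.5) -> Fetch:
    """Wrap fetch so connection errors are retried with exponential backoff."""

    def wrapped(cursor: Optional[str]) -> Page:
        for attempt in range(attempts):
            try:
                return fetch(cursor)
            except ConnectionError:
                if attempt == attempts - 1:
                    raise  # out of attempts: surface the error
                time.sleep(base_delay * 2**attempt)
        raise RuntimeError("attempts must be at least 1")

    return wrapped


def iter_records(fetch: Fetch) -> Iterator[Dict[str, Any]]:
    """Yield records page by page until the API stops returning a cursor."""
    cursor: Optional[str] = None
    while True:
        page = fetch(cursor)
        yield from page["data"]
        cursor = page.get("next_cursor")
        if cursor is None:
            break
```

In a real client, fetch would wrap an HTTP call and each record would be parsed into a Pydantic model before it is yielded. Injecting fetch keeps the pagination logic testable without network access.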

Data Validation

Pydantic for row-level validation (API responses, individual records)
Pandera for DataFrame-level validation (column types, constraints)
Great Expectations for suite-level validation (warehouse tables, CI gates)
dbt tests for model-level assertions in dbt projects

For Pydantic v2 patterns, Pandera schemas, Great Expectations checkpoints, and contract testing, see Data Validation Patterns Reference.
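For the row-level case, a minimal Pydantic v2 sketch might separate valid records from rejected ones instead of failing the whole batch. It reuses the Order model from the Type Safety section; the split_valid helper and the shape of the rejection records are hypothetical.

```python
from datetime import date
from decimal import Decimal

from pydantic import BaseModel, ValidationError


class Order(BaseModel):
    order_id: str
    customer_id: str
    amount: Decimal
    order_date: date


def split_valid(rows: list[dict]) -> tuple[list[Order], list[dict]]:
    """Validate each row; return parsed orders and the rows that failed."""
    valid: list[Order] = []
    rejected: list[dict] = []
    for row in rows:
        try:
            valid.append(Order.model_validate(row))
        except ValidationError as exc:
            # Keep the raw row with its errors so it can be logged or quarantined.
            rejected.append({"row": row, "errors": exc.errors()})
    return valid, rejected
```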

DataFrame Interoperability

| Conversion | Method | Notes |
|-----------|--------|-------|
| Polars -> Pandas | polars_df.to_pandas() | Copies data |
| Pandas -> Polars | pl.from_pandas(df) | Zero-copy when possible |
| Polars <-> Arrow | to_arrow() / pl.from_arrow() | Zero-copy |
| Polars <-> DuckDB | duckdb.sql("SELECT ... FROM df").pl() | Zero-copy via Arrow |
| Spark -> Pandas | spark_df.toPandas() | Pulls to driver; use limit() for large data |
| Pandas -> Spark | spark.createDataFrame(df) | Enable Arrow: spark.sql.execution.arrow.pyspark.enabled=true |

Security

Credentials: always read them from os.environ["KEY"]; never hard-code them inline. Document required variables in .env.example. Build connection strings from environment variables. Close httpx clients in finally blocks, or use them as context managers. Never store credentials in notebook cells.

See Security & Compliance Patterns for the full framework including security tiers.
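A small helper along these lines makes a missing credential fail fast with a clear message. The function name and the variable name used in the usage note are hypothetical.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        # Report only the variable name; never echo secret values into logs.
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

A pipeline would call this once at startup, for example require_env("WAREHOUSE_DSN"), so a misconfigured environment is caught before any data is read.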

Reference Files

Polars Patterns — LazyFrame, expressions, aggregations, joins, windows, streaming, Arrow interop, performance
Pandas Patterns — Arrow backend, method chaining, memory optimization, anti-patterns
PySpark Patterns — DataFrame API, Pandas UDFs, Spark Connect, Delta Lake, Databricks
Data Validation Patterns — Pydantic v2, Pandera, Great Expectations, contract testing
Extraction Patterns — httpx clients, async extraction, pagination, rate limiting, retry

Use this skill when writing Python code for data pipelines or transformations. Covers Polars, Pandas, PySpark DataFrames, dbt Python models, API extraction scripts, and data validation with Pydantic or Pandera. Common phrases: "Polars vs Pandas", "PySpark DataFrame", "validate this data", "Python extraction script". Do NOT use for SQL-based dbt models (use dbt-transforms) or integration architecture (use data-integration).

4 stars

development

Updated Apr 26, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add dtsong/my-claude-setup python-data-engineering

Security Scan Results

3 of 9 scanners reported clean; the other six were skipped.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 26, 2026, 4:38 AM. Scan duration: 218.5s. Files scanned: 8.

SKILL.md

name: python-data-engineering
description: Use this skill when writing Python code for data pipelines or transformations. Covers Polars, Pandas, PySpark DataFrames, dbt Python models, API extraction scripts, and data validation with Pydantic or Pandera. Common phrases: "Polars vs Pandas", "PySpark DataFrame", "validate this data", "Python extraction script". Do NOT use for SQL-based dbt models (use dbt-transforms) or integration architecture (use data-integration).
preferred: sonnet
acceptable: [sonnet, opus]
minimum: sonnet
allow_downgrade: false
reasoning_demand: medium
version: 1.0.0

Related Skills

dtsong/enterprise-search-strategy

development

Verified · Trusted · Community

Use when the council needs to surface organizational knowledge buried across multiple internal sources (wikis, design docs, ADRs, past tickets, postmortems, chat archives, code repos). Plans where to look, what to cross-reference, and how to synthesize findings into evidence the council can act on. Do not use for external market research (use competitive-analysis), library evaluation (use library-evaluation), or technology trend assessment (use technology-radar).

5 · SKILL.md · Updated Jun 23, 2026

dtsong/docx-to-pdf

testing

Verified · Trusted · Community

Use to convert a Word .docx file to PDF and/or verify its page count. Triggers on: converting docx to pdf, rendering a document, checking how many pages a docx produces, or asserting a page-count constraint (e.g. a resume must stay 2 pages). Wraps LibreOffice headless conversion.

5 · SKILL.md · Updated Jun 11, 2026

dtsong/web-security-hardening

development

Verified · Trusted · Community

Security audit checklist for web applications. Use when reviewing, auditing, or hardening a web app's security posture. Covers rate limiting, auth headers, IP blocking, CORS, security middleware, input validation, file upload limits, ORM usage, and password hashing. Triggers on requests like "review security", "harden this app", "security audit", "check for vulnerabilities", or when building/reviewing API endpoints.

5 · SKILL.md · Updated Apr 28, 2026

dtsong/prompt-wizard

development

Verified · Trusted · Community

Interactive wizard to craft effective prompts using Claude Code best practices

5 · SKILL.md · Updated Apr 28, 2026

Install manually

# Clone the repo
git clone https://github.com/dtsong/my-claude-setup.git

# Make sure the global Claude Code skills folder exists
mkdir -p ~/.claude/skills

# Copy the skill into it
cp -r my-claude-setup/skills/data-engineering-skills/python-data-engineering ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: dtsong/my-claude-setup (4 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT