JosiahSiegel/ml-data-pipeline

Name: ml-data-pipeline
Author: JosiahSiegel

plugins/ml-master/skills/ml-data-pipeline/SKILL.md

npx skillsauth add JosiahSiegel/claude-plugin-marketplace ml-data-pipeline

ML Data Pipeline

Overview

Use this skill for data ingestion, validation, preprocessing, feature engineering, dataset versioning, feature stores, batch and streaming pipelines, and data-quality monitoring. In ML, data pipeline correctness is often more important than model sophistication. A pipeline must produce leakage-safe training data and consistent serving features.

Data Pipeline Invariants

Raw data is immutable or snapshot-addressable.
Schemas, statistics, and quality expectations are validated before training and serving.
Transformations are versioned and reproducible.
Splits are created before leakage-prone operations such as oversampling, target encoding, feature selection, or normalization.
Time-dependent features use only information available at prediction time.
Offline training features match online serving features.
Sensitive data is minimized, access-controlled, encrypted, and audited.
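
The split-before-fit invariant above can be sketched with the standard library alone. This is a minimal illustration, not the skill's own code: the function names `fit_standardizer` and `apply_standardizer` and the toy values are hypothetical.

```python
import statistics


def fit_standardizer(train_values):
    """Learn mean and standard deviation from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against zero variance
    return mean, std


def apply_standardizer(values, params):
    """Apply previously fitted parameters to any split."""
    mean, std = params
    return [(v - mean) / std for v in values]


# Toy data: split first, then fit on the training portion alone.
values = [10.0, 12.0, 14.0, 16.0, 100.0]
train, valid = values[:4], values[4:]

params = fit_standardizer(train)  # the statistics never see `valid`
train_scaled = apply_standardizer(train, params)
valid_scaled = apply_standardizer(valid, params)
```

Fitting on `values` instead of `train` would let the validation outlier shift the mean and scale, which is the kind of leakage the invariant rules out.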

Ingestion and Storage

Choose storage based on data shape and access pattern. Object storage with Parquet/Arrow is a strong default for tabular batch ML. Delta Lake, Apache Iceberg, or Hudi add ACID tables, schema evolution, and time travel. Use warehouses for governed SQL features, vector stores for embedding retrieval, and streaming logs for online behavior. Store raw, cleaned, feature, and model-ready layers separately. For Azure Storage pointer blobs used by ADF to pass Azure ML code asset versions, load ml-azureml-adf-automation.

For large datasets, prefer columnar formats, partitioning by time or high-level domain, compression, predicate pushdown, and manifest files. Avoid many tiny files; compact when necessary. Record dataset snapshot identifiers in every training run.
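
One way to obtain a snapshot identifier to record with each training run is to hash the dataset's file manifest. The sketch below assumes a local directory of partition files; `snapshot_id` is a hypothetical helper, and for large datasets chunked hashing or storage-provided checksums would likely replace `read_bytes`.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot_id(root: Path) -> str:
    """Hash relative paths and file contents into one deterministic identifier."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()[:16]


# Demonstration on a throwaway directory partitioned by date.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    partition = root / "date=2024-01-01"
    partition.mkdir()
    (partition / "part-0.parquet").write_bytes(b"example bytes")
    first = snapshot_id(root)
    unchanged = snapshot_id(root)
    (partition / "part-0.parquet").write_bytes(b"changed bytes")
    changed = snapshot_id(root)
```

The identifier stays the same while the files are unchanged and differs after any file is rewritten, so logging it alongside model artifacts ties a run to the exact data it saw.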

Feast Feature Store Blueprint

Feast is a modular feature store for maintaining online-offline feature consistency.

1. Feature Store Configuration (`feature_store.yaml`)

project: fraud_detection
registry: data/registry.db
provider: local
online_store:
  type: redis
  connection_string: "localhost:6379"
offline_store:
  type: file

2. Feature Definitions (`features.py`)

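from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource, ValueType
from feast.types import Float32, Int64

# Entity keyed on user_id; join_keys names the column used to join feature rows.
user = Entity(name="user_id", value_type=ValueType.INT64, join_keys=["user_id"])

# Offline source. Feast releases that provide Field and schema use timestamp_field
# for the event timestamp; event_timestamp_column is the older, deprecated name.
user_transactions_source = FileSource(
    path="data/user_transactions.parquet",
    timestamp_field="timestamp",
    created_timestamp_column="created_timestamp",
)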

user_transactions_fv = FeatureView(
    name="user_transactions_feature_view",
    entities=[user],
    ttl=timedelta(days=90),
    schema=[
        Field(name="transaction_count_30d", dtype=Int64),
        Field(name="total_amount_30d", dtype=Float32),
    ],
    online=True,
    source=user_transactions_source,
)

Processing Engines

| Engine | Best fit |
|---|---|
| pandas | Small to medium in-memory exploration and simple pipelines |
| Polars | Fast local/lazy columnar processing, larger-than-pandas workloads |
| Spark | Large distributed ETL, lakehouse workflows, SQL + MLlib integration |
| Dask | Python-native distributed arrays/dataframes and custom workloads |
| Ray Data | ML-centric distributed preprocessing integrated with Ray Train/Tune/Serve |
| Beam/Flink/Spark Streaming | Streaming or unified batch/stream dataflows |
| Airflow/Prefect/Dagster | Orchestration, scheduling, retries, lineage, and dependency management |

High-Performance Polars Lazy Aggregation Pipeline

Polars lazy evaluation optimizes the execution plan using predicate and projection pushdowns.

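import polars as pl

def compute_rolling_user_features(transactions_path: str) -> pl.DataFrame:
    lazy_df = (
        pl.scan_parquet(transactions_path)
        # Drop rows without a user before aggregating
        .filter(pl.col("user_id").is_not_null())
        # Parse the timestamp for windowing (assumes it is stored as a string;
        # skip this step if the column is already a Datetime)
        .with_columns(pl.col("timestamp").str.strptime(pl.Datetime))
        # group_by_dynamic requires the index column to be sorted
        .sort("timestamp")
        # 30-day windows that advance one day at a time, computed per user
        .group_by_dynamic(
            index_column="timestamp",
            every="1d",
            period="30d",
            group_by="user_id",
            # Label each window by its end so a row summarizes only earlier events
            label="right",
        )
        .agg([
            pl.col("amount").count().alias("transaction_count_30d"),
            pl.col("amount").sum().alias("total_amount_30d"),
            pl.col("amount").mean().alias("avg_amount_30d"),
        ])
    )
    # collect() optimizes the query plan, executes it, and loads the result into memory
    return lazy_df.collect()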

Spark Dataframe Optimization Recipes

These recipes address two common bottlenecks in distributed data processing: data skew and excessive shuffle overhead.

1. Salting to Prevent Skewed Joins

from pyspark.sql import functions as F

NUM_SALTS = 10

# Add a random salt in [0, NUM_SALTS) so rows with a skewed key spread across partitions
skewed_df = skewed_df.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# Replicate every lookup row once per salt value so each salted key finds its match
lookup_df = lookup_df.withColumn(
    "salt", F.explode(F.array(*[F.lit(i) for i in range(NUM_SALTS)]))
)

# Join on key AND salt to distribute the workload, then drop the helper column
joined_df = skewed_df.join(lookup_df, ["join_key", "salt"], "inner").drop("salt")

2. Broadcast Join for Lookup Tables

from pyspark.sql.functions import broadcast

# Explicitly broadcast small dimension dataframe to executors to avoid shuffling large fact table
optimized_joined = large_fact_df.join(broadcast(small_lookup_df), "entity_id", "inner")

Data Validation

Validate at ingestion, feature generation, training, and serving. Check schema, types, ranges, nulls, uniqueness, duplicates, categorical domains, cardinality, label distribution, timestamp monotonicity, referential integrity, text/image/audio validity, and embedding norms. Tools include Great Expectations, TensorFlow Data Validation, Deequ, pandera, dbt tests, and custom assertions.

Quality checks should fail fast for contract violations and warn for distribution changes that need investigation. Store validation reports with training artifacts. For production, monitor both raw features and post-transform model inputs.
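
The fail-fast versus warn distinction can be sketched in plain Python. This is an illustration under stated assumptions, not a replacement for the tools listed above: `ContractViolation`, `validate_batch`, the `user_id` and `amount` columns, and the 25% tolerance are all hypothetical, and `reference_mean` is assumed to be nonzero.

```python
import warnings


class ContractViolation(ValueError):
    """Raised when a batch breaks a hard data contract."""


def validate_batch(rows, reference_mean, drift_tolerance=0.25):
    """Fail fast on contract violations; warn on distribution drift."""
    if not rows:
        raise ContractViolation("batch is empty")
    seen_ids = set()
    for row in rows:
        if row.get("user_id") is None:
            raise ContractViolation("user_id must not be null")
        if row["user_id"] in seen_ids:
            raise ContractViolation(f"duplicate user_id: {row['user_id']}")
        seen_ids.add(row["user_id"])
        amount = row.get("amount")
        if not isinstance(amount, (int, float)) or amount < 0:
            raise ContractViolation(f"invalid amount: {amount!r}")

    # Distribution check: compare the batch mean with a stored reference value.
    batch_mean = sum(r["amount"] for r in rows) / len(rows)
    relative_shift = abs(batch_mean - reference_mean) / abs(reference_mean)
    if relative_shift > drift_tolerance:
        warnings.warn(f"amount mean shifted by {relative_shift:.0%}")
    return {"rows": len(rows), "amount_mean": batch_mean, "relative_shift": relative_shift}


report = validate_batch(
    [{"user_id": 1, "amount": 10.0}, {"user_id": 2, "amount": 12.0}],
    reference_mean=10.0,
)
```

The returned dictionary is the kind of small report that can be stored next to the training artifacts.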

Feature Engineering

Feature engineering should be tied to the prediction time. For temporal data, compute rolling windows with correct cutoffs, delays, and late-arriving data handling. For target encoding, fit encoders inside cross-validation folds and use smoothing. For categorical features, choose native categorical support, one-hot, hashing, embeddings, or target encoding based on cardinality and model type. For text, version tokenizers and vocabularies. For images/audio, store preprocessing parameters and augmentations.
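
Fold-wise target encoding with smoothing might look like the following sketch. The helper names, the round-robin fold assignment, and the toy data are assumptions made for illustration; time-ordered data would need time-based folds instead.

```python
from collections import defaultdict


def fit_target_encoder(categories, targets, smoothing=10.0):
    """Return a smoothed category -> mean-target mapping and the global prior."""
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Shrink each category mean toward the prior; rare categories shrink most.
    mapping = {
        c: (sums[c] + smoothing * prior) / (counts[c] + smoothing) for c in counts
    }
    return mapping, prior


def out_of_fold_encode(categories, targets, n_folds=3, smoothing=10.0):
    """Encode each row with statistics fitted on the other folds only."""
    encoded = [0.0] * len(categories)
    for fold in range(n_folds):
        fit_idx = [i for i in range(len(categories)) if i % n_folds != fold]
        mapping, prior = fit_target_encoder(
            [categories[i] for i in fit_idx],
            [targets[i] for i in fit_idx],
            smoothing,
        )
        for i in range(fold, len(categories), n_folds):
            # A category unseen in the fitting folds falls back to the prior.
            encoded[i] = mapping.get(categories[i], prior)
    return encoded


cats = ["a", "a", "b", "a", "b", "c"]
ys = [1, 0, 1, 1, 0, 1]
encoded = out_of_fold_encode(cats, ys, n_folds=3, smoothing=1.0)
```

No row's own label contributes to its encoded value, which is what keeps the encoder from leaking the target.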

Prevent leakage by asking: would this feature be known at the moment the model makes the prediction? If not, exclude it or redesign the target and prediction time.
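
The question above translates directly into a cutoff on event time. The following sketch computes a 30-day sum that only sees events strictly before the prediction time; `amount_sum_as_of` and the event tuples are hypothetical.

```python
from datetime import datetime, timedelta


def amount_sum_as_of(events, user_id, prediction_time, window=timedelta(days=30)):
    """Sum amounts in [prediction_time - window, prediction_time) for one user."""
    start = prediction_time - window
    return sum(
        amount
        for uid, ts, amount in events
        if uid == user_id and start <= ts < prediction_time
    )


# Each event is (user_id, event_time, amount).
events = [
    (1, datetime(2024, 1, 1), 20.0),
    (1, datetime(2024, 1, 20), 30.0),
    (1, datetime(2024, 2, 5), 500.0),  # happens after the prediction below
    (2, datetime(2024, 1, 15), 7.0),
]

feature = amount_sum_as_of(events, user_id=1, prediction_time=datetime(2024, 1, 25))
```

The February event is excluded because it would not have been known on January 25, even though it sits in the same table.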

DVC (Data Version Control) Ingest Workflow

DVC ties each version of a large dataset to a Git commit through a small metadata (`.dvc`) file, so the data itself stays out of the Git repository.

1. Initialize and Add Storage

dvc init
dvc remote add -d myremote s3://my-dvc-bucket/raw-data

2. Track a New Dataset Version

# Add dataset to DVC tracking (creates data/raw_transactions.parquet.dvc)
dvc add data/raw_transactions.parquet

# Commit DVC metadata file to Git
git add data/raw_transactions.parquet.dvc data/.gitignore
git commit -m "Track transactions dataset v1.0.0 via DVC"

# Push raw binaries to remote cloud storage
dvc push

3. Retrieve Tracked Version on Another Worker

git pull
dvc pull

Streaming and Online Pipelines

Streaming ML pipelines need event-time handling, watermarks, deduplication, ordering strategy, late data behavior, exactly-once or at-least-once semantics, and replayability. Separate online feature updates from training-label generation. Keep a path to backfill or replay from durable logs when feature logic changes.
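
Event-time handling, deduplication, and late-data routing can be sketched without a streaming engine. This is a simplified single-process illustration: `process_stream`, the integer timestamps, and the five-second lateness are assumptions, and a real pipeline would expire old entries from the deduplication state.

```python
def process_stream(events, allowed_lateness=5):
    """Deduplicate by event_id and route events that arrive behind the watermark.

    Each event is (event_id, event_time); times are plain integers (seconds).
    """
    seen_ids = set()
    max_event_time = None
    accepted, late = [], []
    for event_id, event_time in events:
        if event_id in seen_ids:
            continue  # at-least-once delivery can replay the same event
        seen_ids.add(event_id)
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        # The watermark trails the largest event time seen so far.
        watermark = max_event_time - allowed_lateness
        if event_time < watermark:
            late.append((event_id, event_time))  # keep for backfill, do not drop
        else:
            accepted.append((event_id, event_time))
    return accepted, late


stream = [("a", 100), ("b", 103), ("a", 100), ("c", 110), ("d", 104), ("e", 106)]
accepted, late = process_stream(stream)
```

Here the replayed `a` is ignored and `d` lands in the late list because it arrives more than five seconds behind the newest event; keeping late events rather than discarding them preserves the option to backfill.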

Security and Privacy

Minimize sensitive fields, tokenize or hash identifiers where appropriate, and preserve joinability only when needed. Apply access controls by data layer. Avoid exporting raw PII into experiment trackers or model artifacts. Document retention and deletion policies.
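
Where joinability has to be preserved, a keyed hash is one option: the same identifier always maps to the same token, but the token cannot be recomputed without the key. The sketch below uses HMAC-SHA-256 from the standard library; `pseudonymize` and the example key are hypothetical, and a real key would come from a secret manager.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest that is stable for a given key."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key for illustration only; never hard-code real keys.
key = b"example-key-load-from-a-secret-manager"

token_a = pseudonymize("user-123", key)
token_b = pseudonymize("user-123", key)
token_c = pseudonymize("user-456", key)
```

Two tables pseudonymized with the same key can still be joined on the token, while rotating or withholding the key removes that ability.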

Common Pipeline Failures

Train/serving skew from separate preprocessing implementations.
Future leakage in rolling features, aggregates, or target encoders.
Silent schema drift that coerces strings, nulls, or categories incorrectly.
Duplicate entities across train and validation.
Oversampling or normalization before splitting.
Non-versioned datasets causing irreproducible training.
Streaming features that cannot be backfilled for training.
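
Duplicate entities across train and validation, one of the failures listed above, can be avoided by assigning splits per entity rather than per row. A minimal sketch, assuming string entity IDs; `assign_split` and the sample rows are hypothetical.

```python
import hashlib


def assign_split(entity_id: str, validation_fraction: float = 0.2) -> str:
    """Deterministically map an entity to 'train' or 'validation'."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "validation" if bucket < validation_fraction else "train"


# Each row is (entity_id, amount); one entity can own several rows.
rows = [("user-1", 10.0), ("user-2", 5.0), ("user-1", 12.5), ("user-3", 8.0)]

splits = {"train": [], "validation": []}
for entity_id, amount in rows:
    splits[assign_split(entity_id)].append((entity_id, amount))
```

Because the assignment depends only on the entity ID, every row of an entity lands in the same split, and the split is reproducible across runs and machines.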

Sources

Great Expectations documentation: https://docs.greatexpectations.io/
TensorFlow Data Validation documentation: https://www.tensorflow.org/tfx/data_validation
Feast documentation: https://docs.feast.dev/
DVC documentation: https://dvc.org/doc
Apache Spark MLlib and SQL docs: https://spark.apache.org/docs/latest/
Polars documentation: https://docs.pola.rs/

This skill should be used when the user asks to ingest, clean, validate, transform, version, monitor, or serve ML data and features. PROACTIVELY activate for: (1) data ingestion, preprocessing, feature engineering, leakage prevention, train/serving skew, (2) Spark, Dask, Polars, pandas, Ray Data, streaming pipelines, (3) Great Expectations, TFDV, Deequ, data quality and validation, (4) DVC, lakehouse tables, dataset versioning, lineage, reproducibility, (5) Feast, Tecton, Hopsworks feature stores, point-in-time joins, online/offline features. Provides: scalable, reproducible, leakage-safe ML data pipeline design.

38 stars

testing

Updated May 28, 2026

npx skillsauth add JosiahSiegel/claude-plugin-marketplace ml-data-pipeline

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: May 28, 2026, 7:34 AM; 78.0s; 1 file scanned

Install manually

# Clone the repo
git clone https://github.com/JosiahSiegel/claude-plugin-marketplace.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into it
cp -r claude-plugin-marketplace/plugins/ml-master/skills/ml-data-pipeline ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository

JosiahSiegel/claude-plugin-marketplace

38 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
