.claude/skills/data-validation/SKILL.md
# Data Validation

## When to Load This Skill
Load when working with: Pandera DataFrame schemas, Great Expectations suites, data quality checks, input validation for ML pipelines, data contracts between pipeline stages.
## Pandera: DataFrame Schema Validation

Define schemas declaratively and validate at pipeline boundaries:

```python
import pandera as pa
from pandera.typing import DataFrame, Series


class InputSchema(pa.DataFrameModel):
    user_id: Series[int] = pa.Field(ge=0, nullable=False)
    age: Series[float] = pa.Field(ge=0, le=120, nullable=True)
    category: Series[str] = pa.Field(isin=["A", "B", "C"])
    score: Series[float] = pa.Field(ge=0.0, le=1.0)

    class Config:
        strict = True
        coerce = True


@pa.check_types
def preprocess(df: DataFrame[InputSchema]) -> DataFrame[InputSchema]:
    return df.dropna(subset=["user_id"])
```
Validate without the decorator:

```python
# df is an existing pandas DataFrame; InputSchema is the model defined above.
try:
    InputSchema.validate(df, lazy=True)
except pa.errors.SchemaErrors as e:
    print(e.failure_cases)
```
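The difference between eager and lazy validation can be sketched without Pandera. The snippet below is a minimal stdlib illustration, not Pandera's implementation; `Failure`, `ValidationFailed`, `validate_rows`, and `count_failures` are hypothetical names introduced only for this example, and rows are assumed to be plain dicts.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    row: int
    column: str
    value: object
    reason: str


class ValidationFailed(Exception):
    """Hypothetical error type that carries every collected failure."""

    def __init__(self, failures: list[Failure]) -> None:
        super().__init__(f"{len(failures)} validation failure(s)")
        self.failures = failures


def validate_rows(rows: list[dict], lazy: bool = False) -> None:
    """lazy=False raises on the first bad value; lazy=True scans every row
    and raises once with the full list of failures."""
    rules = {
        "user_id": (lambda v: isinstance(v, int) and v >= 0, "must be an int >= 0"),
        "score": (lambda v: isinstance(v, float) and 0.0 <= v <= 1.0, "must be a float in [0, 1]"),
    }
    failures: list[Failure] = []
    for i, row in enumerate(rows):
        for column, (is_ok, reason) in rules.items():
            value = row.get(column)
            if not is_ok(value):
                failures.append(Failure(i, column, value, reason))
                if not lazy:
                    raise ValidationFailed(failures)
    if failures:
        raise ValidationFailed(failures)


def count_failures(rows: list[dict], lazy: bool) -> int:
    """Return how many failures a validation run reported (0 if it passed)."""
    try:
        validate_rows(rows, lazy=lazy)
    except ValidationFailed as exc:
        return len(exc.failures)
    return 0


rows = [
    {"user_id": -1, "score": 0.5},  # bad user_id
    {"user_id": 2, "score": 1.5},   # bad score
    {"user_id": 3, "score": 0.2},   # valid
]
eager_count = count_failures(rows, lazy=False)
lazy_count = count_failures(rows, lazy=True)
```

With the sample rows, the eager run stops after the first bad value while the lazy run reports both, which is why lazy mode suits batch reports.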
Use Pydantic for row-level validation in ingestion endpoints:

```python
from pydantic import BaseModel, Field, field_validator
from typing import Literal


class RecordInput(BaseModel):
    user_id: int = Field(ge=0)
    age: float | None = Field(default=None, ge=0, le=120)
    category: Literal["A", "B", "C"]
    score: float = Field(ge=0.0, le=1.0)

    @field_validator("score")
    @classmethod
    def score_precision(cls, v: float) -> float:
        return round(v, 6)
```
```python
from fastapi import APIRouter, HTTPException
import pandas as pd
import pandera as pa

# RecordInput and InputSchema are the models defined in the blocks above.
router = APIRouter()


@router.post("/ingest")
async def ingest_batch(records: list[RecordInput]) -> dict:
    # Pydantic has already validated each row; Pandera now checks the batch as a whole.
    df = pd.DataFrame([r.model_dump() for r in records])
    try:
        InputSchema.validate(df, lazy=True)
    except pa.errors.SchemaErrors as e:
        raise HTTPException(status_code=422, detail=e.failure_cases.to_dict())
    return {"accepted": len(df)}
```
Validate at each stage boundary:

```python
class FeatureSchema(pa.DataFrameModel):
    feature_1: Series[float] = pa.Field(nullable=False)
    feature_2: Series[float] = pa.Field(nullable=False)
    target: Series[int] = pa.Field(isin=[0, 1])

    class Config:
        strict = False


class PredictionSchema(pa.DataFrameModel):
    user_id: Series[int]
    probability: Series[float] = pa.Field(ge=0.0, le=1.0)
    label: Series[int] = pa.Field(isin=[0, 1])
```
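One way to picture a stage boundary is a wrapper that checks data on the way into a stage and again on the way out. The sketch below uses only the standard library and plain dict rows rather than Pandera; `require_columns`, `stage_contract`, and the toy `predict` function are hypothetical names, and the scoring rule is invented purely for illustration.

```python
from functools import wraps
from typing import Callable

Rows = list[dict]


def require_columns(rows: Rows, required: set[str], where: str) -> None:
    """Raise ValueError if any row lacks one of the required keys."""
    for i, row in enumerate(rows):
        missing = required - set(row)
        if missing:
            raise ValueError(f"{where}: row {i} is missing {sorted(missing)}")


def stage_contract(inputs: set[str], outputs: set[str]) -> Callable:
    """Hypothetical decorator: check rows before and after the wrapped stage."""
    def decorator(fn: Callable[[Rows], Rows]) -> Callable[[Rows], Rows]:
        @wraps(fn)
        def wrapper(rows: Rows) -> Rows:
            require_columns(rows, inputs, f"{fn.__name__} input")
            result = fn(rows)
            require_columns(result, outputs, f"{fn.__name__} output")
            return result
        return wrapper
    return decorator


@stage_contract(
    inputs={"user_id", "feature_1", "feature_2"},
    outputs={"user_id", "probability", "label"},
)
def predict(rows: Rows) -> Rows:
    # Toy scoring rule for illustration only: mean of the two features, clamped to [0, 1].
    out = []
    for row in rows:
        p = min(1.0, max(0.0, (row["feature_1"] + row["feature_2"]) / 2))
        out.append({"user_id": row["user_id"], "probability": p, "label": int(p >= 0.5)})
    return out


result = predict([{"user_id": 1, "feature_1": 0.25, "feature_2": 0.75}])
```

Extra keys in a row pass through untouched here, which mirrors the `strict = False` setting on the feature schema.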
For lightweight checks without a full framework:

```python
from dataclasses import dataclass
from typing import Callable

import pandas as pd


@dataclass
class Check:
    name: str
    fn: Callable[[pd.DataFrame], bool]
    error_msg: str


def run_checks(df: pd.DataFrame, checks: list[Check]) -> list[str]:
    failures = []
    for check in checks:
        if not check.fn(df):
            failures.append(f"{check.name}: {check.error_msg}")
    return failures


QUALITY_CHECKS = [
    Check("no_nulls_user_id", lambda df: df["user_id"].notna().all(), "user_id has nulls"),
    Check("score_range", lambda df: df["score"].between(0, 1).all(), "score out of [0,1]"),
    Check("min_rows", lambda df: len(df) >= 10, "batch too small (< 10 rows)"),
]
```
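The same check-list pattern also works where pandas is not available. The following is a dependency-free sketch that assumes a batch is a list of dicts; `RowCheck`, `run_row_checks`, and `ROW_CHECKS` are hypothetical names that mirror the DataFrame version rather than part of any library.

```python
from dataclasses import dataclass
from typing import Callable

Rows = list[dict]


@dataclass
class RowCheck:
    name: str
    fn: Callable[[Rows], bool]
    error_msg: str


def run_row_checks(rows: Rows, checks: list[RowCheck]) -> list[str]:
    """Return one message per failed check; an empty list means the batch passed."""
    return [f"{c.name}: {c.error_msg}" for c in checks if not c.fn(rows)]


ROW_CHECKS = [
    RowCheck(
        "no_nulls_user_id",
        lambda rows: all(r.get("user_id") is not None for r in rows),
        "user_id has nulls",
    ),
    RowCheck(
        "score_range",
        lambda rows: all(0 <= r["score"] <= 1 for r in rows),
        "score out of [0,1]",
    ),
    RowCheck("min_rows", lambda rows: len(rows) >= 10, "batch too small (< 10 rows)"),
]

bad_batch = [{"user_id": 1, "score": 0.4}, {"user_id": None, "score": 1.7}]
good_batch = [{"user_id": i, "score": 0.5} for i in range(10)]

bad_failures = run_row_checks(bad_batch, ROW_CHECKS)
good_failures = run_row_checks(good_batch, ROW_CHECKS)
```

Because every check runs regardless of earlier failures, the caller gets the complete list and can decide whether to log it, reject the batch, or both.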
- `strict=True` rejects any columns not in the schema. Use `strict=False` for pass-through pipelines where extra columns are expected.
- `lazy=True` in `validate()` collects all failures before raising, so use it for batch reporting. Without it, validation stops at the first error.
- `field_validator` runs after type coercion by default, so validate the coerced value, not the raw input string.
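The effect of the `strict` setting on column sets can be illustrated with a small stdlib sketch. This is a simplified model of the idea, not Pandera's code; `check_columns` is a hypothetical helper.

```python
def check_columns(actual: list[str], expected: list[str], strict: bool) -> list[str]:
    """Report missing columns always, and unexpected columns only when strict."""
    problems = [f"missing column: {c}" for c in expected if c not in actual]
    if strict:
        problems += [f"unexpected column: {c}" for c in actual if c not in expected]
    return problems


expected = ["user_id", "score"]
actual = ["user_id", "score", "debug_flag"]  # one extra pass-through column

strict_problems = check_columns(actual, expected, strict=True)
relaxed_problems = check_columns(actual, expected, strict=False)
```

In this model the extra `debug_flag` column fails the strict check and passes the relaxed one, while a missing expected column fails both.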