skills/eda/SKILL.md
Exploratory Data Analysis skill for CSV and parquet datasets with deterministic profiling, drift/anomaly scans, contract generation and validation, and optional memory writeback into skill-system-memory. The implementation is Polars-first (lazy scan for large files and early `--sample` head), includes high-cardinality guards for profile/importance/contract flows, and supports categorical correlation with Cramer's V. Use when building or reviewing tabular fraud/risk/data-quality workflows, profiling new datasets, checking leakage or drift, or saving/validating data contracts.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add arthur0824hao/skills skill-system-eda
Use scripts/eda.py for deterministic EDA artifacts. The current stable backend is tabular EDA, and the multimodal entrypoint is explore.
python3 scripts/eda.py detect-modality --input data_root
python3 scripts/eda.py explore --input data_or_folder
python3 scripts/eda.py explore --input data_or_folder --modality graph
python3 scripts/eda.py graph-viz --input data.csv --features amount,score --id-column account_id --label Class --edge-mode knn --topk 50 --normalize l2 --similarity cosine --output /tmp/eda_graph
python3 scripts/eda.py profile-dataset --input data.csv --target Class --output /tmp/eda
python3 scripts/eda.py distribution-report --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py correlation-matrix --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py anomaly-profiling --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py feature-importance-scan --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py leakage-detector --input data.csv --target Class --profile /tmp/eda/profile.yaml
python3 scripts/eda.py save-contract --profile /tmp/eda/profile.yaml --output /tmp/eda/contract.yaml
python3 scripts/eda.py validate-contract --input new_data.csv --contract /tmp/eda/contract.yaml
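The skill description says the correlation step supports categorical correlation with Cramér's V. As a rough illustration of that statistic, here is a minimal pure-Python sketch. The function name `cramers_v` is hypothetical, and this is the plain (uncorrected) form; whether `scripts/eda.py` applies a bias correction or other adjustments is not stated in this document.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length categorical sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))
    # Degrees-of-freedom style normalizer: min(rows, cols) - 1.
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k <= 0:
        # The statistic is undefined for empty or constant input.
        # Returning 0.0 here is a choice made for this sketch.
        return 0.0
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            chi2 += (observed.get((r, c), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))
```

A value near 1 indicates that one column almost determines the other, which is also the kind of signal a leakage check would want to surface. A value near 0 indicates little association.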
- `profile-dataset` creates `profile.yaml` and `report.md`.
- `explore` reports detected modalities and the selected backend; tabular mode may immediately route into the existing tabular profiling.
- `graph-viz` emits `graph_viz/index.html`, `graph_viz/graph.json`, and `graph_viz/sliders.json`, and appends graph-viz references to `profile.yaml` and `report.md`.
- `graph-viz` records `renderer_hint`, `preview_applied`, and browser-edge limits so large-graph fallback is explicit rather than silent.
- The follow-up analysis commands update `profile.yaml` and append sections to `report.md`.
- `save-contract` emits `contract.yaml`.
- `validate-contract` prints JSON `PASS` / `FAIL` with a violation list.

- Sampling applies `.head(N)` when `--sample` is used.
- Treat `profile.yaml` as the machine-readable source of truth; `report.md` is the human-readable companion.
- Inputs are read through Polars lazy scans (`scan_csv`/`scan_parquet`), with materialization delayed until needed.
- High-cardinality guards: a column with >50 unique values skips one-hot encoding in feature importance, and profiling truncates categorical columns (>100 unique or >50% row cardinality) to their top-20 values.

- Memory writeback uses `skill-system-memory/scripts/mem.py store` when available.
- When `EDA_DISABLE_MEM_PY=1` is set, fallback payloads are written under `.memory/pending/`.
- Use `--no-memory` for deterministic tests or when no writeback is desired.

- `save-contract` derives column requirements from `profile.yaml`.
- High-cardinality columns get `cardinality_range` rules instead of `allowed_values`.
- `validate-contract` fails closed and returns machine-readable violations.

- `graph-viz` is a tabular-to-graph visualization flow, not a replacement for graph-native modality EDA.
- `graph.json` is the viewer payload authority; `sliders.json` is the UI-control authority.
- `renderer_hint=canvas` means the full interactive force layout is expected to be browser-safe.
- `renderer_hint=webgl` means dataset scale or edge volume exceeded the canvas-friendly threshold; the shipped viewer still loads, but preview edges are preferred by default.
- `--max-browser-edges` controls when preview fallback is applied. Raising it may crash Chromium on very large graphs.

Example: Esun-style feature-bank payload generalized into EDA input/output conventions:
python3 scripts/eda.py graph-viz \
--input Work/Study/GNN/FraudDetect/esun_data/combined_features.csv \
--features senior28_01,senior28_02,senior28_03,senior28_04 \
--id-column account_id \
--label is_fraud \
--edge-mode knn \
--topk 50 \
--normalize l2 \
--similarity cosine \
--max-browser-edges 60000 \
--output /tmp/esun_graph_viz
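The flags in this command (`--edge-mode knn`, `--topk`, `--normalize l2`, `--similarity cosine`, `--max-browser-edges`) suggest a k-nearest-neighbor graph over L2-normalized feature vectors. The sketch below shows that idea in plain Python. The names `l2_normalize`, `knn_cosine_edges`, and `needs_preview` are hypothetical, and the strict `>` comparison against the edge limit is an assumption; the actual `graph-viz` implementation may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def knn_cosine_edges(features, topk):
    """Return (source, target, similarity) edges to each node's topk nearest nodes.

    After L2 normalization, the dot product of two rows is their cosine similarity.
    This brute-force O(n^2) loop is for illustration only.
    """
    unit = [l2_normalize(v) for v in features]
    edges = []
    for i, a in enumerate(unit):
        sims = [
            (sum(x * y for x, y in zip(a, b)), j)
            for j, b in enumerate(unit)
            if j != i
        ]
        # Highest similarity first; ties broken by node index for determinism.
        sims.sort(key=lambda s: (-s[0], s[1]))
        edges.extend((i, j, sim) for sim, j in sims[:topk])
    return edges


def needs_preview(edge_count, max_browser_edges):
    """Assumed rule: fall back to preview edges once the limit is exceeded."""
    return edge_count > max_browser_edges
```

Under this reading, a graph with n nodes has at most n × `topk` directed edges before any de-duplication, which gives a quick way to gauge whether a run is likely to hit the `--max-browser-edges` limit.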
Example: generic customer risk dataset with a multi-class label column:
python3 scripts/eda.py graph-viz \
--input data/customer_risk.parquet \
--features amount,velocity_score,merchant_entropy,geo_distance \
--id-column customer_id \
--label segment \
--edge-mode knn \
--topk 25 \
--normalize l2 \
--similarity cosine \
--output /tmp/customer_graph_viz
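Separately from graph-viz, the high-cardinality guards noted earlier (>50 unique values skips one-hot encoding; >100 unique values or >50% row cardinality truncates to the top-20 values) can be read as small predicates. The following is a sketch of that reading only. The constant and function names are hypothetical, and the exact boundary handling in `scripts/eda.py` is assumed, not confirmed.

```python
from collections import Counter

ONE_HOT_MAX_UNIQUE = 50    # above this, feature importance skips one-hot encoding
PROFILE_MAX_UNIQUE = 100   # above this, the profile truncates the value list
PROFILE_MAX_RATIO = 0.5    # ...or when unique values exceed this share of rows
PROFILE_TOP_K = 20         # number of most frequent values kept when truncating


def skip_one_hot(n_unique):
    """True when a categorical column is too wide to one-hot encode."""
    return n_unique > ONE_HOT_MAX_UNIQUE


def is_high_cardinality(n_unique, n_rows):
    """True when the profile should truncate this column's values."""
    if n_unique > PROFILE_MAX_UNIQUE:
        return True
    return n_rows > 0 and n_unique / n_rows > PROFILE_MAX_RATIO


def profile_values(values):
    """Return (value_counts, truncated) for one categorical column."""
    counts = Counter(values)
    if is_high_cardinality(len(counts), len(values)):
        return counts.most_common(PROFILE_TOP_K), True
    return counts.most_common(), False
```

The same `is_high_cardinality` decision would plausibly drive the contract side as well, where such columns receive `cardinality_range` rules instead of `allowed_values`.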
{
"schema_version": "2.0",
"id": "skill-system-eda",
"version": "1.1.0",
"capabilities": [
"eda-detect",
"eda-graph-viz",
"eda-profile",
"eda-distribution",
"eda-correlation",
"eda-anomaly",
"eda-feature-importance",
"eda-leakage",
"eda-contract-save",
"eda-contract-validate"
],
"effects": ["fs.read", "fs.write", "proc.exec"],
"operations": {
"profile-dataset": {
"description": "Profile a CSV/parquet dataset and generate profile.yaml plus report.md.",
"input": {
"input": { "type": "string", "required": true },
"target": { "type": "string", "required": false },
"output": { "type": "string", "required": true },
"sample": { "type": "integer", "required": false },
"no_memory": { "type": "boolean", "required": false }
},
"output": {
"description": "Artifact paths for the generated EDA profile",
"fields": { "profile": "string", "report": "string" }
},
"entrypoints": {
"unix": ["python3", "scripts/eda.py", "profile-dataset", "--input", "{input}", "--output", "{output}"]
}
},
"detect-modality": {
"description": "Detect dataset modality and return all matching modality tags.",
"input": {
"input": { "type": "string", "required": true }
},
"output": {
"description": "Detected modalities",
"fields": { "modalities": "array", "path": "string" }
},
"entrypoints": {
"unix": ["python3", "scripts/eda.py", "detect-modality", "--input", "{input}"]
}
},
"explore-dataset": {
"description": "Detect dataset modality and route to the appropriate EDA backend.",
"input": {
"input": { "type": "string", "required": true },
"modality": { "type": "string", "required": false },
"output": { "type": "string", "required": false }
},
"output": {
"description": "Detected modalities, selected modality, and backend routing result",
"fields": { "detected_modalities": "array", "selected_modality": "string", "status": "string" }
},
"entrypoints": {
"unix": ["python3", "scripts/eda.py", "explore", "--input", "{input}"]
}
},
"graph-viz": {
"description": "Build reusable graph visualization artifacts for tabular or graph datasets.",
"input": {
"input": { "type": "string", "required": true },
"features": { "type": "string", "required": false },
"id_column": { "type": "string", "required": false },
"label": { "type": "string", "required": false },
"edge_mode": { "type": "string", "required": false },
"edge_input": { "type": "string", "required": false },
"topk": { "type": "integer", "required": false },
"normalize": { "type": "string", "required": false },
"similarity": { "type": "string", "required": false },
"sample": { "type": "integer", "required": false },
"output": { "type": "string", "required": true }
},
"output": {
"description": "Graph visualization artifact paths and integration outputs",
"fields": { "html": "string", "graph_json": "string", "slider_config": "string", "renderer_hint": "string", "profile": "string", "report": "string" }
},
"entrypoints": {
"unix": ["python3", "scripts/eda.py", "graph-viz", "--input", "{input}", "--output", "{output}"]
}
},
"distribution-report": {
"description": "Append distribution and class-conditional analysis to an existing profile/report.",
"input": {
"input": { "type": "string", "required": true },
"target": { "type": "string", "required": true },
"profile": { "type": "string", "required": true }
},
"output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
"entrypoints": {
"unix": ["python3", "scripts/eda.py", "distribution-report", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
}
},
"correlation-matrix": {
"description": "Compute feature and target correlations and append them to profile/report.",
"input": {
"input": { "type": "string", "required": true },
"target": { "type": "string", "required": false },
"profile": { "type": "string", "required": true }
},
"output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
"entrypoints": {
"unix": ["python3", "scripts/eda.py", "correlation-matrix", "--input", "{input}", "--profile", "{profile}"]
}
},
"anomaly-profiling": {
"description": "Compare class-conditional distributions and effect sizes.",
"input": {
"input": { "type": "string", "required": true },
"target": { "type": "string", "required": true },
"profile": { "type": "string", "required": true }
},
"output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
"entrypoints": {
"unix": ["python3", "scripts/eda.py", "anomaly-profiling", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
}
},
"feature-importance-scan": {
"description": "Rank features with mutual information and optional tree importances.",
"input": {
"input": { "type": "string", "required": true },
"target": { "type": "string", "required": true },
"profile": { "type": "string", "required": true }
},
"output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
"entrypoints": {
"unix": ["python3", "scripts/eda.py", "feature-importance-scan", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
}
},
"leakage-detector": {
"description": "Detect high-correlation, target-encoding, and temporal leakage indicators.",
"input": {
"input": { "type": "string", "required": true },
"target": { "type": "string", "required": true },
"profile": { "type": "string", "required": true }
},
"output": { "description": "Updated profile/report paths", "fields": { "profile": "string", "report": "string" } },
"entrypoints": {
"unix": ["python3", "scripts/eda.py", "leakage-detector", "--input", "{input}", "--target", "{target}", "--profile", "{profile}"]
}
},
"save-contract": {
"description": "Generate a data contract from a saved EDA profile.",
"input": {
"profile": { "type": "string", "required": true },
"output": { "type": "string", "required": true }
},
"output": { "description": "Contract path", "fields": { "contract": "string" } },
"entrypoints": {
"unix": ["python3", "scripts/eda.py", "save-contract", "--profile", "{profile}", "--output", "{output}"]
}
},
"validate-contract": {
"description": "Validate a new dataset against a saved contract and emit PASS/FAIL JSON.",
"input": {
"input": { "type": "string", "required": true },
"contract": { "type": "string", "required": true }
},
"output": { "description": "Validation status and violations", "fields": { "status": "string", "violations": "array" } },
"entrypoints": {
"unix": ["python3", "scripts/eda.py", "validate-contract", "--input", "{input}", "--contract", "{contract}"]
}
}
},
"stdout_contract": {
"last_line_json": true
}
}
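The manifest's `entrypoints` use `{name}` placeholders, and `stdout_contract.last_line_json` indicates that the final stdout line is JSON. A caller could consume both roughly as follows. This is a minimal sketch: `render_entrypoint` and `parse_last_line_json` are hypothetical helpers, not part of the skill, and skipping blank trailing lines is an assumption.

```python
import json


def render_entrypoint(template, params):
    """Fill "{name}" placeholders in a manifest argv template.

    Raises KeyError if the template names a parameter that was not supplied.
    """
    return [part.format(**params) for part in template]


def parse_last_line_json(stdout):
    """Parse the last non-empty stdout line as JSON, per last_line_json."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output to parse")
    return json.loads(lines[-1])


# Example: build the validate-contract command from the manifest template.
template = [
    "python3", "scripts/eda.py", "validate-contract",
    "--input", "{input}", "--contract", "{contract}",
]
argv = render_entrypoint(
    template,
    {"input": "new_data.csv", "contract": "/tmp/eda/contract.yaml"},
)
```

The resulting `argv` list can be passed to `subprocess.run(argv, capture_output=True, text=True)`, and the captured stdout handed to `parse_last_line_json` to read the `status` and `violations` fields.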