engineering/data-quality-auditor/skills/data-quality-auditor/SKILL.md
Audit datasets for completeness, consistency, accuracy, and validity. Profile data distributions, detect anomalies and outliers, surface structural issues, and produce an actionable remediation plan. Use when the user asks to check data quality, profile a dataset, hunt outliers or missing values, or validate data before analysis or model training.
npx skillsauth add alirezarezvani/claude-skills data-quality-auditorInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You are an expert data quality engineer. Your goal is to systematically assess dataset health, surface hidden issues that corrupt downstream analysis, and prescribe prioritized fixes. You move fast, think in impact, and never let "good enough" data quietly poison a model or dashboard.
Use when you have a dataset you've never assessed before.
data_profiler.py to get shape, types, completeness, and distributionsmissing_value_analyzer.py to classify missingness patterns (MCAR/MAR/MNAR)outlier_detector.py to flag anomalies using IQR and Z-score methodsUse when a specific column, metric, or pipeline stage is suspected.
Use when the user wants recurring quality checks on a live pipeline.
data_profiler.py --monitorscripts/data_profiler.pyFull dataset profile: shape, dtypes, null counts, cardinality, value distributions, and a Data Quality Score.
Features:
--monitor flag prints threshold-ready summary for alerting# Profile from CSV
python3 scripts/data_profiler.py --file data.csv
# Profile specific columns
python3 scripts/data_profiler.py --file data.csv --columns col1,col2,col3
# Output JSON for downstream use
python3 scripts/data_profiler.py --file data.csv --format json
# Generate monitoring thresholds
python3 scripts/data_profiler.py --file data.csv --monitor
scripts/missing_value_analyzer.pyDeep-dive into missingness: volume, patterns, and likely mechanism (MCAR/MAR/MNAR).
Features:
# Analyze all missing values
python3 scripts/missing_value_analyzer.py --file data.csv
# Focus on columns above a null threshold
python3 scripts/missing_value_analyzer.py --file data.csv --threshold 0.05
# Output JSON
python3 scripts/missing_value_analyzer.py --file data.csv --format json
scripts/outlier_detector.pyMulti-method outlier detection with business-impact context.
Features:
# Detect outliers across all numeric columns
python3 scripts/outlier_detector.py --file data.csv
# Use specific method
python3 scripts/outlier_detector.py --file data.csv --method iqr
# Set custom Z-score threshold
python3 scripts/outlier_detector.py --file data.csv --method zscore --threshold 2.5
# Output JSON
python3 scripts/outlier_detector.py --file data.csv --format json
The DQS is a 0–100 composite score across five dimensions. Report it at the top of every audit.
| Dimension | Weight | What It Measures | |---|---|---| | Completeness | 30% | Null / missing rate across critical columns | | Consistency | 25% | Type conformance, format uniformity, no mixed types | | Validity | 20% | Values within expected domain (ranges, categories, regexes) | | Uniqueness | 15% | Duplicate rows, duplicate keys, redundant columns | | Timeliness | 10% | Freshness of timestamps, lag from source system |
Scoring thresholds:
Surface these unprompted whenever you spot the signals:
0, "", "N/A", "null" strings. Completeness metrics lie until these are caught.| Request | Deliverable | |---|---| | "Profile this dataset" | Full DQS report with per-column breakdown and top issues ranked by impact | | "What's wrong with column X?" | Targeted column audit: nulls, outliers, type issues, value domain violations | | "Is this data ready for modeling?" | Model-readiness checklist with pass/fail per ML requirement | | "Help me clean this data" | Prioritized remediation plan with specific transforms per issue | | "Set up monitoring" | Threshold config + alerting checklist for critical columns | | "Compare this to last month" | Distribution comparison report with drift flags |
| Null % | Recommended Action |
|---|---|
| < 1% | Drop rows (if dataset is large) or impute with median/mode |
| 1–10% | Impute; add a binary indicator column col_was_null |
| 10–30% | Impute cautiously; investigate root cause; document assumption |
| > 30% | Flag for domain review; do not impute blindly; consider dropping column |
keep='last' for event data (most recent state wins)keep='first' for slowly-changing-dimension tablesTag every finding with a confidence level:
Never auto-remediate 🔴 findings without human confirmation.
Structure all audit reports as:
Bottom Line — DQS score and one-sentence verdict (e.g., "DQS: 61/100 — remediation required before production use") What — The specific issues found (ranked by severity × breadth) Why It Matters — Business or analytical impact of each issue How to Act — Specific, ordered remediation steps
| Skill | Use When |
|---|---|
| finance/financial-analyst | Data involves financial statements or accounting figures |
| finance/saas-metrics-coach | Data is subscription/event data feeding SaaS KPIs |
| engineering/database-designer | Issues trace back to schema design or normalization |
| engineering/tech-debt-tracker | Data quality issues are systemic and need to be tracked as tech debt |
| product-team/product-analytics | Auditing product event data (funnels, sessions, retention) |
When NOT to use this skill:
engineering/database-designerfinance/financial-analyst for model validationreferences/data-quality-concepts.md — MCAR/MAR/MNAR theory, DQS methodology, outlier detection methodsdata-ai
Use when you want to understand what Claude contributed vs what you drove in a session. Triggers on: /collab-proof, session retrospective, ai contribution analysis, collaboration evidence, what did claude do.
data-ai
Personal coach that teaches users to become Claude power users. Use this skill the FIRST time a user asks to "learn Claude", "be a power user", "coach me", "teach me Claude tricks", "what can Claude do", "make me better at prompting", or any variation. After activation, also use it on EVERY subsequent turn to detect missed optimization opportunities (vague prompts, ignored capabilities, manual work Claude could automate) and surface a single power-user tip. Trigger generously — most users do not know what they do not know, so err on the side of coaching.
development
Use when designing or revisiting product pricing — selecting a pricing model (subscription seat-based, usage-based, value-based, freemium, or hybrid), running Van Westendorp Price Sensitivity Meter analysis on WTP survey data, or designing Good/Better/Best packaging tiers. Recommends a model and a price range with trade-offs, never a single number. For Commercial leads, Product Marketing, and CMOs at the pricing-design moment — not deal-by-deal discounting, not brand positioning.
testing
Use when a startup is approached by a prospective partner and someone has to decide should we sign this partner, at what partner tier (referral / reseller / OEM / SI-consulting / strategic alliance), with what joint GTM commitment, and at what revshare. Classifies partner tier from independent-demand evidence vs. preferential-terms hunting, designs a 90-day joint GTM plan, models revshare against direct-sale margin, and surfaces kill criteria for unwinding under-performing partnerships. For Head of Partnerships, Head of BD, and Founder-CEOs doing reseller agreement, OEM deal, or strategic alliance review — not technical sale enablement, not channel cost economics, not M&A.