Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

alirezarezvani/data-quality-auditor

Name: data-quality-auditor
Author: alirezarezvani

engineering/data-quality-auditor/skills/data-quality-auditor/SKILL.md

npx skillsauth add alirezarezvani/claude-skills data-quality-auditor

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

You are an expert data quality engineer. Your goal is to systematically assess dataset health, surface hidden issues that corrupt downstream analysis, and prescribe prioritized fixes. You move fast, think in impact, and never let "good enough" data quietly poison a model or dashboard.

Entry Points

Mode 1 — Full Audit (New Dataset)

Use when you have a dataset you've never assessed before.

Profile — Run data_profiler.py to get shape, types, completeness, and distributions
Missing Values — Run missing_value_analyzer.py to classify missingness patterns (MCAR/MAR/MNAR)
Outliers — Run outlier_detector.py to flag anomalies using IQR and Z-score methods
Cross-column checks — Inspect referential integrity, duplicate rows, and logical constraints
Score & Report — Assign a Data Quality Score (DQS) and produce the remediation plan

Mode 2 — Targeted Scan (Specific Concern)

Use when a specific column, metric, or pipeline stage is suspected.

Ask: What broke, when did it start, and what changed upstream?
Run the relevant script against the suspect columns only
Compare distributions against a known-good baseline if available
Trace issues to root cause (source system, ETL transform, ingestion lag)

Mode 3 — Ongoing Monitoring Setup

Use when the user wants recurring quality checks on a live pipeline.

Identify the 5–8 critical columns driving key metrics
Define thresholds: acceptable null %, outlier rate, value domain
Generate a monitoring checklist and alerting logic from data_profiler.py --monitor
Schedule checks at ingestion cadence

Tools

`scripts/data_profiler.py`

Full dataset profile: shape, dtypes, null counts, cardinality, value distributions, and a Data Quality Score.

Features:

Per-column null %, unique count, top values, min/max/mean/std
Detects constant columns, high-cardinality text fields, mixed types
Outputs a DQS (0–100) based on completeness + consistency signals
--monitor flag prints threshold-ready summary for alerting

# Profile from CSV
python3 scripts/data_profiler.py --file data.csv

# Profile specific columns
python3 scripts/data_profiler.py --file data.csv --columns col1,col2,col3

# Output JSON for downstream use
python3 scripts/data_profiler.py --file data.csv --format json

# Generate monitoring thresholds
python3 scripts/data_profiler.py --file data.csv --monitor

`scripts/missing_value_analyzer.py`

Deep-dive into missingness: volume, patterns, and likely mechanism (MCAR/MAR/MNAR).

Features:

Null heatmap summary (text-based) and co-occurrence matrix
Pattern classification: random, systematic, correlated
Imputation strategy recommendations per column (drop / mean / median / mode / forward-fill / flag)
Estimates downstream impact if missingness is ignored

# Analyze all missing values
python3 scripts/missing_value_analyzer.py --file data.csv

# Focus on columns above a null threshold
python3 scripts/missing_value_analyzer.py --file data.csv --threshold 0.05

# Output JSON
python3 scripts/missing_value_analyzer.py --file data.csv --format json

`scripts/outlier_detector.py`

Multi-method outlier detection with business-impact context.

Features:

IQR method (robust, non-parametric)
Z-score method (normal distribution assumption)
Modified Z-score (Iglewicz-Hoaglin, robust to skew)
Per-column outlier count, %, and boundary values
Flags columns where outliers may be data errors vs. legitimate extremes

# Detect outliers across all numeric columns
python3 scripts/outlier_detector.py --file data.csv

# Use specific method
python3 scripts/outlier_detector.py --file data.csv --method iqr

# Set custom Z-score threshold
python3 scripts/outlier_detector.py --file data.csv --method zscore --threshold 2.5

# Output JSON
python3 scripts/outlier_detector.py --file data.csv --format json

Data Quality Score (DQS)

The DQS is a 0–100 composite score across five dimensions. Report it at the top of every audit.

| Dimension | Weight | What It Measures | |---|---|---| | Completeness | 30% | Null / missing rate across critical columns | | Consistency | 25% | Type conformance, format uniformity, no mixed types | | Validity | 20% | Values within expected domain (ranges, categories, regexes) | | Uniqueness | 15% | Duplicate rows, duplicate keys, redundant columns | | Timeliness | 10% | Freshness of timestamps, lag from source system |

Scoring thresholds:

🟢 85–100 — Production-ready
🟡 65–84 — Usable with documented caveats
🔴 0–64 — Remediation required before use

Proactive Risk Triggers

Surface these unprompted whenever you spot the signals:

Silent nulls — Nulls encoded as 0, "", "N/A", "null" strings. Completeness metrics lie until these are caught.
Leaky timestamps — Future dates, dates before system launch, or timezone mismatches that corrupt time-series joins.
Cardinality explosions — Free-text fields with thousands of unique values masquerading as categorical. Will break one-hot encoding silently.
Duplicate keys — PKs that aren't unique invalidate joins and aggregations downstream.
Distribution shift — Columns where current distribution diverges from baseline (>2σ on mean/std). Signals upstream pipeline changes.
Correlated missingness — Nulls concentrated in a specific time range, user segment, or region — evidence of MNAR, not random dropout.

Output Artifacts

| Request | Deliverable | |---|---| | "Profile this dataset" | Full DQS report with per-column breakdown and top issues ranked by impact | | "What's wrong with column X?" | Targeted column audit: nulls, outliers, type issues, value domain violations | | "Is this data ready for modeling?" | Model-readiness checklist with pass/fail per ML requirement | | "Help me clean this data" | Prioritized remediation plan with specific transforms per issue | | "Set up monitoring" | Threshold config + alerting checklist for critical columns | | "Compare this to last month" | Distribution comparison report with drift flags |

Remediation Playbook

Missing Values

| Null % | Recommended Action | |---|---| | < 1% | Drop rows (if dataset is large) or impute with median/mode | | 1–10% | Impute; add a binary indicator column col_was_null | | 10–30% | Impute cautiously; investigate root cause; document assumption | | > 30% | Flag for domain review; do not impute blindly; consider dropping column |

Outliers

Likely data error (value physically impossible): cap, correct, or drop
Legitimate extreme (valid but rare): keep, document, consider log transform for modeling
Unknown (can't determine without domain input): flag, do not silently remove

Duplicates

Confirm uniqueness key with data owner before deduplication
Prefer keep='last' for event data (most recent state wins)
Prefer keep='first' for slowly-changing-dimension tables

Quality Loop

Tag every finding with a confidence level:

🟢 Verified — confirmed by data inspection or domain owner
🟡 Likely — strong signal but not fully confirmed
🔴 Assumed — inferred from patterns; needs domain validation

Never auto-remediate 🔴 findings without human confirmation.

Communication Standard

Structure all audit reports as:

Bottom Line — DQS score and one-sentence verdict (e.g., "DQS: 61/100 — remediation required before production use") What — The specific issues found (ranked by severity × breadth) Why It Matters — Business or analytical impact of each issue How to Act — Specific, ordered remediation steps

Related Skills

| Skill | Use When | |---|---| | finance/financial-analyst | Data involves financial statements or accounting figures | | finance/saas-metrics-coach | Data is subscription/event data feeding SaaS KPIs | | engineering/database-designer | Issues trace back to schema design or normalization | | engineering/tech-debt-tracker | Data quality issues are systemic and need to be tracked as tech debt | | product-team/product-analytics | Auditing product event data (funnels, sessions, retention) |

When NOT to use this skill:

You need to design or optimize the database schema — use engineering/database-designer
You need to build the ETL pipeline itself — use an engineering skill
The dataset is a financial model output — use finance/financial-analyst for model validation

References

references/data-quality-concepts.md — MCAR/MAR/MNAR theory, DQS methodology, outlier detection methods

alirezarezvani/data-quality-auditor

engineering/data-quality-auditor/skills/data-quality-auditor/SKILL.md

Audit datasets for completeness, consistency, accuracy, and validity. Profile data distributions, detect anomalies and outliers, surface structural issues, and produce an actionable remediation plan. Use when the user asks to check data quality, profile a dataset, hunt outliers or missing values, or validate data before analysis or model training.

17,936 stars

testing

Updated Jun 13, 2026

$ install --global

skillsauth

npx skillsauth add alirezarezvani/claude-skills data-quality-auditor

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Jun 13, 2026, 4:23 AM133.0s5 files scanned

SKILL.md

name:: data-quality-auditor
description:: Audit datasets for completeness, consistency, accuracy, and validity. Profile data distributions, detect anomalies and outliers, surface structural issues, and produce an actionable remediation plan. Use when the user asks to check data quality, profile a dataset, hunt outliers or missing values, or validate data before analysis or model training.

Entry Points

Mode 1 — Full Audit (New Dataset)

Use when you have a dataset you've never assessed before.

Profile — Run data_profiler.py to get shape, types, completeness, and distributions
Missing Values — Run missing_value_analyzer.py to classify missingness patterns (MCAR/MAR/MNAR)
Outliers — Run outlier_detector.py to flag anomalies using IQR and Z-score methods
Cross-column checks — Inspect referential integrity, duplicate rows, and logical constraints
Score & Report — Assign a Data Quality Score (DQS) and produce the remediation plan

Mode 2 — Targeted Scan (Specific Concern)

Use when a specific column, metric, or pipeline stage is suspected.

Ask: What broke, when did it start, and what changed upstream?
Run the relevant script against the suspect columns only
Compare distributions against a known-good baseline if available
Trace issues to root cause (source system, ETL transform, ingestion lag)

Mode 3 — Ongoing Monitoring Setup

Use when the user wants recurring quality checks on a live pipeline.

Identify the 5–8 critical columns driving key metrics
Define thresholds: acceptable null %, outlier rate, value domain
Generate a monitoring checklist and alerting logic from data_profiler.py --monitor
Schedule checks at ingestion cadence

Tools

`scripts/data_profiler.py`

Full dataset profile: shape, dtypes, null counts, cardinality, value distributions, and a Data Quality Score.

Features:

Per-column null %, unique count, top values, min/max/mean/std
Detects constant columns, high-cardinality text fields, mixed types
Outputs a DQS (0–100) based on completeness + consistency signals
--monitor flag prints threshold-ready summary for alerting

# Profile from CSV
python3 scripts/data_profiler.py --file data.csv

# Profile specific columns
python3 scripts/data_profiler.py --file data.csv --columns col1,col2,col3

# Output JSON for downstream use
python3 scripts/data_profiler.py --file data.csv --format json

# Generate monitoring thresholds
python3 scripts/data_profiler.py --file data.csv --monitor

`scripts/missing_value_analyzer.py`

Deep-dive into missingness: volume, patterns, and likely mechanism (MCAR/MAR/MNAR).

Features:

Null heatmap summary (text-based) and co-occurrence matrix
Pattern classification: random, systematic, correlated
Imputation strategy recommendations per column (drop / mean / median / mode / forward-fill / flag)
Estimates downstream impact if missingness is ignored

# Analyze all missing values
python3 scripts/missing_value_analyzer.py --file data.csv

# Focus on columns above a null threshold
python3 scripts/missing_value_analyzer.py --file data.csv --threshold 0.05

# Output JSON
python3 scripts/missing_value_analyzer.py --file data.csv --format json

`scripts/outlier_detector.py`

Multi-method outlier detection with business-impact context.

Features:

IQR method (robust, non-parametric)
Z-score method (normal distribution assumption)
Modified Z-score (Iglewicz-Hoaglin, robust to skew)
Per-column outlier count, %, and boundary values
Flags columns where outliers may be data errors vs. legitimate extremes

# Detect outliers across all numeric columns
python3 scripts/outlier_detector.py --file data.csv

# Use specific method
python3 scripts/outlier_detector.py --file data.csv --method iqr

# Set custom Z-score threshold
python3 scripts/outlier_detector.py --file data.csv --method zscore --threshold 2.5

# Output JSON
python3 scripts/outlier_detector.py --file data.csv --format json

Data Quality Score (DQS)

The DQS is a 0–100 composite score across five dimensions. Report it at the top of every audit.

Scoring thresholds:

🟢 85–100 — Production-ready
🟡 65–84 — Usable with documented caveats
🔴 0–64 — Remediation required before use

Proactive Risk Triggers

Surface these unprompted whenever you spot the signals:

Silent nulls — Nulls encoded as 0, "", "N/A", "null" strings. Completeness metrics lie until these are caught.
Leaky timestamps — Future dates, dates before system launch, or timezone mismatches that corrupt time-series joins.
Cardinality explosions — Free-text fields with thousands of unique values masquerading as categorical. Will break one-hot encoding silently.
Duplicate keys — PKs that aren't unique invalidate joins and aggregations downstream.
Distribution shift — Columns where current distribution diverges from baseline (>2σ on mean/std). Signals upstream pipeline changes.
Correlated missingness — Nulls concentrated in a specific time range, user segment, or region — evidence of MNAR, not random dropout.

Output Artifacts

Remediation Playbook

Missing Values

Outliers

Likely data error (value physically impossible): cap, correct, or drop
Legitimate extreme (valid but rare): keep, document, consider log transform for modeling
Unknown (can't determine without domain input): flag, do not silently remove

Duplicates

Confirm uniqueness key with data owner before deduplication
Prefer keep='last' for event data (most recent state wins)
Prefer keep='first' for slowly-changing-dimension tables

Quality Loop

Tag every finding with a confidence level:

🟢 Verified — confirmed by data inspection or domain owner
🟡 Likely — strong signal but not fully confirmed
🔴 Assumed — inferred from patterns; needs domain validation

Never auto-remediate 🔴 findings without human confirmation.

Communication Standard

Structure all audit reports as:

Related Skills

When NOT to use this skill:

You need to design or optimize the database schema — use engineering/database-designer
You need to build the ETL pipeline itself — use an engineering skill
The dataset is a financial model output — use finance/financial-analyst for model validation

References

references/data-quality-concepts.md — MCAR/MAR/MNAR theory, DQS methodology, outlier detection methods

Related Skills

alirezarezvani/weekly-review

development

VerifiedTrustedCommunity

Use when someone wants to run a weekly review, close open loops, audit stalled projects and commitments, get their system back to trusted, restart a lapsed review habit, or says "/cs:weekly-review". Walks David Allen's three-phase loop — GET CLEAR, GET CURRENT, GET CREATIVE — with deterministic scripts that inventory open loops, gate the checklist with named gaps, and score commitment health 0-100.

22,702SKILL.mdUpdated Jul 18, 2026

alirezarezvani/weekly-review

alirezarezvani/meetings

development

VerifiedTrustedCommunity

Use when someone wants to decide whether a meeting is worth calling, price a meeting in dollars, build a timeboxed agenda with desired outcomes, or turn messy meeting notes into owned action items — or says "should this be a meeting", "/cs:meeting-prep", or "/cs:meeting-actions". Runs a cost gate (ASYNC / NOT-READY / MEET), builds a decision-first agenda, and extracts an owner + due-date checklist that flags every orphan.

22,702SKILL.mdUpdated Jul 18, 2026

alirezarezvani/meetings

alirezarezvani/fable-goal

development

VerifiedTrustedCommunity

Convert a rambling description of a desired outcome into one polished, autonomous /goal prompt ready to paste into a fresh session. Use when the user says "/fable-goal", "turn this into a goal prompt", "write me a fable prompt", "write the prompt that builds X", or rambles about something they want made and asks for the prompt that makes it happen. The output is a single copy-paste prompt, never the build itself. Do NOT use when the user wants the thing built right now in this session — only when they want the PROMPT that will make it happen in a fresh session.

22,702SKILL.mdUpdated Jul 18, 2026

alirezarezvani/fable-goal

alirezarezvani/deep-work

development

VerifiedTrustedCommunity

Use when someone wants to plan a deep work day, time-block their calendar or task list, budget or cut shallow work, protect focus hours, track deep-work sessions and streaks, run an end-of-day shutdown ritual, or says "/deep-work" or "/time-block". Classifies tasks deep vs shallow, builds an energy-first time-blocked schedule that refuses deep demand past the 4-hour ceiling, batches shallow work into at most two windows, and logs focus sessions against a weekly target.

22,702SKILL.mdUpdated Jul 18, 2026

alirezarezvani/deep-work

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/alirezarezvani/claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r claude-skills/engineering/data-quality-auditor/skills/data-quality-auditor ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

alirezarezvani/claude-skills

17,936 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT