SKILL.md

name:: data-quality
description:: Quality dimensions quick reference and remediation decision tree for tabular data assessment

Data Quality Assessment with qsv

For the full step-by-step profiling workflow, use the /data-profile command. This skill provides quick-reference guidance for quality assessment and remediation decisions.

Quality Dimensions (Quick Reference)

| Dimension | Key Question | Primary Check | Red Flag | |-----------|-------------|---------------|----------| | Completeness | Missing values? | stats — nullcount, sparsity | Sparsity > 0.5 | | Uniqueness | Unwanted duplicates? | stats --cardinality vs row count | Key column cardinality < row count | | Validity | Correct formats/types? | stats — type; validate schema.json | String type on numeric column | | Consistency | Uniform formats? | frequency — case variants; sniff — encoding | Same value in different cases | | Accuracy | Plausible values? | stats — min/max/stddev | Values > 3 stddev from mean | | Column Name Quality | Headers safe & descriptive? | safenames --verify | Spaces, special chars, or duplicates in headers | | Conformity | Values follow standards? | searchset with domain regex | Non-standard codes (country, state, zip, phone) | | Referential Integrity | Foreign keys valid? | joinp --left-anti | Orphaned references across related files | | Injection Safety | Malicious payloads? | searchset with injection regex | Formula/SQL injection patterns in cells | | Documentation | Dataset described? | describegpt --all | No Data Dictionary or Description |

Remediation Decision Tree

When a quality issue is found, choose the right fix:

| Problem | Severity | Fix Command | When to Skip | |---------|----------|-------------|-------------| | Ragged rows | High | fixlengths | Never — breaks downstream tools | | Wrong encoding | High | input | File is already UTF-8 (check with sniff) | | Unsafe column names | Medium | safenames | Headers already safe (no spaces/special chars) | | Leading/trailing whitespace | Medium | sqlp with TRIM(col) | Stats show no difference between min/max lengths and trimmed values | | Duplicate rows | Medium | dedup (or extdedup for >1GB) | stats --cardinality on key columns shows all unique | | Inconsistent case | Low | sqlp with UPPER(col) or LOWER(col) | frequency shows no case variants | | Empty values | Low | sqlp with COALESCE(NULLIF(col, ''), 'N/A') | Nulls are semantically meaningful | | Non-conforming values | Medium | searchset + search --flag | No domain standard applies | | Orphaned foreign keys | Medium | joinp --left-anti | Single-file dataset with no references | | Injection payloads | High | searchset with injection regex + sanitize | Data is internal-only and never opened in spreadsheets or loaded into databases | | Invalid rows | Low | validate schema.json + filter | No schema available |

Fix Ordering

Always apply fixes in this order to avoid cascading issues:

1. input          (encoding — must be UTF-8 before anything else)
2. safenames      (headers — fixes names before column references)
3. fixlengths     (structure — ensures consistent field counts)
4. sqlp with TRIM()    (whitespace — clean values before dedup)
5. dedup          (duplicates — remove after trimming so "foo " and "foo" match)
6. validate       (validation — check against schema last)

Stats Cache as Quality Dashboard

After running stats --cardinality --stats-jsonl (basic moarstats auto-runs), read the .stats.csv cache to assess quality in one pass:

| Cache Column | Quality Signal | |-------------|----------------| | nullcount | Completeness — 0 is ideal | | sparsity | Completeness — ratio of nulls (0.0–1.0) | | cardinality | Uniqueness — compare to row count | | type | Validity — check expected types | | min / max | Accuracy — plausible range? | | mean / stddev | Accuracy — outlier detection (>3σ) | | outliers_total_cnt | Accuracy — from moarstats; outlier count per column | | mode | Consistency — dominant value expected? |

Advanced Stats (via `moarstats --advanced`)

Run moarstats --advanced to enrich the cache with distribution shape metrics:

| Cache Column | Quality Signal | |-------------|----------------| | kurtosis | >3 heavy tails (outlier-prone), <3 light tails; >10 = extreme outliers | | bimodality_coefficient | >=0.555 suggests bimodal distribution (possible mixed populations) | | jarque_bera_pvalue | <0.05 = NOT normally distributed; flag analyses assuming normality | | gini_coefficient | Near 1 = extreme concentration; near 0 = uniform | | shannon_entropy | Low = concentrated values; high = diverse | | winsorized_mean | Compare to mean — large difference signals outlier influence | | median_mean_ratio | <0.8 or >1.2 = significantly skewed; mean may be misleading | | range_stddev_ratio | Very high = extreme outliers relative to variability | | cv | >100% = high relative variability; data is highly spread relative to mean | | mad_stddev_ratio | >0.8 = stddev is reliable; <<0.8 = outliers inflating stddev | | mode_zscore | Far from 0 = mode is atypical; possible mixed populations | | trimean | Robust central tendency: (Q1 + 2*median + Q3)/4; compare to mean for skew detection | | midhinge | Midpoint of middle 50%: (Q1+Q3)/2; robust center measure | | robust_cv | MAD/abs(median); outlier-resistant coefficient of variation | | theil_index | Inequality measure (0=equal); decomposable into within/between group; only for positive values | | mean_ad | Average absolute distance from mean; less sensitive to outliers than stddev | | simpsons_diversity_index | Probability two random values differ (0-1); more intuitive than entropy |

SKILL.md

name:: data-quality
description:: Quality dimensions quick reference and remediation decision tree for tabular data assessment

Data Quality Assessment with qsv

For the full step-by-step profiling workflow, use the /data-profile command. This skill provides quick-reference guidance for quality assessment and remediation decisions.

Quality Dimensions (Quick Reference)

Remediation Decision Tree

When a quality issue is found, choose the right fix:

Fix Ordering

Always apply fixes in this order to avoid cascading issues:

1. input          (encoding — must be UTF-8 before anything else)
2. safenames      (headers — fixes names before column references)
3. fixlengths     (structure — ensures consistent field counts)
4. sqlp with TRIM()    (whitespace — clean values before dedup)
5. dedup          (duplicates — remove after trimming so "foo " and "foo" match)
6. validate       (validation — check against schema last)

Stats Cache as Quality Dashboard

After running stats --cardinality --stats-jsonl (basic moarstats auto-runs), read the .stats.csv cache to assess quality in one pass:

Advanced Stats (via `moarstats --advanced`)

Run moarstats --advanced to enrich the cache with distribution shape metrics:

Adoption

dathere/data-quality

$ install --global

Security Scan Results

SKILL.md

Data Quality Assessment with qsv

Quality Dimensions (Quick Reference)

Remediation Decision Tree

Fix Ordering

Stats Cache as Quality Dashboard

Advanced Stats (via `moarstats --advanced`)

Related Skills

dathere/reproducible-analysis

dathere/qsv-performance

dathere/infer-ontology

dathere/data-viz

dathere/data-quality

$ install --global

Security Scan Results

SKILL.md

Data Quality Assessment with qsv

Quality Dimensions (Quick Reference)

Remediation Decision Tree

Fix Ordering

Stats Cache as Quality Dashboard

Advanced Stats (via `moarstats --advanced`)

Related Skills

dathere/reproducible-analysis

dathere/qsv-performance

dathere/infer-ontology

dathere/data-viz

Adoption

dathere/data-quality

$ install --global

Security Scan Results

SKILL.md

Data Quality Assessment with qsv

Quality Dimensions (Quick Reference)

Remediation Decision Tree

Fix Ordering

Stats Cache as Quality Dashboard

Advanced Stats (via moarstats --advanced)

Related Skills

dathere/reproducible-analysis

dathere/qsv-performance

dathere/infer-ontology

dathere/data-viz

dathere/data-quality

$ install --global

Security Scan Results

SKILL.md

Data Quality Assessment with qsv

Quality Dimensions (Quick Reference)

Remediation Decision Tree

Fix Ordering

Stats Cache as Quality Dashboard

Advanced Stats (via moarstats --advanced)

Related Skills

dathere/reproducible-analysis

dathere/qsv-performance

dathere/infer-ontology

dathere/data-viz

Advanced Stats (via `moarstats --advanced`)

Advanced Stats (via `moarstats --advanced`)