.claude/skills/skills/data-quality/SKILL.md
Quality dimensions quick reference and remediation decision tree for tabular data assessment
npx skillsauth add dathere/qsv data-qualityInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
For the full step-by-step profiling workflow, use the /data-profile command. This skill provides quick-reference guidance for quality assessment and remediation decisions.
| Dimension | Key Question | Primary Check | Red Flag |
|-----------|-------------|---------------|----------|
| Completeness | Missing values? | stats — nullcount, sparsity | Sparsity > 0.5 |
| Uniqueness | Unwanted duplicates? | stats --cardinality vs row count | Key column cardinality < row count |
| Validity | Correct formats/types? | stats — type; validate schema.json | String type on numeric column |
| Consistency | Uniform formats? | frequency — case variants; sniff — encoding | Same value in different cases |
| Accuracy | Plausible values? | stats — min/max/stddev | Values > 3 stddev from mean |
| Column Name Quality | Headers safe & descriptive? | safenames --verify | Spaces, special chars, or duplicates in headers |
| Conformity | Values follow standards? | searchset with domain regex | Non-standard codes (country, state, zip, phone) |
| Referential Integrity | Foreign keys valid? | joinp --left-anti | Orphaned references across related files |
| Injection Safety | Malicious payloads? | searchset with injection regex | Formula/SQL injection patterns in cells |
| Documentation | Dataset described? | describegpt --all | No Data Dictionary or Description |
When a quality issue is found, choose the right fix:
| Problem | Severity | Fix Command | When to Skip |
|---------|----------|-------------|-------------|
| Ragged rows | High | fixlengths | Never — breaks downstream tools |
| Wrong encoding | High | input | File is already UTF-8 (check with sniff) |
| Unsafe column names | Medium | safenames | Headers already safe (no spaces/special chars) |
| Leading/trailing whitespace | Medium | sqlp with TRIM(col) | Stats show no difference between min/max lengths and trimmed values |
| Duplicate rows | Medium | dedup (or extdedup for >1GB) | stats --cardinality on key columns shows all unique |
| Inconsistent case | Low | sqlp with UPPER(col) or LOWER(col) | frequency shows no case variants |
| Empty values | Low | sqlp with COALESCE(NULLIF(col, ''), 'N/A') | Nulls are semantically meaningful |
| Non-conforming values | Medium | searchset + search --flag | No domain standard applies |
| Orphaned foreign keys | Medium | joinp --left-anti | Single-file dataset with no references |
| Injection payloads | High | searchset with injection regex + sanitize | Data is internal-only and never opened in spreadsheets or loaded into databases |
| Invalid rows | Low | validate schema.json + filter | No schema available |
Always apply fixes in this order to avoid cascading issues:
1. input (encoding — must be UTF-8 before anything else)
2. safenames (headers — fixes names before column references)
3. fixlengths (structure — ensures consistent field counts)
4. sqlp with TRIM() (whitespace — clean values before dedup)
5. dedup (duplicates — remove after trimming so "foo " and "foo" match)
6. validate (validation — check against schema last)
After running stats --cardinality --stats-jsonl (basic moarstats auto-runs), read the .stats.csv cache to assess quality in one pass:
| Cache Column | Quality Signal |
|-------------|----------------|
| nullcount | Completeness — 0 is ideal |
| sparsity | Completeness — ratio of nulls (0.0–1.0) |
| cardinality | Uniqueness — compare to row count |
| type | Validity — check expected types |
| min / max | Accuracy — plausible range? |
| mean / stddev | Accuracy — outlier detection (>3σ) |
| outliers_total_cnt | Accuracy — from moarstats; outlier count per column |
| mode | Consistency — dominant value expected? |
moarstats --advanced)Run moarstats --advanced to enrich the cache with distribution shape metrics:
| Cache Column | Quality Signal |
|-------------|----------------|
| kurtosis | >3 heavy tails (outlier-prone), <3 light tails; >10 = extreme outliers |
| bimodality_coefficient | >=0.555 suggests bimodal distribution (possible mixed populations) |
| jarque_bera_pvalue | <0.05 = NOT normally distributed; flag analyses assuming normality |
| gini_coefficient | Near 1 = extreme concentration; near 0 = uniform |
| shannon_entropy | Low = concentrated values; high = diverse |
| winsorized_mean | Compare to mean — large difference signals outlier influence |
| median_mean_ratio | <0.8 or >1.2 = significantly skewed; mean may be misleading |
| range_stddev_ratio | Very high = extreme outliers relative to variability |
| cv | >100% = high relative variability; data is highly spread relative to mean |
| mad_stddev_ratio | >0.8 = stddev is reliable; <<0.8 = outliers inflating stddev |
| mode_zscore | Far from 0 = mode is atypical; possible mixed populations |
| trimean | Robust central tendency: (Q1 + 2*median + Q3)/4; compare to mean for skew detection |
| midhinge | Midpoint of middle 50%: (Q1+Q3)/2; robust center measure |
| robust_cv | MAD/abs(median); outlier-resistant coefficient of variation |
| theil_index | Inequality measure (0=equal); decomposable into within/between group; only for positive values |
| mean_ad | Average absolute distance from mean; less sensitive to outliers than stddev |
| simpsons_diversity_index | Probability two random values differ (0-1); more intuitive than entropy |
development
Machine-readable journal format for reproducible data analysis operations
documentation
Performance guide covering index files, stats cache, and frequency cache accelerators for qsv
data-ai
Infer a semantic ontology from all files in the working directory - entities, attributes, relationships, domain taxonomy, and cross-file join paths. Outputs ONTOLOGY.md.
development
Create publication-quality visualizations from CSV/TSV/Excel data using Python