.claude/skills/skills/data-clean/SKILL.md
Clean a CSV/TSV/Excel file - fix headers, trim whitespace, remove duplicates, validate
npx skillsauth add dathere/qsv data-cleanInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Clean the given tabular data file by fixing common data quality issues.
Cowork note: If relative paths don't resolve, call
mcp__qsv__qsv_get_working_dirandmcp__qsv__qsv_set_working_dirto sync the working directory.
Index: Run mcp__qsv__qsv_index on the file for fast random access in subsequent steps.
Assess current state: Run mcp__qsv__qsv_sniff and mcp__qsv__qsv_count to understand the file format and size.
Profile for cleaning decisions: Run mcp__qsv__qsv_stats with cardinality: true, stats_jsonl: true. Read .stats.csv to decide which cleaning steps are needed:
| Stats Column | What It Reveals | Cleaning Action |
|-------------|-----------------|-----------------|
| nullcount, sparsity | Missing values per column | If sparsity > 0.5, decide: impute, drop column, or flag |
| cardinality vs row count | Duplicate rows exist if any key column has cardinality < row count | Run dedup |
| min_length, max_length | String length variation | Large gap suggests ragged data or embedded whitespace |
| sort_order | Whether data is pre-sorted | Use dedup --sorted for streaming mode if sorted |
| mode, mode_count | Dominant values | If mode_count > 80% of rows, investigate data entry defaults |
| type | Inferred types | String columns that should be numeric indicate format issues |
Check headers: Run mcp__qsv__qsv_headers to inspect column names. If names contain spaces, special characters, or are duplicated, plan to use safenames.
Build cleaning steps: Apply these operations in order (skip any that aren't needed based on assessment):
a. safenames - Normalize column names to safe, ASCII-only identifiers (removes spaces, special chars, ensures uniqueness)
b. fixlengths - Ensure all rows have the same number of fields (pads short rows, truncates long rows)
c. sqlp - Remove leading/trailing whitespace from columns using TRIM(). Example: SELECT TRIM(col1) AS col1, TRIM(col2) AS col2 FROM _t_1.
d. dedup - Remove exact duplicate rows. Loads all data into memory and sorts internally. Use --sorted if input is already sorted to enable streaming mode with constant memory.
e. validate - If a JSON Schema is available, validate against it and report violations.
Verify results: Run mcp__qsv__qsv_count on the output to confirm row count. Run mcp__qsv__qsv_stats with cardinality: true to verify improvements.
Report changes: Summarize what was cleaned:
Call each tool sequentially, passing the output of one step as input to the next:
mcp__qsv__qsv_command with command: "safenames", input_file: "<file>", output_file: "step1.csv"mcp__qsv__qsv_command with command: "fixlengths", input_file: "step1.csv", output_file: "step2.csv"mcp__qsv__qsv_sqlp with input_file: "step2.csv", sql: "SELECT TRIM(col1) AS col1, TRIM(col2) AS col2, ... FROM _t_1", output_file: "step3.csv" (list all columns with TRIM)mcp__qsv__qsv_command with command: "dedup", input_file: "step3.csv", output_file: "<output>"dedup loads entire file into memory to sort and deduplicate; consider using sqlp with SELECT DISTINCT insteadsafenames uses --mode conditional by default (only renames if needed)dedup loads all data into memory and sorts internally; if input is already sorted, use --sorted for streaming modemcp__qsv__qsv_search_tools to find additional cleaning tools if needed (e.g., replace for regex substitution)development
Machine-readable journal format for reproducible data analysis operations
documentation
Performance guide covering index files, stats cache, and frequency cache accelerators for qsv
data-ai
Infer a semantic ontology from all files in the working directory - entities, attributes, relationships, domain taxonomy, and cross-file join paths. Outputs ONTOLOGY.md.
development
Create publication-quality visualizations from CSV/TSV/Excel data using Python