dathere/data-clean

Name: data-clean
Author: dathere

.claude/skills/skills/data-clean/SKILL.md

npx skillsauth add dathere/qsv data-clean

Data Clean

Clean the given tabular data file by fixing common data quality issues.

Cowork note: If relative paths don't resolve, call mcp__qsv__qsv_get_working_dir and mcp__qsv__qsv_set_working_dir to sync the working directory.

Steps

1. Index: Run mcp__qsv__qsv_index on the file for fast random access in subsequent steps.
2. Assess current state: Run mcp__qsv__qsv_sniff and mcp__qsv__qsv_count to understand the file format and size.
3. Profile for cleaning decisions: Run mcp__qsv__qsv_stats with cardinality: true, stats_jsonl: true. Read the resulting .stats.csv file to decide which cleaning steps are needed:

| Stats Column | What It Reveals | Cleaning Action |
|--------------|-----------------|-----------------|
| nullcount, sparsity | Missing values per column | If sparsity > 0.5, decide: impute, drop column, or flag |
| cardinality vs row count | Repeated values in a key column (cardinality < row count), which point to duplicate rows | Run dedup |
| min_length, max_length | String length variation | A large gap suggests ragged data or embedded whitespace; plan to trim |
| sort_order | Whether data is pre-sorted | Use dedup --sorted for streaming mode if sorted |
| mode, mode_count | Dominant values | If mode_count > 80% of rows, investigate data entry defaults |
| type | Inferred types | String columns that should be numeric indicate format issues |

4. Check headers: Run mcp__qsv__qsv_headers to inspect column names. If names contain spaces or special characters, or are duplicated, plan to use safenames.
5. Build cleaning steps: Apply these operations in order (skip any that the assessment shows are not needed):

   a. safenames - Normalize column names to safe, ASCII-only identifiers (removes spaces and special characters, ensures uniqueness)

   b. fixlengths - Ensure all rows have the same number of fields. By default, short rows are padded to the length of the longest row; rows are truncated only when an explicit --length is set.

   c. sqlp - Remove leading/trailing whitespace from columns using TRIM(). Example: SELECT TRIM(col1) AS col1, TRIM(col2) AS col2 FROM _t_1.

   d. dedup - Remove exact duplicate rows. Loads all data into memory and sorts internally. Use --sorted if input is already sorted to enable streaming mode with constant memory.

   e. validate - If a JSON Schema is available, validate against it and report violations.
6. Verify results: Run mcp__qsv__qsv_count on the output to confirm the row count. Run mcp__qsv__qsv_stats with cardinality: true to verify improvements.
7. Report changes: Summarize what was cleaned:
   - Headers renamed (before -> after)
   - Rows with wrong field count (fixed by fixlengths)
   - Duplicate rows removed
   - Whitespace trimmed
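
The thresholds in the profiling table can be turned into a small decision helper. The following is only a sketch in Python: the column names (field, sparsity, cardinality, mode_count) are assumed to match the stats output, the sample rows are made up for illustration, and the key_columns parameter is a hypothetical way to say which columns are expected to be unique.

```python
import csv
import io


def suggest_actions(stats_rows, row_count, key_columns=()):
    """Map per-column stats to the cleaning actions from the profiling table."""
    actions = []
    for row in stats_rows:
        name = row["field"]
        # Sparsity above 0.5 means more than half the values are missing.
        if float(row["sparsity"]) > 0.5:
            actions.append((name, "mostly empty: impute, drop, or flag"))
        # A key column should be unique, so a cardinality below the row
        # count points at repeated keys.
        if name in key_columns and int(row["cardinality"]) < row_count:
            actions.append((name, "repeated keys: run dedup"))
        # One value covering more than 80% of rows may be a data entry default.
        if int(row["mode_count"]) > 0.8 * row_count:
            actions.append((name, "dominant value: check data entry defaults"))
    return actions


# Hypothetical excerpt of a stats file for a 10-row input (not real output).
SAMPLE_STATS = """field,type,cardinality,sparsity,mode_count
id,Integer,9,0.0,2
notes,String,3,0.7,1
country,String,2,0.0,9
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_STATS)))
for column, action in suggest_actions(rows, row_count=10, key_columns=("id",)):
    print(f"{column}: {action}")
```

The helper only flags candidates; deciding whether to impute, drop, or deduplicate still depends on what the data is for.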

Cleaning Steps

Call each tool sequentially, passing the output of one step as input to the next:

mcp__qsv__qsv_command with command: "safenames", input_file: "<file>", output_file: "step1.csv"
mcp__qsv__qsv_command with command: "fixlengths", input_file: "step1.csv", output_file: "step2.csv"
mcp__qsv__qsv_sqlp with input_file: "step2.csv", sql: "SELECT TRIM(col1) AS col1, TRIM(col2) AS col2, ... FROM _t_1", output_file: "step3.csv" (list all columns with TRIM)
mcp__qsv__qsv_command with command: "dedup", input_file: "step3.csv", output_file: "<output>"
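
As a rough illustration of the chaining above, the sketch below builds the four calls as plain Python dictionaries and generates the TRIM query for every column. The dictionary layout (including the "tool" key) and the function names are hypothetical, chosen only for this example; only the tool names, parameters, and the _t_1 table name come from the steps above.

```python
def trim_sql(columns, table="_t_1"):
    """Build a SELECT that wraps every column in TRIM() and keeps its name."""
    parts = []
    for col in columns:
        # Double-quote identifiers (doubling embedded quotes) so unusual
        # column names still produce valid SQL.
        quoted = '"' + col.replace('"', '""') + '"'
        parts.append(f"TRIM({quoted}) AS {quoted}")
    return f"SELECT {', '.join(parts)} FROM {table}"


def cleaning_plan(input_file, output_file, columns):
    """Return the four calls in order, each reading the previous step's output."""
    return [
        {"tool": "mcp__qsv__qsv_command", "command": "safenames",
         "input_file": input_file, "output_file": "step1.csv"},
        {"tool": "mcp__qsv__qsv_command", "command": "fixlengths",
         "input_file": "step1.csv", "output_file": "step2.csv"},
        {"tool": "mcp__qsv__qsv_sqlp", "sql": trim_sql(columns),
         "input_file": "step2.csv", "output_file": "step3.csv"},
        {"tool": "mcp__qsv__qsv_command", "command": "dedup",
         "input_file": "step3.csv", "output_file": output_file},
    ]


plan = cleaning_plan("raw.csv", "clean.csv", ["name", "city"])
for step in plan:
    print(step["input_file"], "->", step["output_file"])
```

Note that the column list passed to trim_sql should be the names produced by safenames, since that step may rename headers before the query runs.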

Notes

- Always preserve the original file: write output to a new file.
- dedup loads the entire file into memory to sort and deduplicate. If the input is already sorted, use --sorted for streaming mode with constant memory. For large unsorted files (> 100MB), consider using sqlp with SELECT DISTINCT instead.
- safenames uses --mode conditional by default (only renames if needed).
- If the user specifies particular columns to clean, use column selection syntax instead of cleaning all columns.
- Use mcp__qsv__qsv_search_tools to find additional cleaning tools if needed (e.g., replace for regex substitution).
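
The deduplication guidance in these notes amounts to a small decision rule. A minimal Python sketch follows, assuming that 100MB means 100 * 1024 * 1024 bytes and that sortedness is already known (for example from sort_order in the stats); the function name and the returned labels are illustrative only.

```python
LARGE_FILE_BYTES = 100 * 1024 * 1024  # assumed reading of "> 100MB"


def choose_dedup(size_bytes, is_sorted):
    """Pick a deduplication approach from file size and sort order."""
    if is_sorted:
        # Sorted input lets dedup stream with constant memory.
        return "dedup --sorted"
    if size_bytes > LARGE_FILE_BYTES:
        # Avoid loading a large unsorted file into memory for sorting.
        return "sqlp SELECT DISTINCT"
    return "dedup"


print(choose_dedup(5_000_000, is_sorted=False))
print(choose_dedup(500_000_000, is_sorted=False))
print(choose_dedup(500_000_000, is_sorted=True))
```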

Clean a CSV/TSV/Excel file - fix headers, trim whitespace, remove duplicates, validate

3,595 stars

testing

Updated Apr 15, 2026

npx skillsauth add dathere/qsv data-clean

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

- Trivy (container and dependency vulnerability scanner): Clean, 95%
- Semgrep (static code analysis for vulnerabilities): Clean, 95%
- mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
- Snyk (dep) (open source security scanning): Skipped, 50%
- Socket.dev (supply chain security analysis): Skipped, 50%
- VirusTotal (multi-engine malware detection): Skipped, 50%
- CrowdStrike (advanced threat intelligence): Skipped, 50%
- OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
- OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 15, 2026, 4:51 PM (4.8s, 1 file scanned)

SKILL.md

name: data-clean
description: Clean a CSV/TSV/Excel file - fix headers, trim whitespace, remove duplicates, validate
user-invocable: true
argument-hint: <file>
allowed-tools: [mcp__qsv__qsv_sniff, mcp__qsv__qsv_count, mcp__qsv__qsv_headers, mcp__qsv__qsv_index, mcp__qsv__qsv_stats, mcp__qsv__qsv_sqlp, mcp__qsv__qsv_command, mcp__qsv__qsv_list_files, mcp__qsv__qsv_search_tools, mcp__qsv__qsv_get_working_dir, mcp__qsv__qsv_set_working_dir]

Related Skills

- dathere/reproducible-analysis (development): Machine-readable journal format for reproducible data analysis operations. 3,595 stars, updated Apr 4, 2026.
- dathere/qsv-performance (documentation): Performance guide covering index files, stats cache, and frequency cache accelerators for qsv. 3,595 stars, updated Apr 4, 2026.
- dathere/infer-ontology (data-ai): Infer a semantic ontology from all files in the working directory - entities, attributes, relationships, domain taxonomy, and cross-file join paths. Outputs ONTOLOGY.md. 3,595 stars, updated Apr 4, 2026.
- dathere/data-viz (development): Create publication-quality visualizations from CSV/TSV/Excel data using Python. 3,595 stars, updated Apr 4, 2026.

Install manually

# Clone the repo
git clone https://github.com/dathere/qsv.git

# Make sure the global Claude Code skills folder exists
mkdir -p ~/.claude/skills

# Copy the skill into the Claude Code skills folder (global)
cp -r qsv/.claude/skills/skills/data-clean ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: dathere/qsv (3,595 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT