.claude/skills/skills/qsv-performance/SKILL.md
Performance guide covering index files, stats cache, and frequency cache accelerators for qsv
Index file (.csv.idx)
Created by: qsv index
Used by: count, slice, sample, split, stats, frequency, schema, and others marked with 📇
| Benefit | Without Index | With Index |
|---------|--------------|------------|
| Row count | Scan entire file | Instant (stored in index) |
| Random access | Sequential scan | O(1) lookup |
| Multithreaded | Not possible | Enabled for many commands |
| Slicing | Read from start | Jump to position |
Rule: Always run index first if you'll run 2+ commands on the same file.
Auto-indexing: The MCP server auto-indexes files > 10MB.
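As a minimal sketch of the "index first" rule, the helper below builds the index only when it is missing or older than the CSV. The name `ensure_index` is hypothetical (it is not a qsv subcommand), and the staleness check via `-nt` is an assumption about how you might want to refresh the index, not documented qsv behavior.

```shell
# Hypothetical helper: create data.csv.idx only when needed.
ensure_index() {
  csv="$1"
  idx="${csv}.idx"   # qsv index writes the index next to the input file
  # Rebuild when the index is missing or the CSV is newer than the index.
  if [ ! -f "$idx" ] || [ "$csv" -nt "$idx" ]; then
    qsv index "$csv"
  fi
}

# Example usage (requires qsv on PATH; data.csv is a placeholder):
#   ensure_index data.csv
#   qsv count data.csv                         # row count read from the index
#   qsv slice --start 1000 --len 10 data.csv   # jumps to the position
```

Calling the helper before every command is cheap, because the second and later calls skip the indexing step.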
Stats cache (.stats.csv + .stats.csv.data.jsonl)
Created by: qsv stats --cardinality --stats-jsonl
Used by: frequency, schema, tojsonl, sqlp, joinp, pivotp, diff, sample (smart commands)
| Smart Command | What It Uses from Cache |
|--------------|------------------------|
| frequency | Cardinality to skip all-unique columns |
| schema | Data types for JSON Schema generation |
| sqlp | Column types for Polars optimization |
| joinp | Cardinality for optimal join order |
| pivotp | Cardinality to estimate output width |
| diff | Column types for comparison |
Rule: Run stats --cardinality --stats-jsonl before using any smart command.
Auto-caching: The MCP server auto-adds --stats-jsonl to stats commands.
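The rule above can be sketched as a small wrapper that warms the stats cache before any smart command runs. `warm_stats_cache` is a hypothetical name, and discarding stdout assumes the cache files are written as a side effect of the stats run, as this section describes.

```shell
# Hypothetical helper: populate the stats cache for one CSV file.
warm_stats_cache() {
  csv="$1"
  # Flags are the ones named in this guide; the stats table itself is
  # discarded because only the cache files matter here (assumption).
  qsv stats --cardinality --stats-jsonl "$csv" > /dev/null
}

# Example usage (requires qsv on PATH; data.csv is a placeholder):
#   warm_stats_cache data.csv
#   qsv frequency data.csv   # smart command, can reuse cached cardinality
#   qsv schema data.csv      # smart command, can reuse cached data types
```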
Commands: sqlp, joinp, pivotp, count (with --polars-len), schema (with --polars)
| Benefit | Standard (csv crate) | Polars Engine |
|---------|---------------------|---------------|
| Processing model | Row-by-row streaming | Vectorized columnar |
| Memory | Streaming (constant) | Columnar (efficient) |
| Parallelism | Single-threaded | Multi-threaded |
| Large files | Limited by memory | Larger-than-memory |
| SQL support | N/A | Full SQL dialect |
Rule: Use Polars commands (sqlp, joinp, pivotp) for files > 100MB or complex queries.
For repeated SQL queries on large CSV (> 10MB), consider converting to Parquet with mcp__qsv__qsv_to_parquet. Parquet is a columnar format that speeds up repeated SQL queries in mcp__qsv__qsv_sqlp. Use read_parquet('file.parquet') as the table source. DuckDB is the preferred engine for Parquet queries; mcp__qsv__qsv_sqlp with SKIP_INPUT as the input_file value also works. Note: mcp__qsv__qsv_sqlp can query CSV of any size directly — Parquet is an optimization for repeated queries, not a requirement. Parquet works ONLY with mcp__qsv__qsv_sqlp and DuckDB — all other qsv commands require CSV/TSV/SSV input.
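To illustrate the `read_parquet(...)` table source, here is a small sketch that builds the query string. `parquet_sql` is a hypothetical helper, `file.parquet` is a placeholder, and the command-line form in the usage comment assumes the qsv CLI accepts `SKIP_INPUT` the same way the MCP tool does.

```shell
# Hypothetical helper: build a query that reads a Parquet file directly.
# $1 = path to the Parquet file, $2 = trailing SQL (WHERE, ORDER BY, LIMIT, ...)
parquet_sql() {
  printf "SELECT * FROM read_parquet('%s') %s" "$1" "$2"
}

# Example usage (assumes the CLI mirrors the MCP tool's SKIP_INPUT behavior):
#   qsv sqlp SKIP_INPUT "$(parquet_sql file.parquet 'LIMIT 10')"
```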
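Memory-intensive 🤯 (load the entire file into memory): dedup, reverse, sort, stats (with extended stats), table, transpose
Memory proportional to column cardinality 😣: frequency, join, schema, tojsonl
Streaming (constant memory): everything else, including select, search, slice, replace, and count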
File size?
├── < 10MB: Any command works fine
├── 10MB - 100MB:
│ ├── Always: index first
│ ├── Repeated SQL: consider Parquet with qsv_to_parquet
│ ├── Prefer: streaming commands
│ └── OK: memory-intensive if < available RAM
├── 100MB - 1GB:
│ ├── Always: index + stats cache first
│ ├── Repeated SQL: consider Parquet with qsv_to_parquet
│ ├── Prefer: Polars commands (sqlp, joinp, pivotp)
│ ├── Avoid: sort, reverse, table (load entire file)
│ └── Alternative: sqlp with ORDER BY LIMIT instead of sort
└── > 1GB:
    ├── Must: index + stats cache
    ├── Repeated SQL: convert to Parquet with qsv_to_parquet
    ├── Must: Polars commands only for joins/queries
    ├── Avoid: all 🤯 commands
    └── Consider: split into chunks, process, cat rows
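The decision tree above can be sketched as a shell function that maps a file size in bytes to a one-line recommendation. `qsv_plan` is a hypothetical name. Treating 1MB as 1024 × 1024 bytes, and putting each boundary value in the larger tier, are assumptions, since the tree does not specify either.

```shell
# Hypothetical helper: echo the recommended strategy for a file size in bytes.
qsv_plan() {
  bytes="$1"
  mb=$((1024 * 1024))   # assumption: MB means MiB
  if [ "$bytes" -lt $((10 * mb)) ]; then
    echo "any command"
  elif [ "$bytes" -lt $((100 * mb)) ]; then
    echo "index first; prefer streaming commands"
  elif [ "$bytes" -lt $((1024 * mb)) ]; then
    echo "index + stats cache; prefer sqlp/joinp/pivotp; avoid sort/reverse/table"
  else
    echo "index + stats cache; Polars only for joins/queries; consider split + cat rows"
  fi
}

# Example usage (data.csv is a placeholder):
#   qsv_plan "$(($(wc -c < data.csv)))"
```

The function only prints advice; it does not run qsv, so it is safe to call from scripts that decide which commands to issue next.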
| Tip | Why |
|-----|-----|
| Use --output file.csv | Avoids stdout buffering overhead |
| Use count before stats | Fast row count for progress bars |
| Use select early in pipeline | Reduce columns = faster processing |
| Use --no-headers only when needed | Header detection is cheap |
| Use slice --len N for previews | Don't read entire file to inspect |
| Prefer joinp over join | Polars engine is significantly faster |
| Use frequency --limit N | Don't compute all unique values |
| Use stats --cardinality | Enables smart optimizations downstream |
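Two of the tips above (select early, preview with slice) combine naturally into a short pipeline. The sketch below uses a hypothetical helper name, `preview_columns`, and the column names in the usage comment are placeholders.

```shell
# Hypothetical helper: keep only the named columns, then show the first N rows.
# $1 = CSV path, $2 = comma-separated column list, $3 = row count (default 10)
preview_columns() {
  csv="$1"
  cols="$2"
  n="${3:-10}"
  # select runs first so slice only handles the columns of interest.
  qsv select "$cols" "$csv" | qsv slice --len "$n"
}

# Example usage (requires qsv on PATH; names are placeholders):
#   preview_columns data.csv id,amount 5
#   qsv frequency --limit 10 data.csv
#   qsv stats --cardinality data.csv --output stats.csv
```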
The MCP server limits concurrent qsv operations (default: 1). For multiple independent files, the agent can issue separate tool calls.
QSV_MCP_OPERATION_TIMEOUT_MS)