claude/skills/duckdb/SKILL.md
--- description: DuckDB for SQL analytics, data analysis, and persistent local databases. Use when: (1) analyzing local files (CSV, JSON, Parquet) with SQL — aggregations, joins, window functions, statistics; (2) working with a .duckdb file as a pipeline storage layer — reading, writing, updating records; (3) using the Python DuckDB API — duckdb.connect(), con.execute(), .df(), UPDATE/DELETE/INSERT via Python; (4) querying a project-specific database such as staffing.duckdb, slack_qa.duckdb, or
npx skillsauth add lanej/dotfiles claude/skills/duckdbInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
DuckDB is an in-process SQL OLAP database designed for fast analytical queries on local data files. Use this skill for data analysis, statistics, aggregations, and complex queries.
Perfect for:
NOT for:
xlsx, xsv, or cat instead)grep or jq)xsv for simpler cases)Common project databases in this workspace:
areas/staffing/staffing.duckdb — org headcount (use staffing skill)data/slack_qa.duckdb — Slack Q&A analysis pipeline storagedata/*.duckdb — project-specific pipeline databasesDuckDB reads files directly without importing:
-- Query CSV directly
SELECT * FROM 'data.csv' LIMIT 10;
-- Query multiple files with glob patterns
SELECT * FROM 'data/*.parquet';
-- Query JSON
SELECT * FROM 'events.json';
DuckDB can read files without loading them into memory:
DuckDB infers schemas automatically:
-- DuckDB detects column types
DESCRIBE SELECT * FROM 'data.csv';
duckdb -c "SELECT * FROM 'data.csv' WHERE amount > 100"
# Create database file
duckdb mydata.db -c "CREATE TABLE users AS SELECT * FROM 'users.csv'"
# Query later
duckdb mydata.db -c "SELECT COUNT(*) FROM users"
# To CSV
duckdb -c "COPY (SELECT * FROM 'data.csv' WHERE x > 10) TO 'filtered.csv'"
# To Parquet (smaller, faster)
duckdb -c "COPY (SELECT * FROM 'data.csv') TO 'data.parquet'"
# To JSON
duckdb -c "COPY (SELECT * FROM 'data.csv') TO 'output.json'"
# Union all CSV files
duckdb -c "SELECT * FROM 'data/*.csv'"
# Join across files
duckdb -c "
SELECT u.name, o.total
FROM 'users.csv' u
JOIN 'orders.csv' o ON u.id = o.user_id
"
-- Summary statistics
SELECT
COUNT(*) as count,
AVG(amount) as avg_amount,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) as median,
STDDEV(amount) as stddev,
MIN(amount) as min,
MAX(amount) as max
FROM 'transactions.csv';
-- Group by analysis
SELECT
category,
COUNT(*) as total_count,
SUM(amount) as total_amount,
AVG(amount) as avg_amount
FROM 'sales.csv'
GROUP BY category
ORDER BY total_amount DESC;
-- Percentiles
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) as p25,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY value) as p50,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as p75,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) as p95
FROM 'metrics.csv';
-- Running totals
SELECT
date,
amount,
SUM(amount) OVER (ORDER BY date) as running_total
FROM 'transactions.csv'
ORDER BY date;
-- Rank by category
SELECT
category,
product,
sales,
RANK() OVER (PARTITION BY category ORDER BY sales DESC) as rank
FROM 'products.csv';
-- Inner join
SELECT
u.name,
o.order_date,
o.total
FROM 'users.csv' u
INNER JOIN 'orders.csv' o ON u.id = o.user_id;
-- Left join with aggregation
SELECT
u.name,
COUNT(o.id) as order_count,
COALESCE(SUM(o.total), 0) as total_spent
FROM 'users.csv' u
LEFT JOIN 'orders.csv' o ON u.id = o.user_id
GROUP BY u.name;
-- Complex WHERE clauses
SELECT * FROM 'data.csv'
WHERE date >= '2024-01-01'
AND category IN ('A', 'B', 'C')
AND amount > 100;
-- CASE expressions
SELECT
name,
amount,
CASE
WHEN amount < 100 THEN 'Small'
WHEN amount < 1000 THEN 'Medium'
ELSE 'Large'
END as size_category
FROM 'transactions.csv';
-- Inspect schema
DESCRIBE SELECT * FROM 'data.csv';
-- Count nulls
SELECT
COUNT(*) - COUNT(column_name) as null_count,
COUNT(*) as total_count,
(COUNT(*) - COUNT(column_name))::FLOAT / COUNT(*) as null_percentage
FROM 'data.csv';
-- Find duplicates
SELECT
column_name,
COUNT(*) as count
FROM 'data.csv'
GROUP BY column_name
HAVING COUNT(*) > 1;
-- Value distribution
SELECT
column_name,
COUNT(*) as frequency,
COUNT(*)::FLOAT / SUM(COUNT(*)) OVER () as percentage
FROM 'data.csv'
GROUP BY column_name
ORDER BY frequency DESC
LIMIT 20;
-- Auto-detect everything
SELECT * FROM 'data.csv';
-- Manual options
SELECT * FROM read_csv('data.csv',
delim=';',
header=true,
columns={'id': 'INTEGER', 'name': 'VARCHAR'}
);
-- Handle messy CSV
SELECT * FROM read_csv('messy.csv',
auto_detect=true,
ignore_errors=true,
max_line_size=1048576
);
-- Line-delimited JSON
SELECT * FROM 'events.ndjson';
-- Standard JSON array
SELECT * FROM read_json('data.json', format='array');
-- Nested JSON
SELECT
data->>'$.user.name' as user_name,
data->>'$.event.type' as event_type
FROM 'events.json';
-- Direct query (fastest for large files)
SELECT * FROM 'data.parquet';
-- Multiple parquet files
SELECT * FROM 'data/*.parquet';
-- Requires spatial extension
INSTALL spatial;
LOAD spatial;
SELECT * FROM st_read('data.xlsx');
WITH monthly_totals AS (
SELECT
DATE_TRUNC('month', date) as month,
SUM(amount) as total
FROM 'transactions.csv'
GROUP BY month
)
SELECT
month,
total,
total - LAG(total) OVER (ORDER BY month) as month_over_month_change
FROM monthly_totals;
SELECT
category,
total,
total / (SELECT SUM(amount) FROM 'sales.csv') as percentage_of_total
FROM (
SELECT category, SUM(amount) as total
FROM 'sales.csv'
GROUP BY category
) subquery
ORDER BY percentage_of_total DESC;
-- Pivot data
PIVOT (
SELECT category, month, amount FROM 'sales.csv'
) ON month USING SUM(amount);
# Default table format
duckdb -c "SELECT * FROM 'data.csv' LIMIT 10"
# Markdown table
duckdb -markdown -c "SELECT * FROM 'data.csv' LIMIT 10"
# JSON output
duckdb -json -c "SELECT * FROM 'data.csv' LIMIT 10"
# CSV output
duckdb -csv -c "SELECT * FROM 'data.csv' LIMIT 10"
# Use box drawing for tables (default)
duckdb -box -c "SELECT * FROM 'data.csv' LIMIT 10"
# ASCII-only output
duckdb -ascii -c "SELECT * FROM 'data.csv' LIMIT 10"
# Convert CSV to Parquet once
duckdb -c "COPY (SELECT * FROM 'large.csv') TO 'large.parquet'"
# Query Parquet (much faster)
duckdb -c "SELECT * FROM 'large.parquet' WHERE x > 100"
-- Good: Filter before aggregation
SELECT category, COUNT(*)
FROM 'data.csv'
WHERE date >= '2024-01-01'
GROUP BY category;
-- Less efficient: Filter after aggregation
SELECT * FROM (
SELECT category, COUNT(*) as cnt FROM 'data.csv' GROUP BY category
)
WHERE cnt > 100;
-- Better: Select only needed columns
SELECT id, name, amount FROM 'large.csv';
-- Slower: Select all columns
SELECT * FROM 'large.csv';
-- Use LIMIT for exploration
SELECT * FROM 'large.csv' LIMIT 100;
-- Use TABLESAMPLE for random sample
SELECT * FROM 'large.csv' USING SAMPLE 1000 ROWS;
# Extract structured data from unstructured source, then analyze
conform extract messy_report.pdf --schema invoice_schema.json > invoices.json
duckdb -c "
SELECT
vendor,
SUM(amount) as total_amount,
COUNT(*) as invoice_count
FROM 'invoices.json'
GROUP BY vendor
ORDER BY total_amount DESC
"
# Parse unstructured text → structured CSV → analysis
conform extract survey_responses.txt --output survey.csv
duckdb -c "
SELECT
sentiment,
COUNT(*) as count,
AVG(satisfaction_score) as avg_score
FROM 'survey.csv'
GROUP BY sentiment
"
# xsv filter → DuckDB aggregate
xsv search -s status "active" data.csv | duckdb -c "SELECT AVG(amount) FROM read_csv('/dev/stdin')"
# jq → DuckDB
jq -c '.' events.json | duckdb -c "SELECT COUNT(*) FROM read_json_auto('/dev/stdin')"
# DuckDB query → CSV → xlsx
duckdb -c "COPY (SELECT * FROM 'data.parquet' WHERE x > 100) TO 'filtered.csv'"
xlsx view filtered.csv
# DuckDB local analysis → BigQuery upload
duckdb -c "COPY (SELECT * FROM 'local.csv' WHERE important = true) TO 'important.csv'"
bigquery insert dataset.table important.csv
SELECT
DATE_TRUNC('day', timestamp) as day,
COUNT(*) as event_count,
COUNT(DISTINCT user_id) as unique_users,
AVG(duration) as avg_duration
FROM 'events.csv'
GROUP BY day
ORDER BY day;
WITH first_purchase AS (
SELECT user_id, MIN(purchase_date) as cohort_date
FROM 'purchases.csv'
GROUP BY user_id
)
SELECT
DATE_TRUNC('month', cohort_date) as cohort_month,
COUNT(DISTINCT user_id) as cohort_size,
COUNT(DISTINCT CASE WHEN purchase_date < cohort_date + INTERVAL 30 DAYS THEN user_id END) as retained_30d
FROM 'purchases.csv' p
JOIN first_purchase f ON p.user_id = f.user_id
GROUP BY cohort_month;
-- Histogram bins
SELECT
FLOOR(value / 10) * 10 as bin,
COUNT(*) as frequency
FROM 'metrics.csv'
GROUP BY bin
ORDER BY bin;
rowcount is unreliable for UPDATE/DELETEWhen running UPDATE or DELETE via the Python DuckDB API (con.execute(sql)), .rowcount always returns -1 (or a negative multiple if called in a loop), not the actual number of affected rows. The update did execute correctly. To verify affected rows, run a follow-up SELECT COUNT(*) with the same WHERE clause before and after, or query the changed rows directly. This is distinct from SELECT queries where .fetchall() / .df() return the actual result.
# Bad: rowcount is always -1 for UPDATE/DELETE
con.execute("UPDATE users SET active = false WHERE last_login < '2024-01-01'")
print(con.rowcount) # -1, not the actual affected row count
# Good: verify with a follow-up query
affected = con.execute(
"SELECT COUNT(*) FROM users WHERE last_login < '2024-01-01'"
).fetchone()[0]
con.execute("UPDATE users SET active = false WHERE last_login < '2024-01-01'")
print(f"Updated {affected} rows")
# Use single quotes for file paths
duckdb -c "SELECT * FROM 'data.csv'"
# Escape quotes in nested queries
duckdb -c "SELECT * FROM 'data.csv' WHERE name = 'O''Brien'"
# Use .mode for better formatting
duckdb -c ".mode line; SELECT * FROM 'large_result.csv'"
# Or export instead of displaying
duckdb -c "COPY (SELECT * FROM 'data.csv') TO 'output.csv'"
-- Force type casting
SELECT CAST(column AS INTEGER) FROM 'data.csv';
-- Or specify schema
SELECT * FROM read_csv('data.csv', columns={'id': 'INTEGER', 'date': 'DATE'});
LIMIT and DESCRIBE to understand data before complex queries# Basic query
duckdb -c "SELECT * FROM 'data.csv' LIMIT 10"
# Statistics
duckdb -c "SELECT COUNT(*), AVG(amount), STDDEV(amount) FROM 'data.csv'"
# Group by
duckdb -c "SELECT category, SUM(amount) FROM 'data.csv' GROUP BY category"
# Join files
duckdb -c "SELECT * FROM 'users.csv' u JOIN 'orders.csv' o ON u.id = o.user_id"
# Export
duckdb -c "COPY (SELECT * FROM 'data.csv' WHERE x > 10) TO 'filtered.csv'"
# Markdown output
duckdb -markdown -c "SELECT * FROM 'data.csv' LIMIT 5"
data-ai
Delegate research and context-gathering tasks to a sub-agent to protect the primary context window. Use when the user asks to "research X", "look into X", "find out about X", "gather context on X", or any investigative framing where answering requires 2+ searches or multiple sources. Also use proactively before starting substantive work when prior context is unknown. Never run research inline — always delegate.
documentation
--- name: qmd-math description: Math notation conventions for Quarto/EPQ documents rendered via lualatex. Use when: writing or adding a formula, equation, or mathematical expression to a .qmd file; asked about display math, inline math, or LaTeX notation in a QMD/Quarto context; defining a where-clause or variable definitions for an equation; converting prose variable descriptions into structured math notation; fixing math that renders badly in a PDF; using \lvert, \begin{aligned}, \tfrac, \text
development
Trim a prose document (README, design doc, blog post, notes) for readability by cutting redundancy, filler, and dead weight in the author's own words. Invoke with /trim [file path], or /trim alone to be prompted for a file. Not for source code, data files, or summarization.
business
Query and analyze Josh Lane's org headcount from the staffing DuckDB at ~/workspace/areas/staffing/staffing.duckdb. Use when asked about headcount counts, org structure, direct reports, team breakdown, hiring/attrition trends, international employees, salary/pay grade distribution, offboarding lag, or any question about people in Josh's org. Triggers on questions about how many people, who reports to whom, headcount by team/country/level, who joined or left, org size, staffing, headcount trend.