skills/data-analysis/SKILL.md
Data analysis workflows with pandas, Polars, and DuckDB for exploration, cleaning, and transformation. Use when: performing exploratory data analysis, cleaning messy datasets, computing aggregations and statistics, writing Jupyter/Marimo notebook workflows, choosing between pandas and Polars, querying Parquet files with DuckDB, building reproducible analysis notebooks, auditing existing analysis code for performance or correctness, or designing post-pipeline transform steps.
npx skillsauth add michaelsvanbeek/personal-agent-skills data-analysisInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
| Tool | Best For | Install |
|------|----------|---------|
| Polars | Large datasets, performance-critical transforms, lazy evaluation, type safety | uv add polars |
| pandas | Quick exploration, broad ecosystem compatibility, small-to-medium data | uv add pandas |
| DuckDB | SQL-based analytics on Parquet/CSV without loading into memory | uv add duckdb |
Follow this sequence when encountering a new dataset:
import polars as pl
df = pl.read_parquet("data/raw/users.parquet")
print(f"Rows: {df.height:,}, Columns: {df.width}")
print(df.schema)
print(df.head(5))
null_report = df.select(
pl.all().null_count().name.suffix("_nulls"),
).unpivot().sort("value", descending=True)
print(null_report)
# Percentage complete
completeness = df.select(
(1 - pl.all().null_count() / df.height).name.suffix("_complete"),
).unpivot().sort("value")
print(completeness)
# Numeric columns
print(df.select(pl.col(pl.NUMERIC_DTYPES)).describe())
# Categorical value counts
for col in df.select(pl.col(pl.String)).columns:
print(f"\n{col}:")
print(df.group_by(col).len().sort("len", descending=True).head(10))
# Quick histogram via DuckDB for large data
import duckdb
duckdb.sql("""
SELECT histogram(amount, 20) AS amount_hist
FROM 'data/raw/orders.parquet'
""").show()
# Outlier detection with IQR
q1, q3 = df.select(
pl.col("amount").quantile(0.25).alias("q1"),
pl.col("amount").quantile(0.75).alias("q3"),
).row(0)
iqr = q3 - q1
outliers = df.filter(
(pl.col("amount") < q1 - 1.5 * iqr) | (pl.col("amount") > q3 + 1.5 * iqr)
)
print(f"Outliers: {outliers.height} rows ({outliers.height / df.height:.1%})")
daily = (
df.with_columns(pl.col("created_at").cast(pl.Date).alias("date"))
.group_by("date")
.agg(pl.len().alias("count"))
.sort("date")
)
print(daily)
def clean_users(df: pl.DataFrame) -> pl.DataFrame:
"""Clean raw user data."""
return (
df
# Drop exact duplicates
.unique()
# Normalize text columns
.with_columns(
pl.col("email").str.to_lowercase().str.strip_chars(),
pl.col("name").str.strip_chars(),
)
# Filter invalid rows
.filter(pl.col("email").str.contains(r"^[^@]+@[^@]+\.[^@]+$"))
# Cast types
.with_columns(
pl.col("created_at").str.to_datetime("%Y-%m-%dT%H:%M:%SZ"),
)
# Drop columns not needed downstream
.drop("raw_html", "debug_info")
)
| Strategy | When | |----------|------| | Drop rows | Missing data is rare (<5%) and random | | Fill with default | Business logic defines a sensible default (e.g., 0 for missing counts) | | Forward/backward fill | Time-series data with sparse readings | | Flag as missing | Downstream analysis needs to distinguish "0" from "unknown" |
# Fill with default
df = df.with_columns(pl.col("country").fill_null("Unknown"))
# Forward fill time-series
df = df.sort("timestamp").with_columns(pl.col("temperature").forward_fill())
# Flag missing
df = df.with_columns(pl.col("amount").is_null().alias("amount_missing"))
summary = (
df.group_by("category")
.agg(
pl.len().alias("count"),
pl.col("amount").sum().alias("total"),
pl.col("amount").mean().alias("avg"),
pl.col("amount").median().alias("median"),
pl.col("amount").std().alias("std"),
pl.col("amount").quantile(0.95).alias("p95"),
)
.sort("total", descending=True)
)
df = df.with_columns(
# Running total per user
pl.col("amount")
.cum_sum()
.over("user_id")
.alias("running_total"),
# Rank within category
pl.col("amount")
.rank(method="dense", descending=True)
.over("category")
.alias("rank_in_category"),
)
monthly = (
df.sort("date")
.group_by_dynamic("date", every="1mo")
.agg(
pl.col("amount").sum().alias("monthly_total"),
pl.col("user_id").n_unique().alias("unique_users"),
)
)
Query Parquet files directly without loading into Python memory:
import duckdb
# Aggregate across partitioned Parquet files
result = duckdb.sql("""
SELECT
date_trunc('month', created_at) AS month,
country,
count(*) AS orders,
sum(amount) AS revenue,
avg(amount) AS avg_order_value
FROM 'data/raw/orders/**/*.parquet'
GROUP BY 1, 2
ORDER BY 1, 4 DESC
""")
# Convert to Polars for further manipulation
df = result.pl()
import duckdb
import polars as pl
# Register Polars DataFrame as a DuckDB table
df = pl.read_parquet("data/cleaned/users.parquet")
duckdb.sql("SELECT * FROM df WHERE country = 'US' LIMIT 10").show()
DuckDB automatically detects Polars and pandas DataFrames by variable name — no explicit registration needed.
Every analysis notebook should follow this structure:
import polars as pl
import duckdb
from pathlib import Path
# Display settings
pl.Config.set_tbl_rows(20)
pl.Config.set_fmt_str_lengths(80)
DATA_DIR = Path("data")
01_user_churn_analysis.ipynb, not Untitled3.ipynb.uv environment.For large datasets, use lazy evaluation to let Polars optimize the query plan:
result = (
pl.scan_parquet("data/raw/events/*.parquet") # lazy — no data loaded yet
.filter(pl.col("event_type") == "purchase")
.group_by("user_id")
.agg(pl.col("amount").sum().alias("total_spent"))
.sort("total_spent", descending=True)
.head(100)
.collect() # executes the optimized plan
)
scan_parquet (lazy) instead of read_parquet (eager) for datasets larger than available RAM.pl.scan_parquet(...).select(["id", "amount"]) avoids loading unused columns.collect(streaming=True) for datasets that don't fit in memory even after filtering.development
TypeScript coding standards and type safety conventions. Use when: creating TypeScript files, defining interfaces and types, writing type-safe code, reviewing TypeScript for type correctness, auditing a codebase for type safety gaps, eliminating any or ts-ignore usage, or improving strict-mode compliance. Covers strict typing, avoiding any and ts-ignore, discriminated unions, Zod runtime validation, immutability patterns, and proper type definitions.
testing
Writing clear, actionable tickets in any issue tracker (Jira, Linear, GitHub Issues, ServiceNow, etc.). Use when: creating epics, stories, tasks, bugs, or spikes; writing acceptance criteria; decomposing work for a sprint; linking dependencies between tickets; auditing backlog items for clarity; or coaching a team on ticket quality. Covers title conventions, description templates, acceptance criteria, decomposition rules, dependency linking, and org-specific pluggable configuration.
development
Testing strategy, patterns, and evaluation for software and LLM/AI systems. Use when: writing tests, choosing test boundaries, designing test data, structuring test suites, evaluating LLM outputs, building evaluation pipelines, setting coverage thresholds, auditing test coverage gaps in existing projects, or improving test quality and structure.
development
Writing effective status updates for different audiences and cadences. Use when: writing a weekly status update, preparing a monthly summary, drafting a quarterly review, sending updates to leadership, sharing progress with stakeholders, or improving the clarity and impact of team communications. Covers weekly, monthly, and quarterly formats tailored for upward, lateral, and downward communication.