skills/17-DAAF-Contribution-Community-daaf/dot-claude/skills/polars/SKILL.md
Polars DataFrame library for high-performance data manipulation. Lazy/eager execution, expressions, I/O (CSV, Parquet, JSON), aggregations, joins, string/datetime ops, pandas interop. Use for Polars DataFrames or reading/writing Parquet files.
Polars DataFrame library for high-performance data manipulation in Python. Covers lazy/eager execution, expressions, I/O (CSV, Parquet, JSON, database), aggregations, joins, string/datetime operations, pandas/NumPy interop, and performance optimization. Use when working with Polars DataFrames, migrating from pandas, reading Parquet files, or optimizing data pipeline performance.
Comprehensive skill for high-performance data manipulation with Polars. Use the decision trees below to find the right guidance, then load the detailed references.
Polars is a fast DataFrame library for Python (and Rust).
This skill targets Polars 1.x (tested with 1.37.1). Key changes from 0.x:
- apply renamed to map_elements (0.19+)
- groupby renamed to group_by (0.19+)
- melt renamed to unpivot (1.0+)
- pl.Utf8 is now pl.String (1.0+, Utf8 still works as alias)

Each topic in ./references/ contains focused documentation:
| File | Purpose | When to Read |
|------|---------|--------------|
| quickstart.md | Installation, concepts, first DataFrame | Starting with Polars |
| dataframes-series.md | Creation, selection, filtering, modification | Basic data manipulation |
| io-data.md | CSV, Parquet, JSON, database I/O | Loading/saving data |
| expressions.md | Expression system, contexts, chaining | Understanding Polars idioms |
| aggregations-grouping.md | GroupBy, window functions, statistics | Summarizing data |
| joins-concat.md | Joins, concatenation, pivot/unpivot | Combining DataFrames |
| strings-datetime-categorical.md | String ops, datetime, categoricals | Type-specific operations |
| performance.md | Lazy execution, optimization, anti-patterns | Making code faster |
| interop.md | Pandas, NumPy, PyArrow, DuckDB | Working with other tools |
| gotchas.md | Common errors, anti-patterns, migration | Debugging issues |
- quickstart.md then expressions.md
- quickstart.md, expressions.md, then interop.md
- performance.md first

Getting started?
├─ Install Polars → ./references/quickstart.md
├─ Create first DataFrame → ./references/quickstart.md
├─ Understand lazy vs eager → ./references/quickstart.md
├─ Learn expression syntax → ./references/expressions.md
└─ Coming from Pandas → ./references/interop.md
Loading/saving data?
├─ Read CSV file → ./references/io-data.md
├─ Read Parquet (recommended) → ./references/io-data.md
├─ Read JSON/NDJSON → ./references/io-data.md
├─ Read from database → ./references/io-data.md
├─ Read multiple files (glob) → ./references/io-data.md
├─ Write to file → ./references/io-data.md
└─ Larger-than-memory data → ./references/performance.md
Filtering/selecting?
├─ Select columns by name → ./references/dataframes-series.md
├─ Select by pattern/regex → ./references/dataframes-series.md
├─ Select by data type → ./references/dataframes-series.md
├─ Filter rows by condition → ./references/dataframes-series.md
├─ Filter with multiple conditions → ./references/dataframes-series.md
├─ Handle null values → ./references/dataframes-series.md
└─ Add/modify columns → ./references/dataframes-series.md
Aggregating data?
├─ Basic statistics (sum, mean, etc.) → ./references/aggregations-grouping.md
├─ Group by columns → ./references/aggregations-grouping.md
├─ Multiple aggregations → ./references/aggregations-grouping.md
├─ Window functions (over) → ./references/aggregations-grouping.md
├─ Rolling/moving averages → ./references/aggregations-grouping.md
├─ Cumulative operations → ./references/aggregations-grouping.md
└─ Ranking within groups → ./references/aggregations-grouping.md
Combining data?
├─ Join two DataFrames → ./references/joins-concat.md
├─ Left/right/outer join → ./references/joins-concat.md
├─ Anti-join (not in) → ./references/joins-concat.md
├─ Concatenate vertically → ./references/joins-concat.md
├─ Pivot (long to wide) → ./references/joins-concat.md
└─ Unpivot/melt (wide to long) → ./references/joins-concat.md
Performance issues?
├─ Use lazy evaluation → ./references/performance.md
├─ Avoid row iteration → ./references/performance.md
├─ Reduce memory usage → ./references/performance.md
├─ Process large files → ./references/performance.md
├─ Optimize query plan → ./references/performance.md
└─ Common anti-patterns → ./references/performance.md
Having issues?
├─ Type errors → ./references/gotchas.md
├─ Null handling → ./references/gotchas.md
├─ Expression context errors → ./references/gotchas.md
├─ String operations → ./references/strings-datetime-categorical.md
├─ Date parsing issues → ./references/strings-datetime-categorical.md
├─ Performance problems → ./references/gotchas.md
├─ Pandas migration issues → ./references/gotchas.md
├─ Memory errors → ./references/gotchas.md
└─ General troubleshooting → ./references/gotchas.md
Important: In data research pipelines (see CLAUDE.md), Polars transformations are executed through script files, not interactively. This ensures auditability and reproducibility.
The pattern:
scripts/stage{N}_{type}/{step}_{task-name}.py

Closely read agent_reference/SCRIPT_EXECUTION_REFERENCE.md for the mandatory file-first execution protocol covering complete code file writing, output capture, and file versioning rules.
See:
- agent_reference/SCRIPT_EXECUTION_REFERENCE.md: Script execution protocol and format with validation

The examples below show Polars syntax. In research workflows, wrap them in scripts following the file-first pattern.
import polars as pl
import polars.selectors as cs # For column selection by type
# Eager: immediate execution
df = pl.read_csv("data.csv")
# Lazy: deferred, optimized execution (preferred for large data)
lf = pl.scan_csv("data.csv")
df = lf.collect() # Execute when ready
# Select columns
df.select("a", "b")
df.select(pl.col("a"), pl.col("b"))
df.select(pl.all().exclude("id"))
# Filter rows
df.filter(pl.col("a") > 10)
df.filter((pl.col("a") > 10) & (pl.col("b") == "x"))
# Add/modify columns
df.with_columns(
    (pl.col("a") * 2).alias("a_doubled"),
    pl.col("b").str.to_uppercase().alias("b_upper"),
)
# Conditional column
df.with_columns(
    pl.when(pl.col("a") > 10)
    .then(pl.lit("high"))
    .otherwise(pl.lit("low"))
    .alias("category")
)
# Group and aggregate
df.group_by("category").agg(
    pl.col("value").sum().alias("total"),
    pl.col("value").mean().alias("average"),
    pl.len().alias("count"),
)
| Function | Purpose |
|----------|---------|
| pl.col("name") | Reference a column |
| pl.lit(value) | Literal value |
| pl.all() | All columns |
| pl.exclude("col") | All except specified |
| pl.len() | Row count |
| pl.when().then().otherwise() | Conditional logic |
| .alias("name") | Rename result |
| .cast(pl.Int64) | Convert type |
| Type | Description |
|------|-------------|
| pl.Int64, pl.Int32 | Integers |
| pl.Float64, pl.Float32 | Floats |
| pl.String (or pl.Utf8) | Strings |
| pl.Boolean | True/False |
| pl.Date, pl.Datetime | Dates and timestamps |
| pl.Duration | Time differences |
| pl.Categorical | Categorical strings |
| pl.List | List of values |
| pl.Struct | Named fields |
# I/O
df = pl.read_csv("file.csv")  # Also: pl.read_parquet, pl.read_json
lf = pl.scan_csv("file.csv")  # Lazy; also: pl.scan_parquet, pl.scan_ndjson
df.write_csv("out.csv")  # Also: df.write_parquet, df.write_json
# Selection
df.select("a", "b")
df.select(cs.numeric()) # By type
# Filtering
df.filter(pl.col("a") > 1)
# Aggregation
df.group_by("key").agg(pl.col("val").sum())
# Joining
df1.join(df2, on="key", how="left")
# Sorting
df.sort("col", descending=True)
# Lazy execution
lf.collect() # Run query
lf.explain() # Show plan
| Topic | Reference File |
|-------|---------------|
| Installation | ./references/quickstart.md |
| DataFrame Creation | ./references/quickstart.md |
| Lazy vs Eager | ./references/quickstart.md |
| Column Selection | ./references/dataframes-series.md |
| Row Filtering | ./references/dataframes-series.md |
| Adding Columns | ./references/dataframes-series.md |
| CSV Files | ./references/io-data.md |
| Parquet Files | ./references/io-data.md |
| Database Connections | ./references/io-data.md |
| Expressions | ./references/expressions.md |
| Method Chaining | ./references/expressions.md |
| Contexts | ./references/expressions.md |
| GroupBy | ./references/aggregations-grouping.md |
| Window Functions | ./references/aggregations-grouping.md |
| Rolling Windows | ./references/aggregations-grouping.md |
| Joins | ./references/joins-concat.md |
| Concatenation | ./references/joins-concat.md |
| Pivot/Unpivot | ./references/joins-concat.md |
| String Operations | ./references/strings-datetime-categorical.md |
| Datetime Handling | ./references/strings-datetime-categorical.md |
| Categorical Data | ./references/strings-datetime-categorical.md |
| Query Optimization | ./references/performance.md |
| Memory Management | ./references/performance.md |
| Anti-Patterns | ./references/performance.md |
| Pandas Conversion | ./references/interop.md |
| NumPy Integration | ./references/interop.md |
| DuckDB Integration | ./references/interop.md |
| Type Errors | ./references/gotchas.md |
| qcut Label Gotcha | ./references/gotchas.md |
| Null Handling Issues | ./references/gotchas.md |
| Expression Context Errors | ./references/gotchas.md |
| Performance Anti-Patterns | ./references/gotchas.md |
| Migration from Pandas | ./references/gotchas.md |
| Memory Issues | ./references/gotchas.md |
When this library is used as a primary analytical tool, include in the report's Software & Tools references:
Vink, R. et al. Polars: Blazingly fast DataFrames [Computer software]. https://pola.rs/
Cite when: Polars is the core data processing engine for the analysis (typically always true in DAAF pipelines).
Do not cite when: Only used for trivial file I/O in a script primarily using another tool.