skills/batch-processing-optimizer/SKILL.md
Spark, pandas, polars, DuckDB optimization for batch data processing. Activate on: batch processing, Spark optimization, polars, DuckDB, pandas performance, data frame, shuffle, partition, memory optimization. NOT for: streaming pipelines (use streaming-pipeline-architect), warehouse queries (use data-warehouse-optimizer).
Optimize batch data processing workloads using Spark, Polars, DuckDB, and pandas with focus on memory efficiency, parallelism, and cost reduction.
Activate on: "batch processing", "Spark optimization", "Polars", "DuckDB", "pandas performance", "data frame", "shuffle optimization", "partition skew", "memory optimization", "out of memory"
NOT for: Real-time streaming → streaming-pipeline-architect | Warehouse SQL tuning → data-warehouse-optimizer | Pipeline orchestration → airflow-dag-orchestrator
| Domain | Technologies |
|--------|--------------|
| Distributed | Apache Spark 3.5+, Dask, Ray |
| Single-Node | DuckDB 1.1+, Polars 1.x, pandas 2.2+ |
| File Formats | Parquet, Arrow IPC, Delta Lake, Iceberg |
| Optimization | AQE (Spark), lazy evaluation (Polars), columnar scans |
| Cloud | Databricks, EMR, Dataproc, serverless Spark |
Data Size?
├─ < 10 GB → DuckDB (SQL) or Polars (DataFrame)
│     Single machine, zero setup, fastest iteration
│
├─ 10-100 GB → Polars (lazy) or DuckDB (out-of-core)
│     Still single machine with spill-to-disk
│
└─ > 100 GB → Spark (distributed)
      Multi-node cluster, shuffle-based joins
Complexity?
├─ SQL-centric → DuckDB (fast embedded SQL engine for analytics)
├─ DataFrame → Polars (often several times faster than pandas, lazy evaluation)
└─ Complex ML → Spark + MLlib or Spark + Ray
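
The two decision trees above can be folded into one small function. This is only a sketch: `choose_engine`, `Recommendation`, and the workload labels `"sql"`, `"dataframe"`, and `"ml"` are hypothetical names invented here, and the 10 GB and 100 GB thresholds are the rough guides from the trees, not hard limits.

```python
from typing import NamedTuple


class Recommendation(NamedTuple):
    """Hypothetical result type for this sketch."""
    engine: str
    mode: str


def choose_engine(size_gb: float, workload: str = "dataframe") -> Recommendation:
    """Map data size and workload style to an engine, following the trees above.

    workload is one of "sql", "dataframe", or "ml" (labels assumed for this sketch).
    """
    if workload not in {"sql", "dataframe", "ml"}:
        raise ValueError(f"unknown workload: {workload!r}")
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")

    # Above 100 GB, or for complex ML, the trees point at distributed Spark.
    if size_gb > 100 or workload == "ml":
        return Recommendation("spark", "multi-node cluster")

    # Up to 100 GB stays on one machine; from 10 GB up, expect spill-to-disk.
    mode = "single machine, spill-to-disk" if size_gb >= 10 else "single machine"
    engine = "duckdb" if workload == "sql" else "polars"
    return Recommendation(engine, mode)
```

For example, `choose_engine(50, "sql")` recommends DuckDB on a single machine with spill-to-disk, while any `"ml"` workload is routed to Spark regardless of size.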
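from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Enable Adaptive Query Execution (AQE): at runtime Spark coalesces small
# shuffle partitions and splits skewed partitions in sort-merge joins
spark = SparkSession.builder \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.adaptive.skewJoin.enabled", "true") \
    .getOrCreate()
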
# GOOD: broadcast small dimension table (< 100MB)
# (large_df and small_dim_df stand for DataFrames loaded elsewhere)
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_dim_df), "key")

# GOOD: predicate pushdown: filter right after the read, before any join
orders = spark.read.parquet("s3://data/orders/") \
    .filter(F.col("order_date") >= "2026-01-01") \
    .select("order_id", "customer_id", "amount")  # column pruning

# BAD: collect() on a large dataset pulls every row to the driver and causes OOM
# all_data = large_df.collect()  # NEVER do this

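# GOOD: write partitioned output (assumes result has an order_date column).
# Repartitioning on the partition column keeps each date in a single task,
# which avoids writing up to 200 small files per date.
result.repartition(200, "order_date") \
    .write.mode("overwrite") \
    .partitionBy("order_date") \
    .parquet("s3://output/results/")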
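
When an explicit partition count is passed to `repartition`, it helps to derive it from the data volume instead of hard-coding it. The sketch below uses a target of roughly 128 MiB per partition, which is a common rule of thumb and an assumption here, not something the source prescribes. `target_partition_count` is a hypothetical helper and is not part of the Spark API.

```python
def target_partition_count(
    total_bytes: int,
    target_partition_bytes: int = 128 * 1024**2,  # assumed ~128 MiB per partition
    min_partitions: int = 1,
) -> int:
    """Estimate how many partitions keep each one near the target size."""
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    needed = -(-total_bytes // target_partition_bytes)
    return max(min_partitions, needed)
```

Under that assumption, about 25 GiB of shuffle data works out to 200 partitions. With AQE partition coalescing enabled, Spark can still merge partitions that turn out too small at runtime, so the estimate only needs to be in the right range.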
from datetime import date

import polars as pl

# Lazy mode: builds query plan, optimizes, then executes
result = (
    pl.scan_parquet("data/orders/*.parquet")  # lazy scan
    # Assumes order_date is stored as a Date column; recent Polars versions
    # raise when a Date column is compared with a plain string
    .filter(pl.col("order_date") >= date(2026, 1, 1))
    .join(
        pl.scan_parquet("data/customers/*.parquet"),
        on="customer_id",
        how="inner",
    )
    .group_by("region")
    .agg([
        pl.col("amount").sum().alias("total_revenue"),
        pl.col("order_id").n_unique().alias("order_count"),
    ])
    .sort("total_revenue", descending=True)
    .collect()  # executes optimized plan
)

# Polars optimizes: predicate pushdown, projection pushdown, and
# join reordering, all applied automatically by the lazy optimizer

- `df.collect()` or `df.toPandas()` on large Spark DataFrames causes OOM on the driver; aggregate first.
- Avoid `collect()` on large datasets (aggregate before collecting).
- Inspect query plans: Spark `.explain()`, DuckDB `EXPLAIN ANALYZE`.
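
The "aggregate before collecting" rule can be illustrated without any engine at all. The sketch below is plain Python, not Spark or Polars code: `revenue_by_region` is a hypothetical function, and the `region` and `amount` fields are borrowed from the Polars example. Each batch is folded into a small running total, so memory grows with the number of regions and not with the number of rows.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def revenue_by_region(batches: Iterable[List[dict]]) -> Dict[str, float]:
    """Sum `amount` per `region`, holding only one batch in memory at a time."""
    totals: Dict[str, float] = defaultdict(float)
    for batch in batches:
        for row in batch:
            # Only the running totals survive the loop; the rows are discarded.
            totals[row["region"]] += row["amount"]
    return dict(totals)
```

In Spark the same idea means running the `groupBy(...).agg(...)` on the executors and collecting only the aggregated result.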