.claude/skills/large-data-with-dask/SKILL.md
Specific optimization strategies for Python scripts working with larger-than-memory datasets via Dask.
npx skillsauth add oimiragieo/agent-studio large-data-with-dask
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
- Call `dask.compute()` only once, at the end of a pipeline. Each intermediate `compute()` call materializes its result and cuts the lazy graph, so Dask can no longer fuse and parallelize operations across the whole pipeline.
- Avoid `df.apply(lambda ...)` on Dask DataFrames for element-wise operations. Pandas-style `apply` runs row-by-row Python code that bypasses the vectorized C-level routines and can end up slower than single-threaded Pandas.
- Set partition sizes explicitly (`blocksize=` for CSV; for Parquet, `split_row_groups=` or `blocksize=` in recent Dask releases, where older releases exposed `chunksize=`). Auto-detected partition sizes can produce thousands of tiny partitions (scheduler overhead dominates) or a single giant partition (no parallelism).
- Do not call `len(df)` on a Dask DataFrame casually: it triggers an immediate pass over the full dataset. The row component of `df.shape` is lazy, so request the row count explicitly with `df.shape[0].compute()`, and only when the size is actually needed.
- Use `dask.distributed.Client` for multi-machine or CPU-bound workloads. The default threaded scheduler is held back by the GIL on Python-heavy operations; the distributed scheduler runs tasks in separate worker processes and so avoids that limit.

| Anti-Pattern | Why It Fails | Correct Approach |
| ------------------------------------------ | ------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
| Multiple `compute()` calls in a pipeline | Breaks the lazy graph; forces data to materialize in memory at each call, and shared upstream steps may be recomputed | Build the complete computation graph first; call `compute()` once at the end |
| `df.apply(lambda ...)` on large DataFrames | Row-by-row Python; GIL contention; can be slower than the equivalent Pandas code on a single core | Use vectorized Dask operations (`map_partitions`, `assign`, arithmetic operators) |
| Default `blocksize` on large CSV files | The default is derived from available memory and core count and capped at 64 MB, so a 100 GB file yields well over a thousand partitions and scheduler overhead dominates | Set `blocksize="256MB"` or `blocksize="1GB"` for large files; profile to find the optimal size |
| Casual `len(df)` calls | Triggers an immediate read and count of the full dataset; defeats lazy evaluation | Use `df.shape[0].compute()` explicitly; only compute when the size is truly needed |
| Threaded scheduler for CPU-bound work | The Python GIL serializes CPU computation across threads; no true parallelism | Use a `dask.distributed.Client` (for example backed by `LocalCluster()`) or the process-based scheduler for CPU tasks |
Before starting:
cat .claude/context/memory/learnings.md
After completing: Record any new patterns or exceptions discovered.
ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.