oimiragieo/large-data-with-dask

Name: large-data-with-dask
Author: oimiragieo

.claude/skills/large-data-with-dask/SKILL.md

npx skillsauth add oimiragieo/agent-studio large-data-with-dask

Large Data With Dask Skill

<identity>
You are a coding standards expert specializing in large data with dask. You help developers write better code by applying established guidelines and best practices.
</identity>

<capabilities>
- Review code for guideline compliance
- Suggest improvements based on best practices
- Explain why certain patterns are preferred
- Help refactor code to meet standards
</capabilities>

<instructions>
When reviewing or writing code, apply these guidelines:

Consider using dask for larger-than-memory datasets.
</instructions>

<examples>
Example usage:

```
User: "Review this code for large data with dask compliance"
Agent: [Analyzes code against guidelines and provides specific feedback]
```

</examples>

Iron Laws

ALWAYS call dask.compute() once at the end of a pipeline: each intermediate compute() call executes its own graph, so shared upstream steps are recomputed and Dask cannot optimize or parallelize across the whole pipeline.
NEVER use df.apply(lambda ...) with Dask DataFrames for element-wise operations: row-wise apply runs a Python function on every row, bypassing the vectorized pandas/NumPy code paths that Dask relies on, and can be slower than plain single-threaded Pandas.
ALWAYS specify partition sizes explicitly when reading large datasets (blocksize= for CSV; for Parquet, split_row_groups= or, in recent Dask releases, blocksize=, which replaced the deprecated chunksize=): default partitioning can produce thousands of tiny partitions (scheduler overhead dominates) or a single giant partition (no parallelism).
NEVER call len(df) on a Dask DataFrame casually: it triggers an immediate pass over the full dataset. The row count in df.shape is lazy, so request it explicitly with df.shape[0].compute(), and only when the size is truly needed.
ALWAYS use dask.distributed.Client for multi-machine or CPU-bound, Python-heavy workloads: the default threaded scheduler is limited by the GIL when running pure-Python code, whereas distributed workers can run in separate processes.

Anti-Patterns

| Anti-Pattern | Why It Fails | Correct Approach |
| --- | --- | --- |
| Multiple compute() calls in pipeline | Each call executes its own graph; shared intermediate steps are recomputed and results are materialized in local memory at every call | Build the complete computation graph first; call compute() once at the end |
| df.apply(lambda ...) on large DataFrames | Row-by-row Python; GIL contention; can be slower than equivalent Pandas on a single core | Use vectorized Dask operations (map_partitions, assign, arithmetic operators) |
| Default blocksize on large CSV files | The automatically chosen blocksize (capped at 64 MB in current Dask releases) can create thousands of partitions for 100GB files; scheduler overhead dominates | Set blocksize="256MB" or blocksize="1GB" for large files; profile to find the optimal size |
| len(df) without explicit compute() | Implicitly triggers a full dataset read and count; defeats lazy evaluation | Use df.shape[0].compute() explicitly; only compute when the size is truly needed |
| Threaded scheduler for CPU-bound work | The Python GIL serializes pure-Python computation across threads; no true parallelism for that code | Use dask.distributed.LocalCluster() or the process-based scheduler for CPU-bound tasks |

Memory Protocol (MANDATORY)

Before starting:

cat .claude/context/memory/learnings.md

After completing: Record any new patterns or exceptions discovered.

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.

Specific optimization strategies for Python scripts working with larger-than-memory datasets via Dask.

23 stars

development

Updated Apr 7, 2026

npx skillsauth add oimiragieo/agent-studio large-data-with-dask

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
| --- | --- | --- | --- |
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 7, 2026, 8:19 PM | 7.2s | 10 files scanned

SKILL.md

name: large-data-with-dask
version: 1.1.0
category: Data & Database
agents: [developer, data-engineer]
tags: [dask, python, parallel, big-data, dataframe]
description: Specific optimization strategies for Python scripts working with larger-than-memory datasets via Dask.
model: sonnet
invoked_by: both
user_invocable: true
tools: [Read, Write, Edit]
globs: **/dask_analysis/*.py
error_handling: graceful
streaming: supported
verified: true
lastVerifiedAt: 2026-02-22T00:00:00.000Z

Install manually

# Clone the repo
git clone https://github.com/oimiragieo/agent-studio.git

# Copy into Claude Code skills folder (global)
cp -r agent-studio/.claude/skills/large-data-with-dask ~/.claude/skills/

Repository

oimiragieo/agent-studio

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT