
oaustegard/exploring-data

Name: exploring-data
Author: oaustegard

exploring-data/SKILL.md

npx skillsauth add oaustegard/claude-skills exploring-data


Exploring Data

0. Route by size FIRST

ls -la <filepath>   # or: wc -l for row estimate

< 200MB and < ~5M rows → ydata-profiling path (section A). Exact stats, interactive HTML.
Larger → large-file path (section B). ydata-profiling loads everything into pandas and will crawl or OOM; the DuckDB/sketch path runs in fixed memory at any size.
Task-specific ops (any size): duplicates, join feasibility, drift → section C.
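The routing rule above can be sketched as a small Python helper. This is only an illustration: the function names are hypothetical, and reading "200MB" as 200 × 1024 × 1024 bytes is an assumption, since the skill does not say which unit convention it uses.

```python
import os

# Thresholds taken from the routing rule above; the byte interpretation of
# "200MB" is an assumption.
SIZE_LIMIT_BYTES = 200 * 1024 * 1024
ROW_LIMIT = 5_000_000


def choose_path(size_bytes: int, row_count: int) -> str:
    """Return "A" (ydata-profiling) or "B" (DuckDB large-file path)."""
    if size_bytes < SIZE_LIMIT_BYTES and row_count < ROW_LIMIT:
        return "A"
    return "B"


def count_rows(path: str) -> int:
    """Rough row estimate by counting newlines, similar to `wc -l`."""
    with open(path, "rb") as fh:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: fh.read(1 << 20), b""))


def route_file(path: str) -> str:
    """Hypothetical convenience wrapper: measure the file, then route it."""
    return choose_path(os.path.getsize(path), count_rows(path))
```

Section C operations are independent of this choice, so a caller would still reach for them at any size when the question is about duplicates, joins, or drift.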

A. Standard path (ydata-profiling)

1. Check if installed (instant)

bash /mnt/skills/user/exploring-data/scripts/check_install.sh

Returns: installed or not_installed

2. Install if needed (one-time, ~19s)

if [ "$(bash /mnt/skills/user/exploring-data/scripts/check_install.sh)" = "not_installed" ]; then
    bash /mnt/skills/user/exploring-data/scripts/install_ydata.sh
fi

3. Run analysis (generates both JSON and HTML by default)

bash /mnt/skills/user/exploring-data/scripts/analyze.sh <filepath> [minimal|full] [html|json]

Defaults: minimal + html (also generates JSON)

Output:

eda_report.html - Interactive report for user
eda_report.json - Machine-readable for Claude analysis

4. If Claude needs to analyze (user asks "what do you think?" etc.)

python /mnt/skills/user/exploring-data/scripts/summarize_insights.py /mnt/user-data/outputs/eda_report.json

Claude should read the stdout markdown summary, NOT the full JSON report.

5. Present findings visually (don't just hand over the ydata HTML)

The ydata report is exhaustive but dense; a link to it is a weak deliverable. Turn the JSON into a compact dashboard of the findings that matter:

python3 /mnt/skills/user/exploring-data/scripts/visualize_findings.py \
    /mnt/user-data/outputs/eda_report.json
# → /mnt/user-data/outputs/eda_findings.html

Emits a single self-contained HTML file (Chart.js from cdnjs, dark-mode aware): missingness by column (tiered good/bad), the most skewed or zero-inflated numeric distributions as small-multiple histograms, and the largest categorical breakdowns. --top N caps charts per category (default 6). Also reads profile_large.py --json output, so the large-file path gets the same treatment.

Present BOTH files: eda_findings.html for the headline read, eda_report.html for the full drill-down. In a chat surface that renders inline visuals, prefer rendering the two or three findings that actually answer the user's question as inline charts over linking a file — a link the user has to open is the weakest form of "showing" data.

Modes

Minimal (default, 5-10s): overview, variable analysis, correlations, missing values, alerts
Full (10-20s): minimal + scatter matrices, sample data, character analysis

Full-mode triggers: "comprehensive analysis", "detailed EDA", "full profiling", "deep analysis". Otherwise minimal.
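As a rough illustration of the trigger rule, the mode choice could be expressed as a case-insensitive phrase match. The function name `pick_mode` is hypothetical, and plain substring matching is an assumption; the skill itself leaves the matching to Claude's judgment.

```python
# Trigger phrases copied from the list above, lowercased for matching.
FULL_MODE_TRIGGERS = (
    "comprehensive analysis",
    "detailed eda",
    "full profiling",
    "deep analysis",
)


def pick_mode(user_request: str) -> str:
    """Return "full" if the request contains a trigger phrase, else "minimal"."""
    text = user_request.lower()
    return "full" if any(trigger in text for trigger in FULL_MODE_TRIGGERS) else "minimal"
```

The returned string matches the second positional argument accepted by analyze.sh.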

Time series

If the data has a datetime index/column and the user cares about temporal behavior (gaps, trends, seasonality, autocorrelation), pass tsmode=True to ProfileReport — run the venv python directly instead of analyze.sh:

ProfileReport(df, tsmode=True, sortby="<datetime_col>", title=...)

This adds gap detection, stationarity and seasonality checks that the default report omits.

Small-file drift

Comparing two versions of a dataset that BOTH fit in memory: use ydata's native compare — ProfileReport(df_a).compare(ProfileReport(df_b)).to_file(...). For files too big to load, or comparing against a months-old file you no longer have, use the sketch snapshot/drift ops in section C.

B. Large-file path (DuckDB, fixed memory)

1. Install deps (idempotent, ~10s first time)

bash /mnt/skills/user/exploring-data/scripts/install_large.sh

2. Profile

python3 /mnt/skills/user/exploring-data/scripts/profile_large.py <file> [--json out.json]

Streams the file through DuckDB: per-column null%, approximate distinct counts (HLL), min/max/mean, approximate quantiles (t-digest) for numerics, top-5 values for strings, plus quality flags (mostly-null, constant, id-like columns). Markdown lands on stdout — read it directly, no summarize step needed. Handles csv/tsv/parquet/json/ndjson. 1M rows profiles in seconds; memory is flat regardless of file size.
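The quality flags mentioned above can be pictured as simple rules over per-column counts. The sketch below is not the logic in profile_large.py: the function name and both cutoffs are assumptions chosen for illustration.

```python
from typing import List


def quality_flags(rows: int, nulls: int, distinct: int,
                  null_cutoff: float = 0.95, id_cutoff: float = 0.99) -> List[str]:
    """Flag a column from its row count, null count, and (approximate) distinct count.

    The cutoffs are hypothetical defaults, not values taken from the skill.
    """
    flags = []
    non_null = rows - nulls
    if rows and nulls / rows >= null_cutoff:
        flags.append("mostly-null")
    if non_null and distinct == 1:
        flags.append("constant")
    # Nearly one distinct value per non-null row suggests an identifier column.
    if non_null > 1 and distinct / non_null >= id_cutoff:
        flags.append("id-like")
    return flags
```

Because the distinct counts come from HLL and are approximate, a ratio cutoff slightly below 1.0 is more forgiving than an exact equality check.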

For ad-hoc follow-up queries on the same large file, use DuckDB SQL directly (duckdb.connect().execute("SELECT ... FROM read_csv_auto('...')")) rather than loading pandas.

C. Sketch ops (any file size, fixed memory)

All via scripts/sketch_ops.py (deps from install_large.sh). These answer questions profilers don't:

Near-duplicate rows

python3 sketch_ops.py dups <file> [--threshold 0.9] [--cols a,b,c]

Exact duplicates counted by hash; near-duplicates via MinHash LSH over row tokens. Use --cols to restrict to the columns that define identity.
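To make the two techniques concrete, here is a minimal standard-library sketch of exact-duplicate hashing and MinHash similarity over row tokens. It is not the code in sketch_ops.py: the tokenization (column index plus normalized value), the 64 hash functions, and all function names are assumptions, and the LSH banding step that avoids comparing every pair is omitted.

```python
import hashlib

NUM_HASHES = 64  # assumed signature length, chosen for illustration


def exact_duplicate_key(row) -> str:
    """Hash of the raw row; identical rows share a key."""
    joined = "\x1f".join(str(value) for value in row)
    return hashlib.sha256(joined.encode()).hexdigest()


def row_tokens(row) -> set:
    """Hypothetical tokenization: column index plus trimmed, lowercased value."""
    return {f"{i}={str(value).strip().lower()}" for i, value in enumerate(row)}


def _hash(seed: int, token: str) -> int:
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(tokens: set, num_hashes: int = NUM_HASHES) -> list:
    """One minimum per seeded hash function; tokens must be non-empty."""
    return [min(_hash(seed, token) for token in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

A pair of rows would be reported as near-duplicates when the estimate meets the chosen threshold (0.9 in the command above). Restricting the row to identity columns before tokenizing plays the role of --cols.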

Key overlap / join feasibility

python3 sketch_ops.py overlap <fileA> <fileB> --key <col> [--key-b <col>]

Theta sketches per key column → estimated intersection, Jaccard, and "% of A's keys in B" both ways — answers "will this join hold?" without loading either file.
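The quantities reported by the overlap command are easiest to read against their exact definitions. The sketch below computes them with ordinary Python sets on small in-memory key lists; it is a reference for what the Theta sketches estimate, not a replacement for them, and `key_overlap` is a hypothetical name.

```python
def key_overlap(keys_a, keys_b) -> dict:
    """Exact intersection size, Jaccard, and containment in both directions."""
    a, b = set(keys_a), set(keys_b)
    inter = len(a & b)
    union = len(a | b)
    return {
        "intersection": inter,
        "jaccard": inter / union if union else 0.0,
        "pct_a_in_b": 100.0 * inter / len(a) if a else 0.0,
        "pct_b_in_a": 100.0 * inter / len(b) if b else 0.0,
    }
```

A low Jaccard with a high "% of A's keys in B" usually means B is a superset of A, so a left join from A would still hold.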

Drift vs stored baseline

python3 sketch_ops.py snapshot <file> --out baseline.sketch.json   # ~20KB
python3 sketch_ops.py drift <newfile> --baseline baseline.sketch.json

Snapshot serializes HLL (all columns) + KLL quantile sketches (numeric columns) to a small JSON. Drift reports schema changes, >10% shifts in distinct counts, and IQR-relative quantile movement. The snapshot is small (about 20KB in the example above), so store it (repo, memory) and diff next month's delivery against it without keeping the original file.

Note: snapshot/dups stream rows through Python (~1M rows in a few seconds); profile_large is pure DuckDB and faster. For a quick look at a big file, profile first, sketch ops only when the question calls for them.
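The three drift checks described above can be sketched with exact statistics from the standard library. This is an illustration only: sketch_ops.py works from HLL and KLL sketches rather than raw values, and the exact formula it uses for "IQR-relative" movement is not stated, so the median-shift-over-baseline-IQR definition here is an assumption, as are the function names.

```python
import statistics


def schema_changes(old_cols, new_cols) -> dict:
    """Columns added or removed between baseline and new file."""
    old, new = set(old_cols), set(new_cols)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def distinct_count_shifted(old_count: int, new_count: int, tolerance: float = 0.10) -> bool:
    """True when the distinct count moved by more than the tolerance (10% above)."""
    if old_count == 0:
        return new_count > 0
    return abs(new_count - old_count) / old_count > tolerance


def iqr_relative_shift(baseline, current) -> float:
    """Assumed definition: |median shift| divided by the baseline IQR."""
    q1, q2, q3 = statistics.quantiles(baseline, n=4)
    _, new_q2, _ = statistics.quantiles(current, n=4)
    iqr = q3 - q1
    if iqr == 0:
        return 0.0 if new_q2 == q2 else float("inf")
    return abs(new_q2 - q2) / iqr
```

Scaling by the IQR keeps the measure comparable across columns with very different units, which is presumably why the drift report expresses movement that way.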


Exploratory data analysis. Use when users upload .csv/.xlsx/.json/.parquet files or request "explore data", "analyze dataset", "EDA", "profile data". Small files get ydata-profiling HTML/JSON reports; large files (>200MB or >5M rows) get fixed-memory DuckDB/sketch profiling. Also covers near-duplicate row detection, cross-file key overlap ("can these join?"), dataset drift vs a stored baseline, and time-series profiling.

132 stars

development

Updated Jul 21, 2026


npx skillsauth add oaustegard/claude-skills exploring-data

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners in report

Trivy - Container and dependency vulnerability scanner - Clean - 95%
Semgrep - Static code analysis for vulnerabilities - Clean - 95%
mcp-scan (Snyk) - Model Context Protocol security validation - Clean - 95%
Snyk (dep) - Open source security scanning - Skipped - 50%
Socket.dev - Supply chain security analysis - Skipped - 50%
VirusTotal - Multi-engine malware detection - Skipped - 50%
CrowdStrike - Advanced threat intelligence - Skipped - 50%
OSV-Scanner - Open Source Vulnerability database check - Skipped - 50%
OWASP Dep-Check - Skipped - 50%

Last scanned: Jul 21, 2026, 6:26 AM (37.6s, 12 files scanned)

SKILL.md

name: exploring-data
description: Exploratory data analysis. Use when users upload .csv/.xlsx/.json/.parquet files or request "explore data", "analyze dataset", "EDA", "profile data". Small files get ydata-profiling HTML/JSON reports; large files (>200MB or >5M rows) get fixed-memory DuckDB/sketch profiling. Also covers near-duplicate row detection, cross-file key overlap ("can these join?"), dataset drift vs a stored baseline, and time-series profiling.
version: 0.1.1



Install manually

# Clone the repo
git clone https://github.com/oaustegard/claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r claude-skills/exploring-data ~/.claude/skills/


Repository

oaustegard/claude-skills


Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT