Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

harsh040506/data-exploration

Name: data-exploration
Author: harsh040506

business/data-analytics/data-exploration/SKILL.md

npx skillsauth add harsh040506/claude-code-unified-skill-plugin-library data-exploration

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Error

VirusTotalMulti-engine malware detection

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Data Exploration Skill

Systematic methodology for profiling datasets, assessing data quality, discovering patterns, and understanding schemas.

Data Profiling Methodology

Phase 1: Structural Understanding

Before analyzing any data, understand its structure:

Table-level questions:

How many rows and columns?
What is the grain (one row per what)?
What is the primary key? Is it unique?
When was the data last updated?
How far back does the data go?

Column classification: Categorize each column as one of:

Identifier: Unique keys, foreign keys, entity IDs
Dimension: Categorical attributes for grouping/filtering (status, type, region, category)
Metric: Quantitative values for measurement (revenue, count, duration, score)
Temporal: Dates and timestamps (created_at, updated_at, event_date)
Text: Free-form text fields (description, notes, name)
Boolean: True/false flags
Structural: JSON, arrays, nested structures

Phase 2: Column-Level Profiling

For each column, compute:

All columns:

Null count and null rate
Distinct count and cardinality ratio (distinct / total)
Most common values (top 5-10 with frequencies)
Least common values (bottom 5 to spot anomalies)

Numeric columns (metrics):

min, max, mean, median (p50)
standard deviation
percentiles: p1, p5, p25, p75, p95, p99
zero count
negative count (if unexpected)

String columns (dimensions, text):

min length, max length, avg length
empty string count
pattern analysis (do values follow a format?)
case consistency (all upper, all lower, mixed?)
leading/trailing whitespace count

Date/timestamp columns:

min date, max date
null dates
future dates (if unexpected)
distribution by month/week
gaps in time series

Boolean columns:

true count, false count, null count
true rate

Phase 3: Relationship Discovery

After profiling individual columns:

Foreign key candidates: ID columns that might link to other tables
Hierarchies: Columns that form natural drill-down paths (country > state > city)
Correlations: Numeric columns that move together
Derived columns: Columns that appear to be computed from others
Redundant columns: Columns with identical or near-identical information

Quality Assessment Framework

Completeness Score

Rate each column:

Complete (>99% non-null): Green
Mostly complete (95-99%): Yellow -- investigate the nulls
Incomplete (80-95%): Orange -- understand why and whether it matters
Sparse (<80%): Red -- may not be usable without imputation

Consistency Checks

Look for:

Value format inconsistency: Same concept represented differently ("USA", "US", "United States", "us")
Type inconsistency: Numbers stored as strings, dates in various formats
Referential integrity: Foreign keys that don't match any parent record
Business rule violations: Negative quantities, end dates before start dates, percentages > 100
Cross-column consistency: Status = "completed" but completed_at is null

Accuracy Indicators

Red flags that suggest accuracy issues:

Placeholder values: 0, -1, 999999, "N/A", "TBD", "test", "xxx"
Default values: Suspiciously high frequency of a single value
Stale data: Updated_at shows no recent changes in an active system
Impossible values: Ages > 150, dates in the far future, negative durations
Round number bias: All values ending in 0 or 5 (suggests estimation, not measurement)

Timeliness Assessment

When was the table last updated?
What is the expected update frequency?
Is there a lag between event time and load time?
Are there gaps in the time series?

Pattern Discovery Techniques

Distribution Analysis

For numeric columns, characterize the distribution:

Normal: Mean and median are close, bell-shaped
Skewed right: Long tail of high values (common for revenue, session duration)
Skewed left: Long tail of low values (less common)
Bimodal: Two peaks (suggests two distinct populations)
Power law: Few very large values, many small ones (common for user activity)
Uniform: Roughly equal frequency across range (often synthetic or random)

Temporal Patterns

For time series data, look for:

Trend: Sustained upward or downward movement
Seasonality: Repeating patterns (weekly, monthly, quarterly, annual)
Day-of-week effects: Weekday vs. weekend differences
Holiday effects: Drops or spikes around known holidays
Change points: Sudden shifts in level or trend
Anomalies: Individual data points that break the pattern

Segmentation Discovery

Identify natural segments by:

Finding categorical columns with 3-20 distinct values
Comparing metric distributions across segment values
Looking for segments with significantly different behavior
Testing whether segments are homogeneous or contain sub-segments

Correlation Exploration

Between numeric columns:

Compute correlation matrix for all metric pairs
Flag strong correlations (|r| > 0.7) for investigation
Note: Correlation does not imply causation -- flag this explicitly
Check for non-linear relationships (e.g., quadratic, logarithmic)

Schema Understanding and Documentation

Schema Documentation Template

When documenting a dataset for team use:

## Table: [schema.table_name]

**Description**: [What this table represents]
**Grain**: [One row per...]
**Primary Key**: [column(s)]
**Row Count**: [approximate, with date]
**Update Frequency**: [real-time / hourly / daily / weekly]
**Owner**: [team or person responsible]

### Key Columns

| Column | Type | Description | Example Values | Notes |
|--------|------|-------------|----------------|-------|
| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id |
| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events |
| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column |

### Relationships
- Joins to `users` on `user_id`
- Joins to `products` on `product_id`
- Parent of `event_details` (1:many on event_id)

### Known Issues
- [List any known data quality issues]
- [Note any gotchas for analysts]

### Common Query Patterns
- [Typical use cases for this table]

Schema Exploration Queries

When connected to a data warehouse, use these patterns to discover schema:

-- List all tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

-- Column details (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'my_table'
ORDER BY ordinal_position;

-- Table sizes (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- Row counts for all tables (general pattern)
-- Run per-table: SELECT COUNT(*) FROM table_name

Lineage and Dependencies

When exploring an unfamiliar data environment:

Start with the "output" tables (what reports or dashboards consume)
Trace upstream: What tables feed into them?
Identify raw/staging/mart layers
Map the transformation chain from raw data to analytical tables
Note where data is enriched, filtered, or aggregated

harsh040506/data-exploration

business/data-analytics/data-exploration/SKILL.md

Profile and explore datasets to understand their shape, quality, and patterns before analysis. Use when encountering a new dataset, assessing data quality, discovering column distributions, identifying nulls and outliers, or deciding which dimensions to analyze.

2 stars

testing

Updated Apr 5, 2026

$ install --global

skillsauth

npx skillsauth add harsh040506/claude-code-unified-skill-plugin-library data-exploration

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Error

VirusTotalMulti-engine malware detection

70%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Mar 25, 2026, 9:03 PM43.6s1 file scanned

SKILL.md

name:: data-exploration
description:: Profile and explore datasets to understand their shape, quality, and patterns before analysis. Use when encountering a new dataset, assessing data quality, discovering column distributions, identifying nulls and outliers, or deciding which dimensions to analyze.

Data Exploration Skill

Systematic methodology for profiling datasets, assessing data quality, discovering patterns, and understanding schemas.

Data Profiling Methodology

Phase 1: Structural Understanding

Before analyzing any data, understand its structure:

Table-level questions:

How many rows and columns?
What is the grain (one row per what)?
What is the primary key? Is it unique?
When was the data last updated?
How far back does the data go?

Column classification: Categorize each column as one of:

Identifier: Unique keys, foreign keys, entity IDs
Dimension: Categorical attributes for grouping/filtering (status, type, region, category)
Metric: Quantitative values for measurement (revenue, count, duration, score)
Temporal: Dates and timestamps (created_at, updated_at, event_date)
Text: Free-form text fields (description, notes, name)
Boolean: True/false flags
Structural: JSON, arrays, nested structures

Phase 2: Column-Level Profiling

For each column, compute:

All columns:

Null count and null rate
Distinct count and cardinality ratio (distinct / total)
Most common values (top 5-10 with frequencies)
Least common values (bottom 5 to spot anomalies)

Numeric columns (metrics):

min, max, mean, median (p50)
standard deviation
percentiles: p1, p5, p25, p75, p95, p99
zero count
negative count (if unexpected)

String columns (dimensions, text):

min length, max length, avg length
empty string count
pattern analysis (do values follow a format?)
case consistency (all upper, all lower, mixed?)
leading/trailing whitespace count

Date/timestamp columns:

min date, max date
null dates
future dates (if unexpected)
distribution by month/week
gaps in time series

Boolean columns:

true count, false count, null count
true rate

Phase 3: Relationship Discovery

After profiling individual columns:

Foreign key candidates: ID columns that might link to other tables
Hierarchies: Columns that form natural drill-down paths (country > state > city)
Correlations: Numeric columns that move together
Derived columns: Columns that appear to be computed from others
Redundant columns: Columns with identical or near-identical information

Quality Assessment Framework

Completeness Score

Rate each column:

Complete (>99% non-null): Green
Mostly complete (95-99%): Yellow -- investigate the nulls
Incomplete (80-95%): Orange -- understand why and whether it matters
Sparse (<80%): Red -- may not be usable without imputation

Consistency Checks

Look for:

Value format inconsistency: Same concept represented differently ("USA", "US", "United States", "us")
Type inconsistency: Numbers stored as strings, dates in various formats
Referential integrity: Foreign keys that don't match any parent record
Business rule violations: Negative quantities, end dates before start dates, percentages > 100
Cross-column consistency: Status = "completed" but completed_at is null

Accuracy Indicators

Red flags that suggest accuracy issues:

Placeholder values: 0, -1, 999999, "N/A", "TBD", "test", "xxx"
Default values: Suspiciously high frequency of a single value
Stale data: Updated_at shows no recent changes in an active system
Impossible values: Ages > 150, dates in the far future, negative durations
Round number bias: All values ending in 0 or 5 (suggests estimation, not measurement)

Timeliness Assessment

When was the table last updated?
What is the expected update frequency?
Is there a lag between event time and load time?
Are there gaps in the time series?

Pattern Discovery Techniques

Distribution Analysis

For numeric columns, characterize the distribution:

Normal: Mean and median are close, bell-shaped
Skewed right: Long tail of high values (common for revenue, session duration)
Skewed left: Long tail of low values (less common)
Bimodal: Two peaks (suggests two distinct populations)
Power law: Few very large values, many small ones (common for user activity)
Uniform: Roughly equal frequency across range (often synthetic or random)

Temporal Patterns

For time series data, look for:

Trend: Sustained upward or downward movement
Seasonality: Repeating patterns (weekly, monthly, quarterly, annual)
Day-of-week effects: Weekday vs. weekend differences
Holiday effects: Drops or spikes around known holidays
Change points: Sudden shifts in level or trend
Anomalies: Individual data points that break the pattern

Segmentation Discovery

Identify natural segments by:

Finding categorical columns with 3-20 distinct values
Comparing metric distributions across segment values
Looking for segments with significantly different behavior
Testing whether segments are homogeneous or contain sub-segments

Correlation Exploration

Between numeric columns:

Compute correlation matrix for all metric pairs
Flag strong correlations (|r| > 0.7) for investigation
Note: Correlation does not imply causation -- flag this explicitly
Check for non-linear relationships (e.g., quadratic, logarithmic)

Schema Understanding and Documentation

Schema Documentation Template

When documenting a dataset for team use:

## Table: [schema.table_name]

**Description**: [What this table represents]
**Grain**: [One row per...]
**Primary Key**: [column(s)]
**Row Count**: [approximate, with date]
**Update Frequency**: [real-time / hourly / daily / weekly]
**Owner**: [team or person responsible]

### Key Columns

| Column | Type | Description | Example Values | Notes |
|--------|------|-------------|----------------|-------|
| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id |
| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events |
| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column |

### Relationships
- Joins to `users` on `user_id`
- Joins to `products` on `product_id`
- Parent of `event_details` (1:many on event_id)

### Known Issues
- [List any known data quality issues]
- [Note any gotchas for analysts]

### Common Query Patterns
- [Typical use cases for this table]

Schema Exploration Queries

When connected to a data warehouse, use these patterns to discover schema:

-- List all tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

-- Column details (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'my_table'
ORDER BY ordinal_position;

-- Table sizes (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- Row counts for all tables (general pattern)
-- Run per-table: SELECT COUNT(*) FROM table_name

Lineage and Dependencies

When exploring an unfamiliar data environment:

Start with the "output" tables (what reports or dashboards consume)
Trace upstream: What tables feed into them?
Identify raw/staging/mart layers
Map the transformation chain from raw data to analytical tables
Note where data is enriched, filtered, or aggregated

Related Skills

harsh040506/single-cell-rna-qc

testing

VerifiedTrustedCommunity

Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations. Use when users request QC analysis, filtering low-quality cells, assessing data quality, or following scverse/scanpy best practices for single-cell analysis.

2SKILL.mdUpdated Apr 5, 2026

harsh040506/single-cell-rna-qc

harsh040506/scvi-tools

tools

VerifiedTrustedCommunity

Deep learning for single-cell analysis using scvi-tools. This skill should be used when users need (1) data integration and batch correction with scVI/scANVI, (2) ATAC-seq analysis with PeakVI, (3) CITE-seq multi-modal analysis with totalVI, (4) multiome RNA+ATAC analysis with MultiVI, (5) spatial transcriptomics deconvolution with DestVI, (6) label transfer and reference mapping with scANVI/scArches, (7) RNA velocity with veloVI, or (8) any deep learning-based single-cell method. Triggers include mentions of scVI, scANVI, totalVI, PeakVI, MultiVI, DestVI, veloVI, sysVI, scArches, variational autoencoder, VAE, batch correction, data integration, multi-modal, CITE-seq, multiome, reference mapping, latent space.

2SKILL.mdUpdated Apr 5, 2026

harsh040506/scvi-tools

harsh040506/scientific-problem-selection

testing

VerifiedTrustedCommunity

This skill should be used when scientists need help with research problem selection, project ideation, troubleshooting stuck projects, or strategic scientific decisions. Use this skill when users ask to pitch a new research idea, work through a project problem, evaluate project risks, plan research strategy, navigate decision trees, or get help choosing what scientific problem to work on. Typical requests include "I have an idea for a project", "I'm stuck on my research", "help me evaluate this project", "what should I work on", or "I need strategic advice about my research".

2SKILL.mdUpdated Apr 5, 2026

harsh040506/scientific-problem-selection

harsh040506/nextflow-development

development

VerifiedTrustedCommunity

Run nf-core bioinformatics pipelines (rnaseq, sarek, atacseq) on sequencing data. Use when analyzing RNA-seq, WGS/WES, or ATAC-seq data—either local FASTQs or public datasets from GEO/SRA. Triggers on nf-core, Nextflow, FASTQ analysis, variant calling, gene expression, differential expression, GEO reanalysis, GSE/GSM/SRR accessions, or samplesheet creation.

2SKILL.mdUpdated Apr 5, 2026

harsh040506/nextflow-development

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/harsh040506/claude-code-unified-skill-plugin-library.git

# Copy into Claude Code skills folder (global)
cp -r claude-code-unified-skill-plugin-library/business/data-analytics/data-exploration ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

harsh040506/claude-code-unified-skill-plugin-library

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT