Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

harsh040506/data-validation

Name: data-validation
Author: harsh040506

business/data-analytics/data-validation/SKILL.md

npx skillsauth add harsh040506/claude-code-unified-skill-plugin-library data-validation

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Error

VirusTotalMulti-engine malware detection

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Data Validation Skill

Pre-delivery QA checklist, common data analysis pitfalls, result sanity checking, and documentation standards for reproducibility.

Pre-Delivery QA Checklist

Run through this checklist before sharing any analysis with stakeholders.

Data Quality Checks

[ ] Source verification: Confirmed which tables/data sources were used. Are they the right ones for this question?
[ ] Freshness: Data is current enough for the analysis. Noted the "as of" date.
[ ] Completeness: No unexpected gaps in time series or missing segments.
[ ] Null handling: Checked null rates in key columns. Nulls are handled appropriately (excluded, imputed, or flagged).
[ ] Deduplication: Confirmed no double-counting from bad joins or duplicate source records.
[ ] Filter verification: All WHERE clauses and filters are correct. No unintended exclusions.

Calculation Checks

[ ] Aggregation logic: GROUP BY includes all non-aggregated columns. Aggregation level matches the analysis grain.
[ ] Denominator correctness: Rate and percentage calculations use the right denominator. Denominators are non-zero.
[ ] Date alignment: Comparisons use the same time period length. Partial periods are excluded or noted.
[ ] Join correctness: JOIN types are appropriate (INNER vs LEFT). Many-to-many joins haven't inflated counts.
[ ] Metric definitions: Metrics match how stakeholders define them. Any deviations are noted.
[ ] Subtotals sum: Parts add up to the whole where expected. If they don't, explain why (e.g., overlap).

Reasonableness Checks

[ ] Magnitude: Numbers are in a plausible range. Revenue isn't negative. Percentages are between 0-100%.
[ ] Trend continuity: No unexplained jumps or drops in time series.
[ ] Cross-reference: Key numbers match other known sources (dashboards, previous reports, finance data).
[ ] Order of magnitude: Total revenue is in the right ballpark. User counts match known figures.
[ ] Edge cases: What happens at the boundaries? Empty segments, zero-activity periods, new entities.

Presentation Checks

[ ] Chart accuracy: Bar charts start at zero. Axes are labeled. Scales are consistent across panels.
[ ] Number formatting: Appropriate precision. Consistent currency/percentage formatting. Thousands separators where needed.
[ ] Title clarity: Titles state the insight, not just the metric. Date ranges are specified.
[ ] Caveat transparency: Known limitations and assumptions are stated explicitly.
[ ] Reproducibility: Someone else could recreate this analysis from the documentation provided.

Common Data Analysis Pitfalls

Join Explosion

The problem: A many-to-many join silently multiplies rows, inflating counts and sums.

How to detect:

-- Check row count before and after join
SELECT COUNT(*) FROM table_a;  -- 1,000
SELECT COUNT(*) FROM table_a a JOIN table_b b ON a.id = b.a_id;  -- 3,500 (uh oh)

How to prevent:

Always check row counts after joins
If counts increase, investigate the join relationship (is it really 1:1 or 1:many?)
Use COUNT(DISTINCT a.id) instead of COUNT(*) when counting entities through joins

Survivorship Bias

The problem: Analyzing only entities that exist today, ignoring those that were deleted, churned, or failed.

Examples:

Analyzing user behavior of "current users" misses churned users
Looking at "companies using our product" ignores those who evaluated and left
Studying properties of "successful" outcomes without "unsuccessful" ones

How to prevent: Ask "who is NOT in this dataset?" before drawing conclusions.

Incomplete Period Comparison

The problem: Comparing a partial period to a full period.

Examples:

"January revenue is $500K vs. December's $800K" -- but January isn't over yet
"This week's signups are down" -- checked on Wednesday, comparing to a full prior week

How to prevent: Always filter to complete periods, or compare same-day-of-month / same-number-of-days.

Denominator Shifting

The problem: The denominator changes between periods, making rates incomparable.

Examples:

Conversion rate improves because you changed how you count "eligible" users
Churn rate changes because the definition of "active" was updated

How to prevent: Use consistent definitions across all compared periods. Note any definition changes.

Average of Averages

The problem: Averaging pre-computed averages gives wrong results when group sizes differ.

Example:

Group A: 100 users, average revenue $50
Group B: 10 users, average revenue $200
Wrong: Average of averages = ($50 + $200) / 2 = $125
Right: Weighted average = (100*$50 + 10*$200) / 110 = $63.64

How to prevent: Always aggregate from raw data. Never average pre-aggregated averages.

Timezone Mismatches

The problem: Different data sources use different timezones, causing misalignment.

Examples:

Event timestamps in UTC vs. user-facing dates in local time
Daily rollups that use different cutoff times

How to prevent: Standardize all timestamps to a single timezone (UTC recommended) before analysis. Document the timezone used.

Selection Bias in Segmentation

The problem: Segments are defined by the outcome you're measuring, creating circular logic.

Examples:

"Users who completed onboarding have higher retention" -- obviously, they self-selected
"Power users generate more revenue" -- they became power users BY generating revenue

How to prevent: Define segments based on pre-treatment characteristics, not outcomes.

Result Sanity Checking

Magnitude Checks

For any key number in your analysis, verify it passes the "smell test":

| Metric Type | Sanity Check | |---|---| | User counts | Does this match known MAU/DAU figures? | | Revenue | Is this in the right order of magnitude vs. known ARR? | | Conversion rates | Is this between 0% and 100%? Does it match dashboard figures? | | Growth rates | Is 50%+ MoM growth realistic, or is there a data issue? | | Averages | Is the average reasonable given what you know about the distribution? | | Percentages | Do segment percentages sum to ~100%? |

Cross-Validation Techniques

Calculate the same metric two different ways and verify they match
Spot-check individual records -- pick a few specific entities and trace their data manually
Compare to known benchmarks -- match against published dashboards, finance reports, or prior analyses
Reverse engineer -- if total revenue is X, does per-user revenue times user count approximately equal X?
Boundary checks -- what happens when you filter to a single day, a single user, or a single category? Are those micro-results sensible?

Red Flags That Warrant Investigation

Any metric that changed by more than 50% period-over-period without an obvious cause
Counts or sums that are exact round numbers (suggests a filter or default value issue)
Rates exactly at 0% or 100% (may indicate incomplete data)
Results that perfectly confirm the hypothesis (reality is usually messier)
Identical values across time periods or segments (suggests the query is ignoring a dimension)

Documentation Standards for Reproducibility

Analysis Documentation Template

Every non-trivial analysis should include:

## Analysis: [Title]

### Question
[The specific question being answered]

### Data Sources
- Table: [schema.table_name] (as of [date])
- Table: [schema.other_table] (as of [date])
- File: [filename] (source: [where it came from])

### Definitions
- [Metric A]: [Exactly how it's calculated]
- [Segment X]: [Exactly how membership is determined]
- [Time period]: [Start date] to [end date], [timezone]

### Methodology
1. [Step 1 of the analysis approach]
2. [Step 2]
3. [Step 3]

### Assumptions and Limitations
- [Assumption 1 and why it's reasonable]
- [Limitation 1 and its potential impact on conclusions]

### Key Findings
1. [Finding 1 with supporting evidence]
2. [Finding 2 with supporting evidence]

### SQL Queries
[All queries used, with comments]

### Caveats
- [Things the reader should know before acting on this]

Code Documentation

For any code (SQL, Python) that may be reused:

"""
Analysis: Monthly Cohort Retention
Author: [Name]
Date: [Date]
Data Source: events table, users table
Last Validated: [Date] -- results matched dashboard within 2%

Purpose:
    Calculate monthly user retention cohorts based on first activity date.

Assumptions:
    - "Active" means at least one event in the month
    - Excludes test/internal accounts (user_type != 'internal')
    - Uses UTC dates throughout

Output:
    Cohort retention matrix with cohort_month rows and months_since_signup columns.
    Values are retention rates (0-100%).
"""

Version Control for Analyses

Save queries and code in version control (git) or a shared docs system
Note the date of the data snapshot used
If an analysis is re-run with updated data, document what changed and why
Link to prior versions of recurring analyses for trend comparison

harsh040506/data-validation

business/data-analytics/data-validation/SKILL.md

QA an analysis before sharing with stakeholders — methodology checks, accuracy verification, and bias detection. Use when reviewing an analysis for errors, checking for survivorship bias, validating aggregation logic, or preparing documentation for reproducibility.

2 stars

testing

Updated Apr 5, 2026

$ install --global

skillsauth

npx skillsauth add harsh040506/claude-code-unified-skill-plugin-library data-validation

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Error

VirusTotalMulti-engine malware detection

70%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Mar 25, 2026, 9:03 PM42.9s1 file scanned

SKILL.md

name:: data-validation
description:: QA an analysis before sharing with stakeholders — methodology checks, accuracy verification, and bias detection. Use when reviewing an analysis for errors, checking for survivorship bias, validating aggregation logic, or preparing documentation for reproducibility.

Data Validation Skill

Pre-delivery QA checklist, common data analysis pitfalls, result sanity checking, and documentation standards for reproducibility.

Pre-Delivery QA Checklist

Run through this checklist before sharing any analysis with stakeholders.

Data Quality Checks

[ ] Source verification: Confirmed which tables/data sources were used. Are they the right ones for this question?
[ ] Freshness: Data is current enough for the analysis. Noted the "as of" date.
[ ] Completeness: No unexpected gaps in time series or missing segments.
[ ] Null handling: Checked null rates in key columns. Nulls are handled appropriately (excluded, imputed, or flagged).
[ ] Deduplication: Confirmed no double-counting from bad joins or duplicate source records.
[ ] Filter verification: All WHERE clauses and filters are correct. No unintended exclusions.

Calculation Checks

[ ] Aggregation logic: GROUP BY includes all non-aggregated columns. Aggregation level matches the analysis grain.
[ ] Denominator correctness: Rate and percentage calculations use the right denominator. Denominators are non-zero.
[ ] Date alignment: Comparisons use the same time period length. Partial periods are excluded or noted.
[ ] Join correctness: JOIN types are appropriate (INNER vs LEFT). Many-to-many joins haven't inflated counts.
[ ] Metric definitions: Metrics match how stakeholders define them. Any deviations are noted.
[ ] Subtotals sum: Parts add up to the whole where expected. If they don't, explain why (e.g., overlap).

Reasonableness Checks

[ ] Magnitude: Numbers are in a plausible range. Revenue isn't negative. Percentages are between 0-100%.
[ ] Trend continuity: No unexplained jumps or drops in time series.
[ ] Cross-reference: Key numbers match other known sources (dashboards, previous reports, finance data).
[ ] Order of magnitude: Total revenue is in the right ballpark. User counts match known figures.
[ ] Edge cases: What happens at the boundaries? Empty segments, zero-activity periods, new entities.

Presentation Checks

[ ] Chart accuracy: Bar charts start at zero. Axes are labeled. Scales are consistent across panels.
[ ] Number formatting: Appropriate precision. Consistent currency/percentage formatting. Thousands separators where needed.
[ ] Title clarity: Titles state the insight, not just the metric. Date ranges are specified.
[ ] Caveat transparency: Known limitations and assumptions are stated explicitly.
[ ] Reproducibility: Someone else could recreate this analysis from the documentation provided.

Common Data Analysis Pitfalls

Join Explosion

The problem: A many-to-many join silently multiplies rows, inflating counts and sums.

How to detect:

-- Check row count before and after join
SELECT COUNT(*) FROM table_a;  -- 1,000
SELECT COUNT(*) FROM table_a a JOIN table_b b ON a.id = b.a_id;  -- 3,500 (uh oh)

How to prevent:

Always check row counts after joins
If counts increase, investigate the join relationship (is it really 1:1 or 1:many?)
Use COUNT(DISTINCT a.id) instead of COUNT(*) when counting entities through joins

Survivorship Bias

The problem: Analyzing only entities that exist today, ignoring those that were deleted, churned, or failed.

Examples:

Analyzing user behavior of "current users" misses churned users
Looking at "companies using our product" ignores those who evaluated and left
Studying properties of "successful" outcomes without "unsuccessful" ones

How to prevent: Ask "who is NOT in this dataset?" before drawing conclusions.

Incomplete Period Comparison

The problem: Comparing a partial period to a full period.

Examples:

"January revenue is $500K vs. December's $800K" -- but January isn't over yet
"This week's signups are down" -- checked on Wednesday, comparing to a full prior week

How to prevent: Always filter to complete periods, or compare same-day-of-month / same-number-of-days.

Denominator Shifting

The problem: The denominator changes between periods, making rates incomparable.

Examples:

Conversion rate improves because you changed how you count "eligible" users
Churn rate changes because the definition of "active" was updated

How to prevent: Use consistent definitions across all compared periods. Note any definition changes.

Average of Averages

The problem: Averaging pre-computed averages gives wrong results when group sizes differ.

Example:

Group A: 100 users, average revenue $50
Group B: 10 users, average revenue $200
Wrong: Average of averages = ($50 + $200) / 2 = $125
Right: Weighted average = (100*$50 + 10*$200) / 110 = $63.64

How to prevent: Always aggregate from raw data. Never average pre-aggregated averages.

Timezone Mismatches

The problem: Different data sources use different timezones, causing misalignment.

Examples:

Event timestamps in UTC vs. user-facing dates in local time
Daily rollups that use different cutoff times

How to prevent: Standardize all timestamps to a single timezone (UTC recommended) before analysis. Document the timezone used.

Selection Bias in Segmentation

The problem: Segments are defined by the outcome you're measuring, creating circular logic.

Examples:

"Users who completed onboarding have higher retention" -- obviously, they self-selected
"Power users generate more revenue" -- they became power users BY generating revenue

How to prevent: Define segments based on pre-treatment characteristics, not outcomes.

Result Sanity Checking

Magnitude Checks

For any key number in your analysis, verify it passes the "smell test":

Cross-Validation Techniques

Calculate the same metric two different ways and verify they match
Spot-check individual records -- pick a few specific entities and trace their data manually
Compare to known benchmarks -- match against published dashboards, finance reports, or prior analyses
Reverse engineer -- if total revenue is X, does per-user revenue times user count approximately equal X?
Boundary checks -- what happens when you filter to a single day, a single user, or a single category? Are those micro-results sensible?

Red Flags That Warrant Investigation

Any metric that changed by more than 50% period-over-period without an obvious cause
Counts or sums that are exact round numbers (suggests a filter or default value issue)
Rates exactly at 0% or 100% (may indicate incomplete data)
Results that perfectly confirm the hypothesis (reality is usually messier)
Identical values across time periods or segments (suggests the query is ignoring a dimension)

Documentation Standards for Reproducibility

Analysis Documentation Template

Every non-trivial analysis should include:

## Analysis: [Title]

### Question
[The specific question being answered]

### Data Sources
- Table: [schema.table_name] (as of [date])
- Table: [schema.other_table] (as of [date])
- File: [filename] (source: [where it came from])

### Definitions
- [Metric A]: [Exactly how it's calculated]
- [Segment X]: [Exactly how membership is determined]
- [Time period]: [Start date] to [end date], [timezone]

### Methodology
1. [Step 1 of the analysis approach]
2. [Step 2]
3. [Step 3]

### Assumptions and Limitations
- [Assumption 1 and why it's reasonable]
- [Limitation 1 and its potential impact on conclusions]

### Key Findings
1. [Finding 1 with supporting evidence]
2. [Finding 2 with supporting evidence]

### SQL Queries
[All queries used, with comments]

### Caveats
- [Things the reader should know before acting on this]

Code Documentation

For any code (SQL, Python) that may be reused:

"""
Analysis: Monthly Cohort Retention
Author: [Name]
Date: [Date]
Data Source: events table, users table
Last Validated: [Date] -- results matched dashboard within 2%

Purpose:
    Calculate monthly user retention cohorts based on first activity date.

Assumptions:
    - "Active" means at least one event in the month
    - Excludes test/internal accounts (user_type != 'internal')
    - Uses UTC dates throughout

Output:
    Cohort retention matrix with cohort_month rows and months_since_signup columns.
    Values are retention rates (0-100%).
"""

Version Control for Analyses

Save queries and code in version control (git) or a shared docs system
Note the date of the data snapshot used
If an analysis is re-run with updated data, document what changed and why
Link to prior versions of recurring analyses for trend comparison

Related Skills

harsh040506/single-cell-rna-qc

testing

VerifiedTrustedCommunity

Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations. Use when users request QC analysis, filtering low-quality cells, assessing data quality, or following scverse/scanpy best practices for single-cell analysis.

2SKILL.mdUpdated Apr 5, 2026

harsh040506/single-cell-rna-qc

harsh040506/scvi-tools

tools

VerifiedTrustedCommunity

Deep learning for single-cell analysis using scvi-tools. This skill should be used when users need (1) data integration and batch correction with scVI/scANVI, (2) ATAC-seq analysis with PeakVI, (3) CITE-seq multi-modal analysis with totalVI, (4) multiome RNA+ATAC analysis with MultiVI, (5) spatial transcriptomics deconvolution with DestVI, (6) label transfer and reference mapping with scANVI/scArches, (7) RNA velocity with veloVI, or (8) any deep learning-based single-cell method. Triggers include mentions of scVI, scANVI, totalVI, PeakVI, MultiVI, DestVI, veloVI, sysVI, scArches, variational autoencoder, VAE, batch correction, data integration, multi-modal, CITE-seq, multiome, reference mapping, latent space.

2SKILL.mdUpdated Apr 5, 2026

harsh040506/scvi-tools

harsh040506/scientific-problem-selection

testing

VerifiedTrustedCommunity

This skill should be used when scientists need help with research problem selection, project ideation, troubleshooting stuck projects, or strategic scientific decisions. Use this skill when users ask to pitch a new research idea, work through a project problem, evaluate project risks, plan research strategy, navigate decision trees, or get help choosing what scientific problem to work on. Typical requests include "I have an idea for a project", "I'm stuck on my research", "help me evaluate this project", "what should I work on", or "I need strategic advice about my research".

2SKILL.mdUpdated Apr 5, 2026

harsh040506/scientific-problem-selection

harsh040506/nextflow-development

development

VerifiedTrustedCommunity

Run nf-core bioinformatics pipelines (rnaseq, sarek, atacseq) on sequencing data. Use when analyzing RNA-seq, WGS/WES, or ATAC-seq data—either local FASTQs or public datasets from GEO/SRA. Triggers on nf-core, Nextflow, FASTQ analysis, variant calling, gene expression, differential expression, GEO reanalysis, GSE/GSM/SRR accessions, or samplesheet creation.

2SKILL.mdUpdated Apr 5, 2026

harsh040506/nextflow-development

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/harsh040506/claude-code-unified-skill-plugin-library.git

# Copy into Claude Code skills folder (global)
cp -r claude-code-unified-skill-plugin-library/business/data-analytics/data-validation ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

harsh040506/claude-code-unified-skill-plugin-library

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT