business/data-analytics/data-validation/SKILL.md
QA an analysis before sharing with stakeholders — methodology checks, accuracy verification, and bias detection. Use when reviewing an analysis for errors, checking for survivorship bias, validating aggregation logic, or preparing documentation for reproducibility.
npx skillsauth add harsh040506/claude-code-unified-skill-plugin-library data-validationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Pre-delivery QA checklist, common data analysis pitfalls, result sanity checking, and documentation standards for reproducibility.
Run through this checklist before sharing any analysis with stakeholders.
The problem: A many-to-many join silently multiplies rows, inflating counts and sums.
How to detect:
-- Check row count before and after join
SELECT COUNT(*) FROM table_a; -- 1,000
SELECT COUNT(*) FROM table_a a JOIN table_b b ON a.id = b.a_id; -- 3,500 (uh oh)
How to prevent:
COUNT(DISTINCT a.id) instead of COUNT(*) when counting entities through joinsThe problem: Analyzing only entities that exist today, ignoring those that were deleted, churned, or failed.
Examples:
How to prevent: Ask "who is NOT in this dataset?" before drawing conclusions.
The problem: Comparing a partial period to a full period.
Examples:
How to prevent: Always filter to complete periods, or compare same-day-of-month / same-number-of-days.
The problem: The denominator changes between periods, making rates incomparable.
Examples:
How to prevent: Use consistent definitions across all compared periods. Note any definition changes.
The problem: Averaging pre-computed averages gives wrong results when group sizes differ.
Example:
How to prevent: Always aggregate from raw data. Never average pre-aggregated averages.
The problem: Different data sources use different timezones, causing misalignment.
Examples:
How to prevent: Standardize all timestamps to a single timezone (UTC recommended) before analysis. Document the timezone used.
The problem: Segments are defined by the outcome you're measuring, creating circular logic.
Examples:
How to prevent: Define segments based on pre-treatment characteristics, not outcomes.
For any key number in your analysis, verify it passes the "smell test":
| Metric Type | Sanity Check | |---|---| | User counts | Does this match known MAU/DAU figures? | | Revenue | Is this in the right order of magnitude vs. known ARR? | | Conversion rates | Is this between 0% and 100%? Does it match dashboard figures? | | Growth rates | Is 50%+ MoM growth realistic, or is there a data issue? | | Averages | Is the average reasonable given what you know about the distribution? | | Percentages | Do segment percentages sum to ~100%? |
Every non-trivial analysis should include:
## Analysis: [Title]
### Question
[The specific question being answered]
### Data Sources
- Table: [schema.table_name] (as of [date])
- Table: [schema.other_table] (as of [date])
- File: [filename] (source: [where it came from])
### Definitions
- [Metric A]: [Exactly how it's calculated]
- [Segment X]: [Exactly how membership is determined]
- [Time period]: [Start date] to [end date], [timezone]
### Methodology
1. [Step 1 of the analysis approach]
2. [Step 2]
3. [Step 3]
### Assumptions and Limitations
- [Assumption 1 and why it's reasonable]
- [Limitation 1 and its potential impact on conclusions]
### Key Findings
1. [Finding 1 with supporting evidence]
2. [Finding 2 with supporting evidence]
### SQL Queries
[All queries used, with comments]
### Caveats
- [Things the reader should know before acting on this]
For any code (SQL, Python) that may be reused:
"""
Analysis: Monthly Cohort Retention
Author: [Name]
Date: [Date]
Data Source: events table, users table
Last Validated: [Date] -- results matched dashboard within 2%
Purpose:
Calculate monthly user retention cohorts based on first activity date.
Assumptions:
- "Active" means at least one event in the month
- Excludes test/internal accounts (user_type != 'internal')
- Uses UTC dates throughout
Output:
Cohort retention matrix with cohort_month rows and months_since_signup columns.
Values are retention rates (0-100%).
"""
testing
Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations. Use when users request QC analysis, filtering low-quality cells, assessing data quality, or following scverse/scanpy best practices for single-cell analysis.
tools
Deep learning for single-cell analysis using scvi-tools. This skill should be used when users need (1) data integration and batch correction with scVI/scANVI, (2) ATAC-seq analysis with PeakVI, (3) CITE-seq multi-modal analysis with totalVI, (4) multiome RNA+ATAC analysis with MultiVI, (5) spatial transcriptomics deconvolution with DestVI, (6) label transfer and reference mapping with scANVI/scArches, (7) RNA velocity with veloVI, or (8) any deep learning-based single-cell method. Triggers include mentions of scVI, scANVI, totalVI, PeakVI, MultiVI, DestVI, veloVI, sysVI, scArches, variational autoencoder, VAE, batch correction, data integration, multi-modal, CITE-seq, multiome, reference mapping, latent space.
testing
This skill should be used when scientists need help with research problem selection, project ideation, troubleshooting stuck projects, or strategic scientific decisions. Use this skill when users ask to pitch a new research idea, work through a project problem, evaluate project risks, plan research strategy, navigate decision trees, or get help choosing what scientific problem to work on. Typical requests include "I have an idea for a project", "I'm stuck on my research", "help me evaluate this project", "what should I work on", or "I need strategic advice about my research".
development
Run nf-core bioinformatics pipelines (rnaseq, sarek, atacseq) on sequencing data. Use when analyzing RNA-seq, WGS/WES, or ATAC-seq data—either local FASTQs or public datasets from GEO/SRA. Triggers on nf-core, Nextflow, FASTQ analysis, variant calling, gene expression, differential expression, GEO reanalysis, GSE/GSM/SRR accessions, or samplesheet creation.