skills/variant-analysis/SKILL.md
ToolUniverse workflow — Variant Analysis
Production-ready VCF processing and variant annotation skill combining local bioinformatics computation with ToolUniverse database integration. Designed to answer bioinformatics analysis questions about VCF data, mutation classification, variant filtering, and clinical annotation.
Triggers:
Example Questions:
| Capability | Description |
|-----------|-------------|
| VCF Parsing | Pure Python + cyvcf2 parsers. VCF 4.x, gzipped, multi-sample, SNV/indel/SV |
| Mutation Classification | Maps SO terms, SnpEff ANN, VEP CSQ, GATK Funcotator to standard types |
| VAF Extraction | Handles AF, AD, AO/RO, NR/NV, INFO AF formats |
| Filtering | VAF, depth, quality, PASS, variant type, mutation type, consequence, chromosome, SV size |
| Statistics | Ti/Tv ratio, per-sample VAF/depth stats, mutation type distribution, SV size distribution |
| Annotation | MyVariant.info (aggregates ClinVar, dbSNP, gnomAD, CADD, SIFT, PolyPhen) |
| SV/CNV Analysis | gnomAD SV population frequencies, DGVa/dbVar known SVs, ClinGen dosage sensitivity |
| Clinical Interpretation | ACMG/ClinGen CNV pathogenicity classification using haploinsufficiency/triplosensitivity scores |
| DataFrame | Convert to pandas for advanced analytics |
| Reporting | Markdown reports with tables and statistics, SV clinical reports |
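The VAF Extraction row lists several FORMAT conventions. The sketch below shows one way a per-sample VAF could be derived from them, assuming the usual caller conventions: AD as comma-separated ref,alt depths, AO/RO as alt/ref observation counts, and NR/NV as total reads and variant reads. The function name `extract_vaf` and the lookup order are assumptions made for illustration, not the skill's actual implementation.

```python
from typing import Optional


def extract_vaf(fmt: dict) -> Optional[float]:
    """Return the VAF of the first ALT allele for one sample, or None.

    `fmt` maps FORMAT keys to raw string values. The lookup order
    (AF, AD, AO/RO, NV/NR) is an assumption of this sketch.
    """

    def first_number(value: str) -> Optional[float]:
        # Multi-allelic fields are comma-separated; use the first entry.
        try:
            return float(value.split(",")[0])
        except ValueError:
            return None  # "." or another missing-value marker

    # 1. Explicit per-sample allele fraction.
    if "AF" in fmt:
        af = first_number(fmt["AF"])
        if af is not None:
            return af

    # 2. AD = "ref_depth,alt_depth[,...]"; only the first ALT is considered.
    if "AD" in fmt:
        parts = fmt["AD"].split(",")
        if len(parts) >= 2 and all(p.isdigit() for p in parts[:2]):
            ref, alt = int(parts[0]), int(parts[1])
            return alt / (ref + alt) if ref + alt > 0 else None

    # 3. AO/RO = alternate / reference observation counts.
    if "AO" in fmt and "RO" in fmt:
        alt, ref = first_number(fmt["AO"]), first_number(fmt["RO"])
        if alt is not None and ref is not None and alt + ref > 0:
            return alt / (alt + ref)

    # 4. NV/NR = variant reads / total reads.
    if "NV" in fmt and "NR" in fmt:
        nv, nr = first_number(fmt["NV"]), first_number(fmt["NR"])
        if nv is not None and nr:
            return nv / nr

    return None
```

For example, `extract_vaf({"AD": "30,10"})` divides 10 alternate reads by 40 total reads. Records with no recognized field return `None` rather than a guessed value.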
Input VCF File (SNVs/indels or SVs)
|
v
Phase 1: Parse VCF
|-- Pure Python parser (any VCF 4.x)
|-- cyvcf2 parser (faster, C-based)
|-- Extract: CHROM, POS, REF, ALT, QUAL, FILTER, INFO, FORMAT, samples
|-- Extract per-sample: GT, VAF, depth
|-- Extract annotations from INFO (ANN, CSQ, FUNCOTATION)
|-- Detect variant class: SNV/indel vs SV/CNV
|
v
Phase 2: Classify Variants
|-- Variant type: SNV, INS, DEL, MNV, COMPLEX, SV
|-- Mutation type: missense, nonsense, synonymous, frameshift, splice, etc.
|-- Impact: HIGH, MODERATE, LOW, MODIFIER
|-- SV type: DEL, DUP, INV, BND, CNV (if structural variant)
|
v
Phase 3: Apply Filters
|-- VAF range (min/max)
|-- Read depth minimum
|-- Quality threshold
|-- PASS only
|-- Variant/mutation type inclusion/exclusion
|-- Consequence exclusion (intronic, intergenic)
|-- Population frequency range
|-- Chromosome selection
|-- SV size range (for structural variants)
|
v
Phase 4: Compute Statistics
|-- Variant type distribution
|-- Mutation type distribution
|-- Impact distribution
|-- Chromosome distribution
|-- Ti/Tv ratio (for SNVs)
|-- Per-sample VAF/depth stats
|-- Gene mutation counts
|-- SV size distribution (for structural variants)
|
v
Phase 5: Annotate with ToolUniverse (optional)
|-- MyVariant.info: ClinVar, dbSNP, gnomAD, CADD, SIFT, PolyPhen
|-- dbSNP: Population frequencies, gene associations
|-- gnomAD: Population allele frequencies
|-- Ensembl VEP: Consequence prediction
|
v
Phase 6: Generate Report / Answer Question
|-- Markdown report with tables
|-- Direct answer to specific question
|-- DataFrame for downstream analysis
|
v
Phase 7: Structural Variant & CNV Analysis (if SV/CNV detected)
|-- Annotate with gnomAD SV population frequencies
|-- Query DGVa/dbVar for known SVs (Ensembl)
|-- Identify affected genes
|-- Query ClinGen dosage sensitivity (HI/TS scores)
|-- Classify pathogenicity (Pathogenic/Likely Pathogenic/VUS/Benign)
|-- Generate SV clinical report with ACMG/ClinGen guidelines
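Phase 4 lists the Ti/Tv ratio among the SNV statistics. As a minimal sketch of that calculation, transitions are purine-to-purine or pyrimidine-to-pyrimidine changes (A<->G, C<->T) and every other single-base change is a transversion. The function name `titv_ratio` and the input shape (pairs of REF and ALT strings) are assumptions for illustration.

```python
from typing import Iterable, Optional, Tuple

# Transitions: purine <-> purine (A, G) or pyrimidine <-> pyrimidine (C, T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def titv_ratio(snvs: Iterable[Tuple[str, str]]) -> Optional[float]:
    """Compute transitions / transversions over (ref, alt) pairs.

    Anything that is not a single-base substitution (indels, MNVs,
    symbolic alleles) is skipped. Returns None when no transversion
    is observed, since the ratio is undefined in that case.
    """
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None
```

Returning `None` instead of infinity for a set without transversions keeps the value easy to report in a table.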
Use pandas for:
Use python_implementation tools for:
Key functions:
vcf_data = parse_vcf("input.vcf") # Pure Python (always works)
vcf_data = parse_vcf_cyvcf2("input.vcf") # Fast C-based (if installed)
df = variants_to_dataframe(vcf_data.variants, sample="TUMOR") # For pandas
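To make the pure-Python path concrete, here is a rough sketch of what reading VCF records without external dependencies can look like. It is not the skill's `parse_vcf`; the generator name `iter_vcf_records` and the dictionary layout it yields are hypothetical, and it handles only the fixed columns, INFO key/value pairs, and per-sample FORMAT fields.

```python
import gzip
from typing import Iterator


def iter_vcf_records(path: str) -> Iterator[dict]:
    """Yield one dict per data line of a plain or gzipped VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    sample_names = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("##"):
                continue  # blank line or meta-information header
            fields = line.split("\t")
            if line.startswith("#CHROM"):
                sample_names = fields[9:]  # columns after FORMAT
                continue

            # INFO: "KEY=VALUE" pairs and bare flags separated by ";".
            info = {}
            if fields[7] != ".":
                for entry in fields[7].split(";"):
                    key, _, value = entry.partition("=")
                    info[key] = value if value else True

            record = {
                "chrom": fields[0],
                "pos": int(fields[1]),
                "id": fields[2],
                "ref": fields[3],
                "alts": fields[4].split(","),
                "qual": None if fields[5] == "." else float(fields[5]),
                "filter": fields[6],
                "info": info,
                "samples": {},
            }

            # Per-sample values are keyed by the FORMAT column.
            if len(fields) > 9:
                keys = fields[8].split(":")
                for name, column in zip(sample_names, fields[9:]):
                    record["samples"][name] = dict(zip(keys, column.split(":")))
            yield record
```

A generator keeps memory use flat on large files; callers that need random access can wrap it in `list(...)`.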
Automatic classification from annotations:
Mutation types supported: missense, nonsense, synonymous, frameshift, splice_site, splice_region, inframe_insertion, inframe_deletion, intronic, intergenic, UTR_5, UTR_3, upstream, downstream, stop_lost, start_lost
See references/mutation_classification_guide.md for full details
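As an illustration of annotation-driven classification, the sketch below maps Sequence Ontology terms onto the mutation types listed above and reads the consequence and impact from a SnpEff-style `ANN` value (pipe-delimited, with the annotation in the second field and the impact in the third). The mapping table, the function names, and the choice to use the first transcript and the first recognized term are assumptions of this sketch; the referenced guide describes the skill's actual rules.

```python
from typing import Tuple

# SO term -> mutation type label used in this document.
SO_TO_MUTATION_TYPE = {
    "missense_variant": "missense",
    "stop_gained": "nonsense",
    "synonymous_variant": "synonymous",
    "frameshift_variant": "frameshift",
    "splice_acceptor_variant": "splice_site",
    "splice_donor_variant": "splice_site",
    "splice_region_variant": "splice_region",
    "inframe_insertion": "inframe_insertion",
    "inframe_deletion": "inframe_deletion",
    "intron_variant": "intronic",
    "intergenic_variant": "intergenic",
    "5_prime_UTR_variant": "UTR_5",
    "3_prime_UTR_variant": "UTR_3",
    "upstream_gene_variant": "upstream",
    "downstream_gene_variant": "downstream",
    "stop_lost": "stop_lost",
    "start_lost": "start_lost",
}


def classify_so_terms(consequence: str) -> str:
    """Map an '&'-joined SO consequence string to a mutation type."""
    for term in consequence.split("&"):
        if term in SO_TO_MUTATION_TYPE:
            return SO_TO_MUTATION_TYPE[term]
    return "other"


def classify_snpeff_ann(ann: str) -> Tuple[str, str]:
    """Return (mutation_type, impact) from the first ANN entry."""
    fields = ann.split(",")[0].split("|")
    if len(fields) < 3:
        return "other", "MODIFIER"  # malformed entry; fallback is an assumption
    return classify_so_terms(fields[1]), fields[2]
```

Using only the first ANN entry is a simplification; a fuller implementation would likely rank transcripts or pick the most severe consequence.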
Common filtering patterns:
# Somatic-like variants
criteria = FilterCriteria(
min_vaf=0.05, max_vaf=0.95,
min_depth=20, pass_only=True,
exclude_consequences=["intronic", "intergenic", "upstream", "downstream"]
)
# High-confidence germline
criteria = FilterCriteria(
min_vaf=0.25, min_depth=30, pass_only=True,
chromosomes=[str(i) for i in range(1, 23)] + ["X", "Y"]  # autosomes 1-22 plus X and Y
)
# Rare pathogenic candidates
criteria = FilterCriteria(
min_depth=20, pass_only=True,
mutation_types=["missense", "nonsense", "frameshift"]
)
See references/vcf_filtering.md for all filter options
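The patterns above combine independent thresholds into one criteria object. As a rough, self-contained sketch of how such criteria might be applied, the code below uses a stand-in dataclass named `FilterCriteriaSketch` and a helper `filter_variants_sketch`; both names, the variant dictionary keys, and the rule that a missing VAF or depth fails a threshold are assumptions, not the skill's own `FilterCriteria` behavior.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class FilterCriteriaSketch:
    """Hypothetical stand-in mirroring the parameter names used above."""
    min_vaf: Optional[float] = None
    max_vaf: Optional[float] = None
    min_depth: Optional[int] = None
    pass_only: bool = False
    mutation_types: Optional[Sequence[str]] = None
    exclude_consequences: Optional[Sequence[str]] = None


def passes(variant: dict, c: FilterCriteriaSketch) -> bool:
    """Return True when the variant satisfies every active criterion."""
    vaf, depth = variant.get("vaf"), variant.get("depth")
    mutation_type = variant.get("mutation_type")
    # Assumption: a missing value fails any threshold that needs it.
    if c.min_vaf is not None and (vaf is None or vaf < c.min_vaf):
        return False
    if c.max_vaf is not None and (vaf is None or vaf > c.max_vaf):
        return False
    if c.min_depth is not None and (depth is None or depth < c.min_depth):
        return False
    # Assumption: only an explicit "PASS" counts, not an unfiltered ".".
    if c.pass_only and variant.get("filter") != "PASS":
        return False
    if c.mutation_types is not None and mutation_type not in c.mutation_types:
        return False
    if c.exclude_consequences and mutation_type in c.exclude_consequences:
        return False
    return True


def filter_variants_sketch(
    variants: Sequence[dict], criteria: FilterCriteriaSketch
) -> Tuple[List[dict], List[dict]]:
    """Split variants into (passing, failing) lists."""
    passing = [v for v in variants if passes(v, criteria)]
    failing = [v for v in variants if not passes(v, criteria)]
    return passing, failing
```

Returning both lists makes it easy to report how many variants each filter set removed.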
Use pandas for:
Use python_implementation for:
When to use ToolUniverse annotation tools:
Best practices:
Key tools:
MyVariant_query_variants: Batch annotation (ClinVar, dbSNP, gnomAD, CADD)
dbsnp_get_variant_by_rsid: Population frequencies
gnomad_get_variant: Basic variant metadata
EnsemblVEP_annotate_rsid: Consequence prediction
See references/annotation_guide.md for detailed examples
Report includes:
When VCF contains SV calls (SVTYPE=DEL/DUP/INV/BND):
clingen = ClinGen_dosage_by_gene(gene_symbol="BRCA1")
# Returns: haploinsufficiency_score, triplosensitivity_score
gnomad_sv = gnomad_get_sv_by_gene(gene_symbol="BRCA1")
# Returns: SVs with AF, AC, AN
ClinGen dosage score interpretation:
See references/sv_cnv_analysis.md for full SV workflow
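For orientation, the sketch below encodes the score scale that ClinGen publishes for both haploinsufficiency and triplosensitivity, as generally documented by ClinGen. Whether `ClinGen_dosage_by_gene` returns these values as integers or strings is not stated here, so the helper coerces its input; the names `CLINGEN_DOSAGE_SCORES`, `describe_dosage_score`, and `has_sufficient_evidence` are hypothetical.

```python
from typing import Union

# ClinGen dosage sensitivity score scale (applies to HI and TS scores).
CLINGEN_DOSAGE_SCORES = {
    0: "No evidence available",
    1: "Little evidence for dosage pathogenicity",
    2: "Some evidence for dosage pathogenicity",
    3: "Sufficient evidence for dosage pathogenicity",
    30: "Gene associated with autosomal recessive phenotype",
    40: "Dosage sensitivity unlikely",
}


def describe_dosage_score(score: Union[int, str, None]) -> str:
    """Translate a raw HI/TS score into its ClinGen description."""
    try:
        return CLINGEN_DOSAGE_SCORES.get(int(score), "Unrecognized score")
    except (TypeError, ValueError):
        return "Not yet evaluated"  # assumption: missing means uncurated


def has_sufficient_evidence(score: Union[int, str, None]) -> bool:
    """True only for score 3, the level usually treated as established."""
    try:
        return int(score) == 3
    except (TypeError, ValueError):
        return False
```

A deletion overlapping a gene with an HI score of 3, or a duplication overlapping a gene with a TS score of 3, is the kind of evidence that typically weighs toward a pathogenic CNV call; the full classification logic is in the referenced SV workflow.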
Question: "What fraction of variants with VAF < X are annotated as Y mutations?"
result = answer_vaf_mutation_fraction(
vcf_path="input.vcf",
max_vaf=0.3,
mutation_type="missense",
sample="TUMOR"
)
# Returns: fraction, total_below_vaf, matching_mutation_type
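The arithmetic behind this kind of question is a conditional fraction: restrict to variants below the VAF cutoff, then count how many carry the requested mutation type. A minimal sketch follows, assuming variants are dictionaries with `vaf` and `mutation_type` keys; the function name `vaf_mutation_fraction` is hypothetical, and only the returned key names are taken from the comment above.

```python
from typing import Optional, Sequence


def vaf_mutation_fraction(
    variants: Sequence[dict], max_vaf: float, mutation_type: str
) -> dict:
    """Fraction of variants with VAF < max_vaf that have mutation_type."""
    # Variants without a VAF cannot be placed relative to the cutoff.
    below = [
        v for v in variants
        if v.get("vaf") is not None and v["vaf"] < max_vaf
    ]
    matching = [v for v in below if v.get("mutation_type") == mutation_type]
    fraction: Optional[float] = len(matching) / len(below) if below else None
    return {
        "fraction": fraction,
        "total_below_vaf": len(below),
        "matching_mutation_type": len(matching),
    }
```

Note that the denominator is the number of variants below the cutoff, not the total number of variants in the file; mixing the two is a common source of wrong answers.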
Question: "What is the difference in mutation frequency between cohorts?"
result = answer_cohort_comparison(
vcf_paths=["cohort1.vcf", "cohort2.vcf"],
mutation_type="missense",
cohort_names=["Treatment", "Control"]
)
# Returns: cohorts, frequency_difference
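"Mutation frequency" is not defined in the text above, so the sketch below assumes it means the share of a cohort's variants that carry the requested mutation type, and reports the first cohort minus the second. The function name `cohort_frequency_difference` and the input shape (cohort name mapped to a list of variant dictionaries) are hypothetical.

```python
from typing import Dict, Optional, Sequence


def cohort_frequency_difference(
    cohorts: Dict[str, Sequence[dict]], mutation_type: str
) -> dict:
    """Compare the share of `mutation_type` variants between two cohorts."""
    if len(cohorts) != 2:
        raise ValueError("exactly two cohorts are expected")

    summary = {}
    for name, variants in cohorts.items():
        total = len(variants)
        matching = sum(
            1 for v in variants if v.get("mutation_type") == mutation_type
        )
        summary[name] = {
            "total": total,
            "matching": matching,
            "frequency": matching / total if total else None,
        }

    # Dicts preserve insertion order, so this is first minus second.
    first, second = (summary[name]["frequency"] for name in summary)
    difference: Optional[float] = (
        first - second if first is not None and second is not None else None
    )
    return {"cohorts": summary, "frequency_difference": difference}
```

If the question instead asks for mutations per sample or per megabase, the denominator changes but the structure of the comparison stays the same.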
Question: "After filtering X, how many Y remain?"
result = answer_non_reference_after_filter(
vcf_path="input.vcf",
exclude_intronic_intergenic=True
)
# Returns: total_input, non_reference, remaining
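A sketch of the counting steps behind this pattern: first keep variants whose genotype contains at least one non-reference allele, then drop intronic and intergenic consequences. The helper names, the variant dictionary keys (`gt`, `mutation_type`), and the order of the two steps are assumptions; only the returned key names come from the comment above.

```python
import re
from typing import Sequence


def is_non_reference(gt: str) -> bool:
    """True if a GT string such as '0/1' or '1|1' has a non-ref allele."""
    alleles = re.split(r"[/|]", gt)
    # "0" is the reference allele; "." and "" mark missing calls.
    return any(allele not in ("0", ".", "") for allele in alleles)


def non_reference_after_filter(
    variants: Sequence[dict], exclude_intronic_intergenic: bool = True
) -> dict:
    """Count non-reference variants before and after consequence filtering."""
    non_ref = [v for v in variants if is_non_reference(v.get("gt", "."))]
    if exclude_intronic_intergenic:
        remaining = [
            v for v in non_ref
            if v.get("mutation_type") not in ("intronic", "intergenic")
        ]
    else:
        remaining = non_ref
    return {
        "total_input": len(variants),
        "non_reference": len(non_ref),
        "remaining": len(remaining),
    }
```

Homozygous-reference (`0/0`) and missing (`./.`) genotypes are both treated as not carrying the variant in this sketch.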
| Tool | When to Use | Parameters | Response |
|------|------------|------------|----------|
| MyVariant_query_variants | Batch annotation | query (rsID/HGVS) | ClinVar, dbSNP, gnomAD, CADD |
| dbsnp_get_variant_by_rsid | Population frequencies | rsid | Frequencies, clinical significance |
| gnomad_get_variant | gnomAD metadata | variant_id (CHR-POS-REF-ALT) | Basic variant info |
| EnsemblVEP_annotate_rsid | Consequence prediction | variant_id (rsID) | Transcript impact |
| Tool | When to Use | Parameters | Response |
|------|------------|------------|----------|
| gnomad_get_sv_by_gene | SV population frequency | gene_symbol | SVs with AF, AC, AN |
| gnomad_get_sv_by_region | Regional SV search | chrom, start, end | SVs in region |
| ClinGen_dosage_by_gene | Dosage sensitivity | gene_symbol | HI/TS scores, disease |
| ClinGen_dosage_region_search | Dosage-sensitive genes in region | chromosome, start, end | All genes with HI/TS scores |
| ensembl_get_structural_variants | Known SVs from DGVa/dbVar | chrom, start, end, species | Clinical significance |
See references/annotation_guide.md for detailed tool usage examples
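The tables give `CHR-POS-REF-ALT` as the `variant_id` format for `gnomad_get_variant`. A small sketch of building such identifiers from VCF fields follows, one per ALT allele. Stripping a leading `chr` prefix follows the convention used in gnomAD variant IDs, but whether this particular tool requires it is an assumption; the function name `gnomad_variant_ids` is hypothetical.

```python
from typing import List, Sequence


def gnomad_variant_ids(
    chrom: str, pos: int, ref: str, alts: Sequence[str]
) -> List[str]:
    """Build one CHR-POS-REF-ALT identifier per concrete ALT allele."""
    # Assumption: identifiers use "1", not "chr1".
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    ids = []
    for alt in alts:
        # Skip symbolic (<DEL>), missing (.), and spanning-deletion (*) alleles.
        if alt.startswith("<") or alt in (".", "*"):
            continue
        ids.append(f"{chrom}-{pos}-{ref}-{alt}")
    return ids
```

Multi-allelic records yield several identifiers, so each ALT allele can be queried separately.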
Parse VCF, compute statistics, generate report.
report = variant_analysis_pipeline("input.vcf", output_file="report.md")
Parse VCF, apply multi-criteria filter, compute statistics on filtered set.
report = variant_analysis_pipeline(
vcf_path="input.vcf",
filters=FilterCriteria(min_vaf=0.1, min_depth=20, pass_only=True),
output_file="filtered_report.md"
)
Parse VCF, annotate top variants with ClinVar/gnomAD/CADD, generate clinical report.
report = variant_analysis_pipeline(
vcf_path="input.vcf",
annotate=True,
max_annotate=50,
output_file="annotated_report.md"
)
Parse VCF, apply specific filters, compute targeted statistics to answer precise questions.
result = answer_vaf_mutation_fraction(
vcf_path="input.vcf",
max_vaf=0.3,
mutation_type="missense"
)
Parse multiple VCFs, compare mutation frequencies across cohorts.
result = answer_cohort_comparison(
vcf_paths=["cohort1.vcf", "cohort2.vcf"],
mutation_type="missense"
)
Use pandas when:
Use python_implementation when:
Best approach: Use python_implementation for parsing/classification, then convert to DataFrame for custom analysis:
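# Parse and classify
vcf_data = parse_vcf("input.vcf")
# Define the filter criteria before applying them (example thresholds)
criteria = FilterCriteria(min_depth=20, pass_only=True)
passing, failing = filter_variants(vcf_data.variants, criteria)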
# Convert to DataFrame for custom analysis
df = variants_to_dataframe(passing, sample="TUMOR")
# Now use pandas
missense_high_vaf = df[(df['mutation_type'] == 'missense') & (df['vaf'] >= 0.3)]
See QUICK_START.md for: