skills/domains/biomedical/genomics-analysis-guide/SKILL.md
Workflows for RNA-seq, GWAS, and variant calling in genomic research
npx skillsauth add wentorai/research-plugins genomics-analysis-guideInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Genomic data analysis is the computational backbone of modern molecular biology. From identifying disease-associated variants through Genome-Wide Association Studies (GWAS) to quantifying gene expression with RNA-seq, these workflows transform raw sequencing data into biological insights that drive discoveries in medicine, agriculture, and evolutionary biology.
This guide covers the three most common genomic analysis workflows: RNA-seq differential expression analysis, GWAS for variant-trait associations, and variant calling from whole-genome sequencing (WGS) data. Each workflow is described with tool recommendations, command-line examples, and downstream analysis steps in R and Python.
The emphasis is on reproducibility and best practices. Genomic analyses involve many sequential steps, and errors in early stages propagate through the entire pipeline. Following standardized workflows -- like those from the Broad Institute, ENCODE, and Bioconductor -- reduces the risk of methodological errors.
Raw FASTQ files
|
v
[Quality Control] --> FastQC, MultiQC
|
v
[Trimming] --> Trimmomatic, fastp
|
v
[Alignment] --> STAR, HISAT2
|
v
[Quantification] --> featureCounts, Salmon
|
v
[Differential Expression] --> DESeq2, edgeR
|
v
[Pathway Analysis] --> clusterProfiler, GSEA
# Run FastQC on all FASTQ files
fastqc -t 8 -o qc_results/ raw_data/*.fastq.gz
# Aggregate QC reports
multiqc qc_results/ -o multiqc_report/
# fastp for quality trimming and adapter removal
fastp \
--in1 sample_R1.fastq.gz \
--in2 sample_R2.fastq.gz \
--out1 trimmed_R1.fastq.gz \
--out2 trimmed_R2.fastq.gz \
--detect_adapter_for_pe \
--thread 8 \
--html fastp_report.html
# Build genome index (one time)
STAR --runMode genomeGenerate \
--genomeDir star_index/ \
--genomeFastaFiles genome.fa \
--sjdbGTFfile annotations.gtf \
--runThreadN 16
# Align reads
STAR --runMode alignReads \
--genomeDir star_index/ \
--readFilesIn trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--outFileNamePrefix sample_ \
--runThreadN 16
library(DESeq2)
# Load count matrix and sample info
counts <- read.csv("gene_counts.csv", row.names = 1)
coldata <- read.csv("sample_info.csv", row.names = 1)
# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = coldata,
design = ~ condition
)
# Filter low-count genes
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]
# Run differential expression
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "treated", "control"),
alpha = 0.05)
# Summary
summary(res)
# Export significant genes
sig_genes <- subset(as.data.frame(res), padj < 0.05 & abs(log2FoldChange) > 1)
write.csv(sig_genes, "significant_genes.csv")
Genotype Data (VCF/PLINK)
|
v
[Quality Control] --> Sample/variant filtering
|
v
[Population Stratification] --> PCA
|
v
[Association Testing] --> PLINK2, REGENIE
|
v
[Multiple Testing Correction] --> Bonferroni, FDR
|
v
[Visualization] --> Manhattan plot, QQ plot
# Sample QC
plink2 \
--bfile dataset \
--mind 0.05 \ # Remove samples with >5% missing
--geno 0.02 \ # Remove variants with >2% missing
--maf 0.01 \ # Remove rare variants (MAF < 1%)
--hwe 1e-6 \ # HWE filter
--make-bed \
--out dataset_qc
# LD pruning for PCA
plink2 \
--bfile dataset_qc \
--indep-pairwise 50 5 0.2 \
--out pruned
# PCA for population stratification
plink2 \
--bfile dataset_qc \
--extract pruned.prune.in \
--pca 10 \
--out pca_results
# Linear/logistic regression with covariates
plink2 \
--bfile dataset_qc \
--glm \
--pheno phenotypes.txt \
--covar pca_results.eigenvec \
--covar-col-nums 3-12 \
--out gwas_results
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
def manhattan_plot(gwas_file, output='manhattan.pdf'):
df = pd.read_csv(gwas_file, sep='\t')
df['-log10p'] = -np.log10(df['P'])
# Assign cumulative positions
df = df.sort_values(['CHR', 'BP'])
df['pos_cum'] = 0
offset = 0
for chrom in df['CHR'].unique():
mask = df['CHR'] == chrom
df.loc[mask, 'pos_cum'] = df.loc[mask, 'BP'] + offset
offset = df.loc[mask, 'pos_cum'].max()
fig, ax = plt.subplots(figsize=(16, 5))
colors = ['#3B82F6', '#94A3B8']
for i, chrom in enumerate(df['CHR'].unique()):
subset = df[df['CHR'] == chrom]
ax.scatter(subset['pos_cum'], subset['-log10p'],
s=2, color=colors[i % 2], alpha=0.7)
ax.axhline(-np.log10(5e-8), color='red', linestyle='--', linewidth=0.8)
ax.set_xlabel('Chromosome')
ax.set_ylabel('-log10(p-value)')
fig.savefig(output, dpi=300, bbox_inches='tight')
# Mark duplicates
gatk MarkDuplicates \
-I aligned.bam \
-O dedup.bam \
-M metrics.txt
# Base quality score recalibration
gatk BaseRecalibrator \
-I dedup.bam \
-R reference.fa \
--known-sites dbsnp.vcf \
-O recal_table.txt
gatk ApplyBQSR \
-I dedup.bam \
-R reference.fa \
--bqsr-recal-file recal_table.txt \
-O recal.bam
# Call variants
gatk HaplotypeCaller \
-I recal.bam \
-R reference.fa \
-O variants.g.vcf \
-ERC GVCF
tools
10 document processing skills. Trigger: extracting text from PDFs, parsing references, document Q&A. Design: parsing pipelines (GROBID, marker) and structured extraction tools.
documentation
Guide to tldraw for infinite canvas whiteboarding and diagram creation
testing
Create graphical abstracts, schematic diagrams, and scientific illustrations
documentation
Create UML diagrams and architecture visualizations with PlantUML