skills/bulk-rnaseq/SKILL.md
End-to-end bulk RNA-seq orchestrator — takes raw FASTQ reads through QC and trimming (FastQC, fastp/Trim Galore), alignment and quantification (STAR, Salmon, featureCounts), assembles a gene-level counts matrix, then hands off to differential expression (pydeseq2), pathway/GSEA enrichment (pathway-enrichment), and publication figures (scientific-visualization). Use whenever the user has bulk RNA-seq reads or quant output and wants a complete, reproducible differential-expression workflow — e.g. "analyze my RNA-seq", "FASTQ to DESeq2", "run nf-core/rnaseq", "STAR/Salmon quantification", "build a counts matrix for DESeq2", or "go from reads to differentially expressed genes and enriched pathways". Routes between an nf-core/rnaseq (Nextflow) path and a standalone STAR/Salmon path, and covers experimental design, strandedness, and QC gates. For single-cell RNA-seq use the scanpy skill instead.
npx skillsauth add K-Dense-AI/claude-scientific-skills bulk-rnaseqInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This skill orchestrates a complete, defensible bulk RNA-seq differential-expression study, from raw sequencing reads to enriched pathways and figures. It is a router, not a reimplementation: most stages already have dedicated skills in this repo, and this skill connects them in the right order, fills the one real gap (raw reads → a gene-level counts matrix), and enforces the design and QC decisions that determine whether the final result is trustworthy.
"Defensible" means three things, applied throughout:
The pipeline is: FastQC/trim → align/quant (STAR/Salmon) → counts → DE (pydeseq2) → enrichment (pathway-enrichment) → figures.
Use this skill when the user wants to:
nf-core/rnaseq, or align/quantify with STAR, Salmon, or featureCounts.This is bulk RNA-seq (samples = biological specimens). For single-cell/nuclei data use scanpy; for the DE statistics alone use pydeseq2; for enrichment alone use pathway-enrichment.
flowchart TD
fastq["Raw FASTQ + samplesheet"] --> qc["FastQC + MultiQC"]
qc --> trim["Trim: fastp / Trim Galore"]
trim --> align["Align + quant: STAR and/or Salmon"]
align --> counts["Gene-level counts matrix"]
counts --> de["Differential expression"]
de --> enrich["Pathway / GSEA enrichment"]
de --> fig["Figures"]
enrich --> fig
nfcore["nf-core/rnaseq via nextflow skill"] -.->|"path A"| align
manual["Standalone recipes (this skill)"] -.->|"path B"| align
bridge["build_counts_matrix.py (this skill)"] -.-> counts
pydeseq2skill["pydeseq2 skill"] -.-> de
pwskill["pathway-enrichment skill"] -.-> enrich
vizskill["scientific-visualization skill"] -.-> fig
The reads → counts stage can be run two ways. They produce equivalent gene counts; choose by context, then stay on that path.
| Use Path A — nf-core/rnaseq when… | Use Path B — standalone tools when… |
|------------------------------------------|------------------------------------------|
| You want the field-standard, audited, citable pipeline with one command | You have a few samples and want to learn/inspect each step |
| Many samples, or you'll scale to HPC/cloud | No Nextflow/containers available, or a constrained environment |
| Reproducibility and a full MultiQC report matter most | You need a non-standard step the pipeline doesn't expose |
| → Drive it through the nextflow skill | → Follow references/upstream-manual.md |
When unsure, prefer Path A: nf-core/rnaseq already wires together FastQC → trimming → STAR/Salmon → quantification → tximport → MultiQC with sensible, reviewed defaults, which is the most defensible option. Path B exists for transparency and constrained setups.
Both paths converge on a gene-level counts matrix, after which the workflow is identical.
# This skill's glue (bridge + handoffs) — Python
uv pip install pytximport pandas
# Downstream skills install their own deps:
# pydeseq2 skill -> uv pip install pydeseq2
# pathway-enrichment skill -> uv pip install gseapy gprofiler-official
# Path A (nf-core): only Nextflow + a container engine are needed — see the `nextflow` skill.
# Path B (standalone tools): install via bioconda. Pin versions for reproducibility.
conda create -n rnaseq -c bioconda -c conda-forge \
fastqc fastp trim-galore "star=2.7.11b" "salmon=1.10.3" subread multiqc
Record the exact versions you use (pipeline revision, tool versions, reference genome + annotation release) — they belong in the methods section and make the analysis reproducible.
# 0. Validate the samplesheet first (catches the most common failures early)
python scripts/validate_samplesheet.py --samplesheet samplesheet.csv
# 1. Smoke-test the environment with tiny bundled data
nextflow run nf-core/rnaseq -r 3.26.0 -profile test,docker --outdir test_results
# 2. Real run: pin the revision, pick an aligner, pass a samplesheet + reference
nextflow run nf-core/rnaseq -r 3.26.0 \
-profile docker \
--input samplesheet.csv \
--genome GRCh38 \
--aligner star_salmon \
--outdir results \
-resume
nf-core/rnaseq runs tximport internally, so gene counts come out already merged — no bridge script needed. Use results/star_salmon/salmon.merged.gene_counts_length_scaled.tsv for DE. Samplesheet format, aligner choice, and outputs: references/upstream-nfcore.md. For engine/HPC/cloud/container detail, use the nextflow skill.
fastqc -o qc/ reads/*.fastq.gz # 1. QC raw reads
fastp -i s1_R1.fq.gz -I s1_R2.fq.gz \
-o s1_R1.trim.fq.gz -O s1_R2.trim.fq.gz \
--thread 4 -j s1.fastp.json # 2. Trim adapters/low-quality
salmon quant -i salmon_index -l A \
-1 s1_R1.trim.fq.gz -2 s1_R2.trim.fq.gz \
--gcBias --seqBias -p 8 -o quant/s1 # 3. Quantify (per sample)
Full recipes (FastQC, fastp/Trim Galore, STAR index+align+--quantMode GeneCounts, Salmon decoy-aware index, featureCounts, strandedness): references/upstream-manual.md.
# Path B only: assemble a gene x sample counts matrix + metadata template for PyDESeq2
python scripts/build_counts_matrix.py --from salmon \
--quant-dir quant/ --tx2gene tx2gene.tsv --output-dir counts/
# Then hand off (see the dedicated skills):
# pydeseq2: counts.csv + metadata.csv -> DE table (log2FC, padj, stat)
# pathway-enrichment: rank by `stat` (GSEA) or padj+|LFC| hit list (ORA)
# scientific-visualization / matplotlib: volcano, MA, heatmap, PCA, enrichment dotplot
Work top to bottom. Each stage names the skill or file that owns the detail. Don't skip the design/QC stages — they are where bulk RNA-seq studies most often go wrong.
scripts/validate_samplesheet.py. Rationale and rules: references/design-and-qc.md.references/design-and-qc.md.fastp or Trim Galore). Re-run FastQC to confirm. Recipes: references/upstream-manual.md (Path A does this for you).--quantMode GeneCounts) and/or Salmon (transcript quasi-mapping, decoy-aware). Determine strandedness — it is easy to get wrong and silently halves your counts. Detail: references/upstream-manual.md; pipeline params: references/upstream-nfcore.md.scripts/build_counts_matrix.py). The estimated-count and gene-ID-mapping nuances live in references/counts-and-handoff.md.pydeseq2 skill. Load counts.csv + metadata.csv, set the design (e.g. ~batch + condition), fit, and test with FDR control. Inspect the PCA and p-value histogram as QC.pathway-enrichment skill. For GSEA, rank the full gene list by the DESeq2 stat; for ORA, pass the thresholded hit list (padj < 0.05, optionally |log2FC| > 1). Map gene IDs to symbols first.scientific-visualization skill. Volcano, MA, sample-distance heatmap, PCA, and enrichment dotplots, plus the MultiQC report for the QC narrative.This is the one stage with no upstream/downstream skill, so this skill owns it. scripts/build_counts_matrix.py converts quant output into exactly what pydeseq2 expects:
--from salmon): aggregates per-sample quant.sf to gene level with pytximport using counts_from_abundance="length_scaled_tpm" (the right choice for gene-level DE), needs a tx2gene map.--from star): reads each ReadsPerGene.out.tab, selecting the column for your --strandedness (unstranded/forward/reverse).--from featurecounts): parses the combined featureCounts matrix.It writes counts.csv (genes × samples, integers) and metadata_template.csv (one row per sample) for you to fill in. Salmon/RSEM counts are estimates (non-integer); they are rounded to integers because PyDESeq2 requires integer counts — see references/counts-and-handoff.md for why this is acceptable with length_scaled_tpm and how it differs from the offset-based DESeq2+tximport route. That reference also covers Ensembl→symbol mapping (needed before enrichment) and the exact orientation PyDESeq2 wants.
These cause most wrong or irreproducible bulk RNA-seq results:
~batch + condition). See references/design-and-qc.md.-s/Salmon library type silently discards ~half the reads. Use Salmon -l A or infer strandedness, and verify the assigned-reads fraction.pathway-enrichment or "nothing is significant".-r, tool versions, and the genome/annotation release.nextflow (runs nf-core/rnaseq, Path A; HPC/cloud/containers).gget (gget ref for genome+GTF, gget info/gget search for ID mapping), database-lookup (Ensembl/NCBI), biopython/pysam (FASTA/BAM handling).pydeseq2 (the DE engine this skill hands counts to).pathway-enrichment (ORA + GSEA; its scripts/run_enrichment.py reads a DESeq2 results CSV directly).scientific-visualization, matplotlib, seaborn; scientific-writing for the methods/results narrative.scanpy (single-cell), statistical-analysis (multiple-testing depth).Read the relevant file when you need depth — each is self-contained:
references/upstream-nfcore.md — Path A: samplesheet format, --aligner/--pseudo_aligner choice, key params, the salmon.merged.gene_counts*.tsv outputs, MultiQC, and what to hand to pydeseq2.references/upstream-manual.md — Path B: FastQC, fastp/Trim Galore, STAR genome index + alignment + --quantMode GeneCounts, Salmon decoy-aware index + quant, featureCounts, and how to determine strandedness.references/counts-and-handoff.md — turning quant output into PyDESeq2-ready counts.csv/metadata.csv (pytximport, STAR column selection, featureCounts), the integer/estimated-count nuance, Ensembl→symbol mapping, and the DE→enrichment rank/hit-list recipe.references/design-and-qc.md — experimental design (replication, batch, confounding, design formulas) and QC-metric interpretation (mapping rate, duplication, rRNA, complexity, PCA/outliers) — the defensible-pipeline backbone.development
Create, edit, analyze, or convert Excel spreadsheets (.xlsx, .xlsm) where the workbook file is the primary deliverable. Use for formulas, formatting, financial models, multi-sheet workbooks, and tabular cleanup exported to Excel. Also applies to .csv/.tsv when the user wants spreadsheet output. Do NOT use for Word documents, HTML reports, standalone Python scripts, database pipelines, or Google Sheets API work.
testing
Run structured What-If scenario analysis with 4–6 branch possibility exploration (best, likely, worst, wild card, contrarian, second-order). Use when the user asks speculative what-if questions about uncertain futures, strategic forks, contingency planning, or stress-testing a decision before committing.
development
Access comprehensive LaTeX templates, formatting requirements, and submission guidelines for major scientific publication venues (Nature, Science, PLOS, IEEE, ACM), academic conferences (NeurIPS, ICML, CVPR, CHI), research posters, and grant proposals (NSF, NIH, DOE, DARPA). This skill should be used when preparing manuscripts for journal submission, conference papers, research posters, or grant proposals and need venue-specific formatting requirements and templates.
development
Use this skill for processing and analyzing large tabular datasets (billions of rows) that exceed available RAM. Vaex excels at out-of-core DataFrame operations, lazy evaluation, fast aggregations, efficient visualization of big data, and machine learning on large datasets. Apply when users need to work with large CSV/HDF5/Arrow/Parquet files, perform fast statistics on massive datasets, create visualizations of big data, or build ML pipelines that do not fit in memory.