expression-matrix/counts-ingest/SKILL.md
Imports gene expression count matrices from featureCounts, HTSeq, STAR ReadsPerGene, Salmon/kallisto via tximport or tximeta, RSEM, 10X Genomics MTX/H5, AnnData H5AD, and RDS. Handles silent-miscounting traps (featureCounts -p v2.0.2 API break, STAR strandedness column choice, salmon NumReads-sum without tximport, RSEM non-integer expected_count, GENCODE _PAR_Y suffix, zero-length-transcript TPM divide-by-zero), and encodes the tximport countsFromAbundance decision tree with the "lengthScaledTPM is not TPM" warning. Use when assembling a gene-by-sample count matrix from aligner or quantifier output, importing salmon/kallisto for DESeq2 vs limma-voom, choosing strandedness column for STAR, debugging zero-count panics, or building tx2gene mapping.
npx skillsauth add GPTomics/bioSkills bio-expression-matrix-counts-ingestInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: pandas 2.2+, numpy 1.26+, scanpy 1.10+, anndata 0.10+, tximport 1.30+, tximeta 1.20+, GenomicFeatures 1.54+, Subread/featureCounts 2.0.6+ (post-v2.0.2 API), STAR 2.7.10+, Salmon 1.10+, kallisto 0.48+, RSEM 1.3.3+, HTSeq 2.0+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturespackageVersion('<pkg>') then ?function_name to verify parameters<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Load my featureCounts / Salmon / STAR output into a count matrix" -> Parse the per-sample quantification, strip metadata, choose the correct strandedness or NumReads column, optionally apply length-bias correction via tximport, and return a gene-by-sample matrix appropriate for the downstream DE tool.
lengthScaledTPM is NOT TPM; the naming has fooled manytximport(..., countsFromAbundance='lengthScaledTPM') returns a count-scale matrix (sum to library size) with the gene-length bias removed, intended as input to DE tools that cannot accept offsets (limma-voom). The word "TPM" in the option name has misled users into reporting these values as normalized abundance -- they are not.
The decision tree for countsFromAbundance (Soneson, Love, Robinson 2015 F1000Res 4:1521):
| Option | Returns | Use with |
|--------|---------|----------|
| 'no' (default) | Raw NumReads; length matrix passed as DESeq2/edgeR offset | DESeq2 via DESeqDataSetFromTximport, edgeR via DGEList -- the correct path for DGE |
| 'scaledTPM' | TPM scaled to library size | DGE when downstream tool cannot accept offsets (rare) |
| 'lengthScaledTPM' | TPM scaled by avg transcript length AND library size | limma-voom for DGE; recommended modern default when offsets unavailable |
| 'dtuScaledTPM' | Per-transcript scaling by median isoform length | Differential Transcript Usage (DRIMSeq, DEXSeq) ONLY; requires txOut=TRUE |
Two adjacent traps with the same flavor of silent miscount:
featureCounts -p since v2.0.2 (March 2021) needs --countReadPairs. Pre-v2.0.2, -p flagged paired-end AND counted pairs as fragments. Post-v2.0.2, -p only flags paired-end -- pairs are NOT counted as one fragment unless --countReadPairs is added. Old scripts run on new installs produce ~2x the expected counts (one count per mate). No warning. Always pass -p --countReadPairs together for paired-end.
STAR ReadsPerGene.out.tab column choice depends on library protocol. Column 2 is unstranded; column 3 is forward-stranded; column 4 is reverse-stranded. Illumina TruSeq Stranded (the dominant kit, dUTP-based) is reverse-stranded -- column 4. Reading column 3 for TruSeq throws away ~95% of reads. Reading column 2 conflates antisense expression. Verify with RSeQC infer_experiment.py on a few BAMs before assembling the matrix.
| Source | Output structure | Read into | Caveats |
|--------|------------------|-----------|---------|
| featureCounts (Liao 2014) | TSV with 6 metadata cols + N sample cols | read.delim (R), pd.read_csv(comment='#') (Python) | -p/--countReadPairs v2.0.2 break; -O double-counts; -s strandedness |
| HTSeq-count (Anders 2015) | Per-sample 2-col TSV with __ summary lines | Per-sample read, drop __ rows, concat | --mode matters (union vs intersection-strict vs intersection-nonempty); -s strandedness |
| STAR --quantMode GeneCounts | Per-sample 4-col ReadsPerGene.out.tab | Skip first 4 summary rows; pick column by strandedness | Column 4 = TruSeq Stranded |
| Salmon quant.sf | Per-sample 5-col TSV (Name, Length, EffectiveLength, TPM, NumReads) | tximport (recommended) or manual NumReads sum (length-biased) | Selective alignment is the default since Salmon 1.0.0 (Srivastava 2020); the --validateMappings flag is now a no-op |
| kallisto abundance.tsv / .h5 | Per-sample 5-col TSV; bootstraps in .h5 | tximport (gene-level) or sleuth (transcript-level with bootstrap variance) | kallisto quant -b 100 for sleuth |
| RSEM *.genes.results / *.isoforms.results | Per-sample TSV; expected_count is non-integer | tximport (type='rsem') with round() via DESeq2; or manual | Zero-length transcripts cause lengths > 0 is not TRUE |
| 10X Genomics CellRanger | filtered_feature_bc_matrix/ MTX dir OR .h5 | scanpy read_10x_mtx / read_10x_h5; Seurat Read10X | Single-cell convention: cells in rows |
| AnnData .h5ad | scverse single-file binary | scanpy read_h5ad; in R via zellkonverter | .X vs .layers['counts'] vs .raw.X semantics |
| RDS (from R) | R-serialized object | pyreadr (Python); base R readRDS | For Seurat objects, convert first |
| Scenario | Recommended approach |
|----------|---------------------|
| Bulk RNA-seq with featureCounts paired-end | featureCounts -p --countReadPairs -s 2 -t exon -g gene_id (TruSeq Stranded) |
| Bulk RNA-seq with STAR | --quantMode GeneCounts; read column 4 for TruSeq Stranded |
| Salmon/kallisto -> gene-level DGE with DESeq2 | tximport(type='salmon', tx2gene) + DESeqDataSetFromTximport() -- offsets handled |
| Salmon -> gene-level DGE with limma-voom | tximport(..., countsFromAbundance='lengthScaledTPM') -- no offset |
| Salmon/kallisto -> DTE (differential transcript expression) | tximport(..., txOut=TRUE, countsFromAbundance='dtuScaledTPM') for DRIMSeq/DEXSeq; OR edgeR::catchSalmon() for the Baldoni 2024 framework |
| kallisto with bootstraps -> sleuth (uncertainty-aware DE) | sleuth_prep() directly; do not go through tximport |
| RSEM -> DESeq2 | tximport(files, type='rsem', txIn=FALSE) + DESeqDataSetFromTximport() |
| Want automatic annotation provenance | tximeta (Love 2020 PLoS Comp Biol 16:e1007664) |
| 3'-tagged library (10x bulk, QuantSeq) | countsFromAbundance='no' WITHOUT length offset -- length bias negligible |
| 10X single-cell | sc.read_10x_h5() or Read10X_h5() |
| Strandedness unknown | RSeQC infer_experiment.py on 1-2 BAMs BEFORE re-running quantification |
-p and -O TrapsGoal: Run featureCounts correctly on paired-end stranded RNA-seq, then read the output without inheriting the metadata columns.
Approach: CLI invocation with -p --countReadPairs -s <strandedness>, plus optional -O --fraction for overlapping-gene handling; parse with pandas/read.delim stripping the 6 metadata columns.
featureCounts -T 8 -p --countReadPairs -s 2 \
-t exon -g gene_id \
-a annotation.gtf \
-o featurecounts.txt \
sample1.bam sample2.bam sample3.bam
import pandas as pd
fc = pd.read_csv('featurecounts.txt', sep='\t', comment='#')
counts = fc.set_index('Geneid').iloc[:, 5:]
counts.columns = [c.replace('.bam', '').split('/')[-1] for c in counts.columns]
fc <- read.delim('featurecounts.txt', comment.char = '#', row.names = 1)
counts <- fc[, 6:ncol(fc)]
colnames(counts) <- gsub('.*/|\\.bam$', '', colnames(counts))
| Flag | Meaning | Default | When to flip |
|------|---------|---------|--------------|
| -p | Input is paired-end | off | Always for paired-end -- AND add --countReadPairs |
| --countReadPairs | Count fragment (pair) as one | off in v2.0.2+ | Always for paired-end (post-v2.0.2 API change) |
| -s 0|1|2 | Strandedness | 0 (unstranded) | -s 2 for TruSeq Stranded; -s 1 for forward kits |
| -O | Allow multi-overlap | off | Off for typical DGE; on creates double-counts in overlapping genes |
| -M | Count multi-mappers | off | Off for DGE; on with --fraction for fractional counting |
| -t | Feature type | exon | Almost always exon |
| -g | Group attribute | gene_id | gene_id for gene-level; transcript_id is wrong (use Salmon for transcript-level) |
--quantMode GeneCountsGoal: Build a count matrix from STAR's per-sample 4-column output, choosing the correct strandedness column.
Approach: Skip the first 4 summary rows; index column 0 (gene ID); pick column for strandedness; concat across samples.
import pandas as pd
from pathlib import Path
def load_star_genecounts(filepaths, strandedness='reverse'):
'''Load STAR ReadsPerGene.out.tab files.
File columns (1-indexed): 1=gene_id, 2=unstranded, 3=forward, 4=reverse.
After read_csv with index_col=0, the three remaining columns are 0=unstranded, 1=forward, 2=reverse.
Illumina TruSeq Stranded is 'reverse'.
'''
col_map = {'unstranded': 0, 'forward': 1, 'reverse': 2}
col_idx = col_map[strandedness]
dfs = {}
for fp in filepaths:
sample = Path(fp).name.replace('_ReadsPerGene.out.tab', '')
df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
dfs[sample] = df.iloc[4:, col_idx]
return pd.DataFrame(dfs)
Strandedness verification: for a stranded library, the "wrong" strand column total should be <5% of the "right" strand column total. If comparable, the library was unstranded or there was a kit / config mix-up.
infer_experiment.py -r annotation.bed -i sample.bam
Goal: Import transcript-level Salmon or kallisto quantifications to gene-level counts with length-bias correction.
Approach: Build a tx2gene mapping; call tximport() with the right type and countsFromAbundance; hand the result to DESeq2 or limma-voom.
library(tximport)
library(DESeq2)
tx2gene <- read.csv('tx2gene.csv')
files <- file.path('salmon_out', samples$id, 'quant.sf')
names(files) <- samples$id
txi <- tximport(files, type = 'salmon', tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ condition)
The naive alternative -- summing NumReads to gene level -- is wrong when isoform usage varies across samples. NumReads is normalized against EffectiveLength per transcript; sum-then-DE introduces a length-by-condition bias indistinguishable from differential expression. tximport handles this by carrying the per-sample average transcript length matrix as a DESeq2/edgeR offset.
For limma-voom (no offset mechanism):
txi <- tximport(files, type = 'salmon', tx2gene = tx2gene,
countsFromAbundance = 'lengthScaledTPM')
y <- DGEList(counts = txi$counts)
v <- voom(y, design, plot = TRUE)
For DTU (DRIMSeq, DEXSeq), use dtuScaledTPM with txOut=TRUE.
For 3'-tagged libraries (10x Chromium 3', QuantSeq), length bias is negligible; use countsFromAbundance='no' WITHOUT the length offset (tximeta's argument or manual disable).
Goal: Import Salmon/kallisto with automatic linkage to the exact annotation release used in the index.
Approach: tximeta (Love, Soneson, Hickey et al. 2020 PLoS Comp Biol 16:e1007664) inspects the Salmon index hash and pulls matching annotation and metadata; the resulting SummarizedExperiment carries the provenance.
library(tximeta)
coldata <- data.frame(
names = samples$id,
files = file.path('salmon_out', samples$id, 'quant.sf'),
condition = samples$condition
)
se <- tximeta(coldata)
gse <- summarizeToGene(se)
library(DESeq2)
dds <- DESeqDataSet(gse, design = ~ condition)
For new projects, tximeta is the modern preference over hand-managed tx2gene -- it eliminates a class of "wrong annotation" bugs.
import pandas as pd
from pathlib import Path
def load_htseq_counts(filepaths):
'''Load HTSeq count files, dropping the __no_feature etc. summary rows.'''
dfs = {}
for fp in filepaths:
sample = Path(fp).stem.replace('_counts', '')
df = pd.read_csv(fp, sep='\t', header=None, index_col=0,
names=['gene', 'count'])
df = df[~df.index.str.startswith('__')]
dfs[sample] = df['count']
return pd.DataFrame(dfs)
HTSeq overlap modes (Anders, Pyl, Huber 2015 Bioinformatics 31:166):
| Mode | Behavior |
|------|----------|
| union (default) | Read assigned to union of overlapping features; >1 gene -> __ambiguous |
| intersection-strict | Every base of read must overlap the same single feature |
| intersection-nonempty | Intersection across positions; nonempty -> that gene wins |
-s yes|no|reverse: TruSeq Stranded is reverse. The --mode and -s flags must match the library; mismatches are silent miscounts.
The __no_feature and __ambiguous totals are QC signals. If __no_feature > 30%: annotation incomplete, wrong reference, or strandedness misspecified. If __ambiguous > 10%: many overlapping gene annotations (common with comprehensive GTFs); consider intersection-nonempty.
library(tximport)
library(DESeq2)
files <- file.path('rsem_out', paste0(samples$id, '.genes.results'))
names(files) <- samples$id
txi <- tximport(files, type = 'rsem', txIn = FALSE, txOut = FALSE)
txi$length[txi$length == 0] <- 1
dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ condition)
RSEM expected_count is non-integer (EM-derived). DESeq2 requires integers, but DESeqDataSetFromTximport rounds appropriately. The txi$length == 0 substitution handles the "Error: all(lengths > 0) is not TRUE" panic from rRNA-filtered or zero-length transcripts.
Avoid round() + DESeqDataSetFromMatrix -- that path loses the length correction.
kallisto quant -i index -o sample1_out -b 100 -t 8 read1_1.fq.gz read1_2.fq.gz
library(sleuth)
s2c <- data.frame(sample = samples$id,
condition = samples$condition,
path = file.path('kallisto_out', samples$id))
so <- sleuth_prep(s2c, ~ condition,
target_mapping = tx2gene,
aggregation_column = 'gene_id',
gene_mode = TRUE)
so <- sleuth_fit(so, ~ condition, 'full')
so <- sleuth_fit(so, ~ 1, 'reduced')
so <- sleuth_lrt(so, 'reduced', 'full')
results <- sleuth_results(so, 'reduced:full')
Sleuth (Pimentel et al. 2017 Nat Methods 14:687) uses the 100 bootstrap replicates from kallisto to decouple BIOLOGICAL variance (between-replicate) from INFERENTIAL variance (within-replicate, from EM uncertainty). For transcript-level analyses where quantification uncertainty matters (similar isoforms, low coverage), sleuth's response error linear model is more conservative than DESeq2 on plain counts. For well-quantified gene-level DGE, sleuth and DESeq2-via-tximport converge.
Selective alignment (Srivastava 2020 Genome Biol 21:239) -- which combines fast pseudo-mapping with traditional alignment of seed extensions to improve quantification accuracy for transcripts with sequence similarity -- has been the default since Salmon 1.0.0. The historical --validateMappings flag that explicitly requested it is now a deprecated no-op; passing it does nothing.
salmon quant -i index -l A \
-1 read_1.fq.gz -2 read_2.fq.gz \
--gcBias --seqBias \
-p 8 -o sample_out
--gcBias corrects fragment GC bias; --seqBias corrects random hexamer priming bias. Both are recommended for any Salmon run from RNA-seq with biological condition variation. They affect EffectiveLength estimates and propagate through tximport.
import scanpy as sc
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
library(Seurat)
mat <- Read10X(data.dir = 'filtered_feature_bc_matrix/')
mat <- Read10X_h5('filtered_feature_bc_matrix.h5')
10X convention: cells in rows (AnnData) or cells in columns (Seurat after Read10X). The R/Python convention differs by transpose.
Goal: Drop biotypes that distort downstream normalization or are irrelevant to the question.
Approach: Filter on gene_biotype from the GTF (Ensembl) or gene_type (GENCODE).
library(rtracklayer)
gtf <- import('Homo_sapiens.GRCh38.110.gtf.gz')
gene_info <- as.data.frame(gtf[gtf$type == 'gene',
c('gene_id', 'gene_name', 'gene_biotype')])
keep_biotypes <- c('protein_coding', 'lncRNA',
'IG_C_gene', 'IG_D_gene', 'IG_J_gene', 'IG_V_gene',
'TR_C_gene', 'TR_D_gene', 'TR_J_gene', 'TR_V_gene')
keep_genes <- gene_info$gene_id[gene_info$gene_biotype %in% keep_biotypes]
counts_filt <- counts[rownames(counts) %in% keep_genes, ]
Mt-encoded protein-coding genes (MT-CO1, MT-ND1, ...) are a judgment call: include for tissue-specific work (heart, muscle); drop when mitochondrial fraction varies with cell stress and could confound normalization.
In single-cell, mitochondrial percentage is a STANDARD QC metric (cells with high %mito are stressed/dying); see single-cell/preprocessing.
| Difference | Ensembl | GENCODE |
|------------|---------|---------|
| chromosome naming | 1, 2, ..., MT | chr1, chr2, ..., chrM |
| PAR gene encoding (releases 25-43) | Once on chrX | Both chrX and chrY with _PAR_Y suffix on chrY copies |
| PAR gene encoding (releases 44+ / Ensembl 110+) | Once on chrX | chrY copies get their own ENSG accessions; _PAR_Y retired |
| Subset releases | Single release | basic (high-confidence subset) and comprehensive (everything) |
CRITICAL: code that strips Ensembl version suffixes via sub('\\..*', '', x) ALSO strips the _PAR_Y tag in GENCODE 25-43, collapsing the chrY duplicate onto the chrX gene -- silently introducing duplicate row indices. Use sub('\\.[0-9]+(_PAR_Y)?$', '\\1', x) to preserve.
BAM files aligned to GENCODE (chr1) cannot be quantified against Ensembl GTF (1) without rename. Mismatched naming causes __no_feature to dominate.
library(edgeR)
y <- DGEList(counts = counts, group = group)
keep <- filterByExpr(y, design = model.matrix(~ condition, coldata))
y <- y[keep, , keep.lib.sizes = FALSE]
min_counts, min_samples = 10, 3
expressed = (counts >= min_counts).sum(axis=1) >= min_samples
counts_filt = counts.loc[expressed]
See differential-expression/edger-basics for filterByExpr semantics. DESeq2 has automatic independent filtering at results() time; manual pre-filter is speed-only.
Trigger: Pipeline written against featureCounts pre-v2.0.2 used -p only; rerun on Subread 2.0.6 produces counts ~2x expected.
Mechanism: Post-v2.0.2 -p flags paired-end but does NOT count pairs as fragments. Each mate is counted separately.
Symptom: Library sizes ~2x what RNA-seq QC reports; downstream CPM compressed; "more reads than expected" panic.
Fix: Add --countReadPairs. Re-run featureCounts.
Trigger: TruSeq Stranded library quantified via STAR; user reads column 3 (forward); ~95% of reads dropped.
Mechanism: TruSeq Stranded is reverse-stranded (column 4). Column 3 is forward-stranded; for a reverse library, almost no reads align in the forward orientation.
Symptom: Library sizes ~5% of expected; almost no DE detectable; very low gene detection rate.
Fix: Read column 4. Verify with infer_experiment.py.
Trigger: Salmon output read without tximport; per-transcript NumReads summed to gene level via groupby.
Mechanism: NumReads is normalized against per-transcript EffectiveLength. Summing ignores that. If treatment shifts isoform usage from short to long, the gene appears upregulated even with constant total mRNA.
Symptom: DE genes overlap with known isoform-switching genes (e.g., during development); fold changes don't replicate at the protein level.
Fix: Use tximport (or catchSalmon in edgeR) to carry the length matrix as a DE offset.
Trigger: tximport(files, type='rsem') errors out with "all(lengths > 0) is not TRUE".
Mechanism: RSEM reports effective_length = 0 for certain very-short or filtered transcripts.
Symptom: Pipeline halts at import.
Fix: txi$length[txi$length == 0] <- 1 after tximport; or filter tx2gene to exclude affected transcripts.
_PAR_Y stripped, chrY duplicates lostTrigger: GENCODE v40 quantification; rownames(counts) <- sub('\\..*', '', rownames(counts)); duplicate row indices.
Mechanism: Default regex strips _PAR_Y along with the version suffix; chrY PAR copies collapse onto chrX gene IDs.
Symptom: Duplicate row warnings; sample-specific counts double for affected genes; downstream aggregate(... ~ rownames) adds them.
Fix: Use the regex that preserves _PAR_Y: sub('\\.[0-9]+(_PAR_Y)?$', '\\1', x). Or upgrade to GENCODE 44+ where the issue is gone.
Trigger: Mouse RNA-seq with --mode intersection-strict; many overlapping-gene reads dropped; library size 30% lower than featureCounts on the same data.
Mechanism: intersection-strict requires every base to overlap the same feature; reads crossing exon-intron boundaries or overlapping gene boundaries get dropped.
Symptom: Lower count totals than other quantifiers; __no_feature and __ambiguous high.
Fix: --mode union (default) or --mode intersection-nonempty for richer recovery.
| Error / symptom | Cause | Fix |
|-----------------|-------|-----|
| Error: all(lengths > 0) is not TRUE | RSEM zero-length transcripts | txi$length[txi$length == 0] <- 1 |
| Duplicate row indices after version strip | _PAR_Y collapsed | Use the preserving regex |
| __no_feature >30% | Wrong reference; wrong strandedness; chrN naming mismatch | Verify reference + strandedness; check chr vs N naming |
| counts matrix should be integers | Salmon/RSEM raw counts to DESeq2 | Use DESeqDataSetFromTximport; or round (loses length correction) |
| Library sizes ~2x expected for paired-end | featureCounts v2.0.2 -p without --countReadPairs | Add --countReadPairs |
| TPM column has Inf values | Zero-length features | Filter Length > 0 before TPM computation |
| Salmon TPM doesn't sum to 1e6 | Pre-summed across samples or filtered subset | Recompute TPM after filter; or use original abundance column |
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.