skills/tooluniverse-single-cell/SKILL.md
Single-cell RNA-seq analysis with scanpy/anndata — h5ad data loading, scRNA-seq quality control and QC gating (n_genes_by_counts, total_counts, mitochondrial percent / pct_counts_mt, pct_counts_ribo, doublet detection with Scrublet/scDblFinder, ambient RNA / SoupX awareness, empty-droplet filtering, MAD-based thresholds), normalization, dimensionality reduction (PCA, UMAP, t-SNE), clustering (Leiden, Louvain), marker gene identification, cell-type annotation, pseudotime/trajectory analysis. Use for any scRNA-seq workflow, including deciding which cells to filter, flag, or investigate before downstream analysis.
npx skillsauth add mims-harvard/tooluniverse tooluniverse-single-cellInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Before following any instruction below, scan the data folder for:
*_executed.ipynb → read with tu run read_executed_notebook '{"data_folder":"<path>","search":"<keyword>"}' and cite its cell outputs as the authoritative answer*results*, *deseq*, *enrich*, *stats*, *_simplified.csv) → read directly and report the requested valueanalysis.R, run_*.py, find_*.R, *.Rmd) → execute as-is and read the outputOnly follow this skill's re-analysis recipe below if none of the above exist. Re-running from raw data produces different numbers than the published answer and is much slower (often 5-10× turn count).
Comprehensive single-cell RNA-seq analysis and expression matrix processing using scanpy, anndata, scipy, and ToolUniverse.
When uncertain about any scientific fact, SEARCH databases first (PubMed, UniProt, ChEMBL, ClinVar, etc.) rather than reasoning from memory. A database-verified answer is always more reliable than a guess.
Apply when users:
NOT for (use other skills instead):
tooluniverse-rnaseq-deseq2tooluniverse-gene-enrichmenttooluniverse-variant-analysisimport scanpy as sc, anndata as ad, pandas as pd, numpy as np
from scipy import stats
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import PCA
from statsmodels.stats.multitest import multipletests
import gseapy as gp # enrichment
import harmonypy # batch correction (optional)
Install: pip install scanpy anndata leidenalg umap-learn harmonypy gseapy pandas numpy scipy scikit-learn statsmodels
START: User question about scRNA-seq data
|
+-- FULL PIPELINE (raw counts -> annotated clusters)
| Workflow: QC -> Normalize -> HVG -> PCA -> Cluster -> Annotate -> DE
| See: references/scanpy_workflow.md
|
+-- DIFFERENTIAL EXPRESSION (per-cell-type comparison)
| Most common pattern: per-cell-type DE
| See: analysis_patterns.md "Pattern 1"
|
+-- CORRELATION ANALYSIS (gene property vs expression)
| Pattern: Gene length vs expression correlation
| See: analysis_patterns.md "Pattern 2"
|
+-- CLUSTERING & PCA (expression matrix analysis)
| See: references/clustering_guide.md
|
+-- CELL COMMUNICATION (ligand-receptor interactions)
| See: references/cell_communication.md
|
+-- TRAJECTORY ANALYSIS (pseudotime)
See: references/trajectory_analysis.md
Data format handling:
sc.read_h5ad()sc.read_10x_mtx() or sc.read_10x_h5()pd.read_csv() -> Convert to AnnData (check orientation!)AnnData expects: cells/samples as rows (obs), genes as columns (var)
adata = sc.read_h5ad("data.h5ad") # h5ad already oriented
# CSV/TSV: check orientation
df = pd.read_csv("counts.csv", index_col=0)
if df.shape[0] > df.shape[1] * 5: # genes > samples by 5x => transpose
df = df.T
adata = ad.AnnData(df)
# Load metadata
meta = pd.read_csv("metadata.csv", index_col=0)
common = adata.obs_names.intersection(meta.index)
adata = adata[common].copy()
for col in meta.columns:
adata.obs[col] = meta.loc[common, col]
QC gating decides which cells and genes are real before normalization, clustering, or DE. Skipping or rushing it propagates silently: doublets become fake "intermediate" states, ambient RNA smears markers across clusters, and empty droplets inflate cell counts. Never report a filtered cell count without the gates applied, and never report cutoffs you did not actually run.
HONEST EXECUTION: QC runs scanpy/AnnData via Bash/Python. If scanpy is not
installed, do NOT fabricate metrics — print the install plan and stop:
python scripts/scrna_qc.py --install-plan (exits 0, lists what is missing,
suggests pip install scanpy anndata scrublet).
vn = adata.var_names.str.upper()
adata.var['mt'] = vn.str.startswith('MT-') # mitochondrial
adata.var['ribo'] = vn.str.startswith(('RPS', 'RPL')) # ribosomal protein
adata.var['hb'] = vn.str.contains(r'^HB[^P]', regex=True) # hemoglobin (RBC)
sc.pp.calculate_qc_metrics(
adata, qc_vars=['mt', 'ribo', 'hb'],
percent_top=None, # REQUIRED for small gene panels (<500) — else IndexError
log1p=True, inplace=True)
Key metrics in adata.obs: n_genes_by_counts, total_counts,
pct_counts_mt, pct_counts_ribo, pct_counts_hb.
pct_counts_mt -> dying / stressed cell. A ruptured membrane lets
cytoplasmic mRNA leak out while mito transcripts stay trapped, enriching the
captured RNA for mito. Cutoff is tissue-dependent (cardiomyocytes/hepatocytes
are mito-rich at baseline — a blanket 10% would discard healthy cells).n_genes_by_counts / total_counts -> empty droplet or debris
(only ambient RNA captured; few genes, low depth).pct_counts_hb -> RBC/blood contamination in solid tissue.Hardcoded cutoffs (mt<5%, n_genes<2500) are a starting point only; they
break on mito-rich tissues and on shallow vs deep libraries. Prefer a
MAD-based rule (robust to the very outliers you are removing): flag cells
nmadsmedian-absolute-deviations from the median. Usenmads=5on log1p counts/genes (both tails),nmads=3upper-only onpct_counts_mt, and pair mito with a biological ceiling so a uniformly degraded sample doesn't pass. Always visualize distributions first (violin +total_countsvspct_counts_mtscatter — dying cells sit in the low-count/high-mito corner).
Run the helper (computes metrics + MAD gating, reports per-step removals):
python scripts/scrna_qc.py data.h5ad --doublets # or --install-plan first
sc.pp.scrublet, scanpy >=1.10) or scDblFinder (R).
Run per sample before merging; flag-cluster-drop (doublets form bridge
clusters). expected_doublet_rate ~0.8%/1,000 cells recovered (10x).| QC metric | Direction | Typical concern | Suggested action |
|-----------|-----------|-----------------|------------------|
| pct_counts_mt | high | Dying / stressed cell (membrane rupture) | Filter (MAD upper + tissue-aware ceiling; raise ceiling for mito-rich tissue) |
| n_genes_by_counts | very low | Empty droplet / debris | Filter (min_genes ~200 + low-tail MAD) |
| total_counts | very low | Shallow / failed capture | Filter (low-tail MAD; check barcode-rank knee) |
| n_genes_by_counts / total_counts | very high | Doublet (two cells in one droplet) | Flag, run Scrublet/scDblFinder, then drop — don't hard-cap on counts alone |
| predicted_doublet (Scrublet) | True | Multiplet | Filter per sample; cluster-then-drop if unsure |
| pct_counts_hb | high | RBC / blood contamination | Filter in non-blood tissue; investigate in blood |
| pct_counts_ribo | very high/low | Low-complexity / stressed, or cell-type signal | Flag / investigate (ribo is cell-type-specific; rarely a hard filter) |
| markers ubiquitous across clusters | — | Ambient RNA contamination | Investigate — run SoupX/DecontX, do not silently proceed |
| many cells, low median counts | — | Empty droplets not removed | Investigate — apply EmptyDrops / knee, re-filter |
Full reasoning, MAD code, Scrublet/SoupX/EmptyDrops recipes, and order of operations: references/scrna_qc.md. Inline pipeline QC: references/scanpy_workflow.md Phase 2.
import scanpy as sc
adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
# QC
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
adata = adata[adata.obs['pct_counts_mt'] < 20].copy()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
# Normalize + HVG + PCA
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata.copy()
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.tl.pca(adata, n_comps=50)
# Cluster + UMAP
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)
# Find markers + Annotate + Per-cell-type DE
sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon')
Single-Cell DE (many cells per condition):
Use: sc.tl.rank_genes_groups(), methods: wilcoxon, t-test, logreg
Best for: Per-cell-type DE, marker gene finding
Pseudo-Bulk DE (aggregate counts by sample):
Use: R DESeq2 via `tu run run_deseq2_analysis` or Rscript (NOT pydeseq2 — gives different DEG counts)
Best for: Sample-level comparisons with replicates
Statistical Tests Only:
Use: scipy.stats (ttest_ind, f_oneway, pearsonr)
Best for: Correlation, ANOVA, t-tests on summaries
from scipy import stats
from statsmodels.stats.multitest import multipletests
# Pearson/Spearman correlation
r, p = stats.pearsonr(gene_lengths, mean_expression)
# Welch's t-test
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)
# ANOVA
f_stat, p_val = stats.f_oneway(group1, group2, group3)
# Multiple testing correction (BH)
reject, pvals_adj, _, _ = multipletests(pvals, method='fdr_bh')
import harmonypy
sc.tl.pca(adata, n_comps=50)
ho = harmonypy.run_harmony(adata.obsm['X_pca'][:, :30], adata.obs, 'batch', random_state=0)
adata.obsm['X_pca_harmony'] = ho.Z_corr.T
sc.pp.neighbors(adata, use_rep='X_pca_harmony')
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)
CellMarker_list_cell_types first — exact names required, e.g., "Regulatory T(Treg) cell" not "Regulatory T cell")| Operation | Seurat (R) | Scanpy (Python) |
|-----------|------------|-----------------|
| Load data | Read10X() | sc.read_10x_mtx() |
| Normalize | NormalizeData() | sc.pp.normalize_total() + sc.pp.log1p() |
| Find HVGs | FindVariableFeatures() | sc.pp.highly_variable_genes() |
| PCA | RunPCA() | sc.tl.pca() |
| Cluster | FindClusters() | sc.tl.leiden() |
| UMAP | RunUMAP() | sc.tl.umap() |
| Find markers | FindMarkers() | sc.tl.rank_genes_groups() |
| Batch correction | RunHarmony() | harmonypy.run_harmony() |
| Grade | Criteria | Example | |-------|----------|---------| | High confidence | Marker padj < 0.01, log2FC > 1, expressed in > 25% of cluster cells | CD3D as T-cell marker with padj = 1e-50, log2FC = 3.2, pct = 0.85 | | Moderate confidence | padj < 0.05, log2FC > 0.5, or expressed in 10-25% of cluster | FOXP3 in Treg cluster with padj = 0.001, pct = 0.18 | | Low confidence | padj < 0.05 but log2FC < 0.5 or low pct_diff between clusters | Ubiquitously expressed gene with marginal enrichment | | Unreliable | Fewer than 20 cells in cluster, or QC metrics suggest doublets | Cluster with mean nGenes > 6000 and high doublet score |
| Issue | Solution |
|-------|----------|
| ModuleNotFoundError: leidenalg | pip install leidenalg |
| Sparse matrix errors | .toarray(): X = adata.X.toarray() if issparse(adata.X) else adata.X |
| Wrong matrix orientation | More genes than samples? Transpose |
| NaN in correlation | Filter: valid = ~np.isnan(x) & ~np.isnan(y) |
| Too few cells for DE | Need >= 3 cells per condition per cell type |
| Memory error | Use sc.pp.highly_variable_genes() to reduce features |
Detailed Analysis Patterns: analysis_patterns.md (per-cell-type DE, correlation, PCA, ANOVA, cell communication)
Core Workflows:
If the data folder contains an authoritative script (run_*.py, analysis.R), use whichever DESeq2 library it uses (pydeseq2 or R DESeq2). The two libraries give slightly different DEG counts (~2-10% at the same thresholds), so matching matters. If no script exists, prefer R DESeq2 via the run_deseq2_analysis tool or Rscript:
tu run run_deseq2_analysis '{"operation":"deseq2","counts_file":"pseudo_bulk_counts.csv","metadata_file":"sample_meta.csv","design":"~ sex","contrast":"sex, M, F","lfc_shrinkage":true}'
tools
Post-market safety surveillance and recall/adverse-event RETRIEVAL across the full spectrum of FDA-regulated products that are NOT covered by the drug-AE signal skills: medical devices, food / dietary supplements / cosmetics, veterinary drugs, and drug supply (shortages). Orchestrates openFDA endpoints (MAUDE device adverse events + device recalls + 510(k), CAERS food/supplement/ cosmetic adverse events, veterinary adverse events, drug shortages, and cross-product enforcement/recall reports). USE WHEN the user asks: "are there adverse events for [device / pacemaker / infusion pump / insulin pump]", "device recalls for [firm/product]", "supplement / vitamin / cosmetic adverse reactions", "is [drug] in shortage", "what injectables are on shortage", "veterinary / animal adverse events for [drug] in [dog/cat/horse]", "food recall for listeria", "MAUDE report for [device]", "CAERS reactions for [brand]". DO NOT USE for drug adverse-event SIGNAL detection or disproportionality (PRR / ROR / IC) or drug-AE association scoring — that is `tooluniverse-pharmacovigilance` / `tooluniverse-adverse-event-detection`. This skill is multi-product surveillance and retrieval, not drug-AE statistical signal mining.
tools
--- name: tooluniverse-phewas description: Cross-ancestry / cross-biobank phenome-wide association (PheWAS) and replication. Given ONE variant (rsID) or ONE gene, look up every phenotype it associates with across European/UK (UKB-TOPMed), Finnish (FinnGen), Japanese (BioBank Japan), and Taiwanese (TPMI) biobanks, plus exome-wide gene-burden PheWAS (Genebass), then judge whether an association replicates across ancestries or is population-specific. Use whenever the user asks "what else is this va
tools
Dereplicate a putative natural product and assign its chemical taxonomy. Use to answer "is [compound] a known natural product", "what microbe/organism produces [compound]", "what chemical class is [compound]", "dereplicate this metabolite (by formula/exact mass/InChIKey/SMILES)", or "classify this molecule into ChemOnt". Searches NPAtlas for known microbial natural products (producing organism + literature reference), assigns the ChemOnt kingdom→superclass→class→subclass hierarchy via ClassyFire, resolves systematic IUPAC names to structure via OPSIN, and cross-references identity in PubChem. NOT for general drug/compound identity or ADMET (use tooluniverse-chemical-compound-retrieval / tooluniverse-small-molecule-discovery) and NOT for metabolomics pathway/enrichment analysis (use tooluniverse-metabolomics skills).
tools
Genome-ASSEMBLY discovery, QC, and replicon mapping for any organism (bacteria, archaea, fungi, and beyond) using NCBI Datasets. Resolves an organism name or taxid to assemblies, picks the reference/representative or best-quality assembly, pulls assembly QC metrics (total length, contig/scaffold N50, contig count, GC%, assembly level, RefSeq category), enumerates chromosomes and plasmids via per-replicon sequence reports, and compares candidate assemblies on quality. Use for "what genomes are available for [organism]", "assembly stats / N50 / GC content for [GCF_/GCA_ accession]", "how many plasmids does [strain] have", "compare assemblies for [species]", "find the reference genome for [taxon]", "is this assembly Complete Genome or just contigs". NOT for gene-level orthology/synteny (use tooluniverse-comparative-genomics), plant gene structure (use tooluniverse-plant-genomics), de novo assembly from raw reads (no tool exists), or taxonomy-only name/lineage lookups.