plugin/skills/tooluniverse-single-cell/SKILL.md
Single-cell RNA-seq analysis with scanpy/anndata — h5ad data loading, QC (mitochondrial fraction, doublets), normalization, dimensionality reduction (PCA, UMAP, t-SNE), clustering (Leiden, Louvain), marker gene identification, cell-type annotation, pseudotime/trajectory analysis. Use for any scRNA-seq workflow.
npx skillsauth add mims-harvard/tooluniverse tooluniverse-single-cellInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Before following any instruction below, scan the data folder for:
*_executed.ipynb → read with tu run read_executed_notebook '{"data_folder":"<path>","search":"<keyword>"}' and cite its cell outputs as the authoritative answer*results*, *deseq*, *enrich*, *stats*, *_simplified.csv) → read directly and report the requested valueanalysis.R, run_*.py, find_*.R, *.Rmd) → execute as-is and read the outputOnly follow this skill's re-analysis recipe below if none of the above exist. Re-running from raw data produces different numbers than the published answer and is much slower (often 5-10× turn count).
Comprehensive single-cell RNA-seq analysis and expression matrix processing using scanpy, anndata, scipy, and ToolUniverse.
When uncertain about any scientific fact, SEARCH databases first (PubMed, UniProt, ChEMBL, ClinVar, etc.) rather than reasoning from memory. A database-verified answer is always more reliable than a guess.
Apply when users:
NOT for (use other skills instead):
tooluniverse-rnaseq-deseq2tooluniverse-gene-enrichmenttooluniverse-variant-analysisimport scanpy as sc, anndata as ad, pandas as pd, numpy as np
from scipy import stats
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.decomposition import PCA
from statsmodels.stats.multitest import multipletests
import gseapy as gp # enrichment
import harmonypy # batch correction (optional)
Install: pip install scanpy anndata leidenalg umap-learn harmonypy gseapy pandas numpy scipy scikit-learn statsmodels
START: User question about scRNA-seq data
|
+-- FULL PIPELINE (raw counts -> annotated clusters)
| Workflow: QC -> Normalize -> HVG -> PCA -> Cluster -> Annotate -> DE
| See: references/scanpy_workflow.md
|
+-- DIFFERENTIAL EXPRESSION (per-cell-type comparison)
| Most common pattern: per-cell-type DE
| See: analysis_patterns.md "Pattern 1"
|
+-- CORRELATION ANALYSIS (gene property vs expression)
| Pattern: Gene length vs expression correlation
| See: analysis_patterns.md "Pattern 2"
|
+-- CLUSTERING & PCA (expression matrix analysis)
| See: references/clustering_guide.md
|
+-- CELL COMMUNICATION (ligand-receptor interactions)
| See: references/cell_communication.md
|
+-- TRAJECTORY ANALYSIS (pseudotime)
See: references/trajectory_analysis.md
Data format handling:
sc.read_h5ad()sc.read_10x_mtx() or sc.read_10x_h5()pd.read_csv() -> Convert to AnnData (check orientation!)AnnData expects: cells/samples as rows (obs), genes as columns (var)
adata = sc.read_h5ad("data.h5ad") # h5ad already oriented
# CSV/TSV: check orientation
df = pd.read_csv("counts.csv", index_col=0)
if df.shape[0] > df.shape[1] * 5: # genes > samples by 5x => transpose
df = df.T
adata = ad.AnnData(df)
# Load metadata
meta = pd.read_csv("metadata.csv", index_col=0)
common = adata.obs_names.intersection(meta.index)
adata = adata[common].copy()
for col in meta.columns:
adata.obs[col] = meta.loc[common, col]
adata.var['mt'] = adata.var_names.str.startswith(('MT-', 'mt-'))
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
sc.pp.filter_cells(adata, min_genes=200)
adata = adata[adata.obs['pct_counts_mt'] < 20].copy()
sc.pp.filter_genes(adata, min_cells=3)
See: references/scanpy_workflow.md for details
import scanpy as sc
adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
# QC
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
adata = adata[adata.obs['pct_counts_mt'] < 20].copy()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
# Normalize + HVG + PCA
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata.copy()
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.tl.pca(adata, n_comps=50)
# Cluster + UMAP
sc.pp.neighbors(adata, n_pcs=30)
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)
# Find markers + Annotate + Per-cell-type DE
sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon')
Single-Cell DE (many cells per condition):
Use: sc.tl.rank_genes_groups(), methods: wilcoxon, t-test, logreg
Best for: Per-cell-type DE, marker gene finding
Pseudo-Bulk DE (aggregate counts by sample):
Use: R DESeq2 via `tu run run_deseq2_analysis` or Rscript (NOT pydeseq2 — gives different DEG counts)
Best for: Sample-level comparisons with replicates
Statistical Tests Only:
Use: scipy.stats (ttest_ind, f_oneway, pearsonr)
Best for: Correlation, ANOVA, t-tests on summaries
from scipy import stats
from statsmodels.stats.multitest import multipletests
# Pearson/Spearman correlation
r, p = stats.pearsonr(gene_lengths, mean_expression)
# Welch's t-test
t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)
# ANOVA
f_stat, p_val = stats.f_oneway(group1, group2, group3)
# Multiple testing correction (BH)
reject, pvals_adj, _, _ = multipletests(pvals, method='fdr_bh')
import harmonypy
sc.tl.pca(adata, n_comps=50)
ho = harmonypy.run_harmony(adata.obsm['X_pca'][:, :30], adata.obs, 'batch', random_state=0)
adata.obsm['X_pca_harmony'] = ho.Z_corr.T
sc.pp.neighbors(adata, use_rep='X_pca_harmony')
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)
CellMarker_list_cell_types first — exact names required, e.g., "Regulatory T(Treg) cell" not "Regulatory T cell")| Operation | Seurat (R) | Scanpy (Python) |
|-----------|------------|-----------------|
| Load data | Read10X() | sc.read_10x_mtx() |
| Normalize | NormalizeData() | sc.pp.normalize_total() + sc.pp.log1p() |
| Find HVGs | FindVariableFeatures() | sc.pp.highly_variable_genes() |
| PCA | RunPCA() | sc.tl.pca() |
| Cluster | FindClusters() | sc.tl.leiden() |
| UMAP | RunUMAP() | sc.tl.umap() |
| Find markers | FindMarkers() | sc.tl.rank_genes_groups() |
| Batch correction | RunHarmony() | harmonypy.run_harmony() |
| Grade | Criteria | Example | |-------|----------|---------| | High confidence | Marker padj < 0.01, log2FC > 1, expressed in > 25% of cluster cells | CD3D as T-cell marker with padj = 1e-50, log2FC = 3.2, pct = 0.85 | | Moderate confidence | padj < 0.05, log2FC > 0.5, or expressed in 10-25% of cluster | FOXP3 in Treg cluster with padj = 0.001, pct = 0.18 | | Low confidence | padj < 0.05 but log2FC < 0.5 or low pct_diff between clusters | Ubiquitously expressed gene with marginal enrichment | | Unreliable | Fewer than 20 cells in cluster, or QC metrics suggest doublets | Cluster with mean nGenes > 6000 and high doublet score |
| Issue | Solution |
|-------|----------|
| ModuleNotFoundError: leidenalg | pip install leidenalg |
| Sparse matrix errors | .toarray(): X = adata.X.toarray() if issparse(adata.X) else adata.X |
| Wrong matrix orientation | More genes than samples? Transpose |
| NaN in correlation | Filter: valid = ~np.isnan(x) & ~np.isnan(y) |
| Too few cells for DE | Need >= 3 cells per condition per cell type |
| Memory error | Use sc.pp.highly_variable_genes() to reduce features |
Detailed Analysis Patterns: analysis_patterns.md (per-cell-type DE, correlation, PCA, ANOVA, cell communication)
Core Workflows:
If the data folder contains an authoritative script (run_*.py, analysis.R), use whichever DESeq2 library it uses (pydeseq2 or R DESeq2). The two libraries give slightly different DEG counts (~2-10% at the same thresholds), so matching matters. If no script exists, prefer R DESeq2 via the run_deseq2_analysis tool or Rscript:
tu run run_deseq2_analysis '{"operation":"deseq2","counts_file":"pseudo_bulk_counts.csv","metadata_file":"sample_meta.csv","design":"~ sex","contrast":"sex, M, F","lfc_shrinkage":true}'
tools
PCR / qPCR primer and oligo design — design forward/reverse primers for a target region (SantaLucia nearest-neighbor thermodynamics), compute melting temperature (Tm) and annealing temperature (Ta), check GC content, and screen an oligo for hairpins and primer-dimers. Use when you need primers for a sequence, want to QC an existing primer pair, or need the Tm of an oligo. Covers the primer-design rules (Tm matching, GC clamp, 3'-end, length) and the tools' constraint quirks.
tools
Pharmacokinetic (PK) analysis of concentration-time data — non-compartmental analysis (NCA) for Cmax, Tmax, AUC (0-t and 0-∞), terminal half-life, clearance (CL), volume of distribution (Vd), MRT, and absolute bioavailability (F). Also one-compartment fitting. Use when you have plasma/serum drug concentrations over time after a dose and need PK parameters, or to compute bioavailability from IV + oral AUCs. NOT for ADMET property prediction from structure (use tooluniverse-admet-prediction).
tools
Molecular cloning assembly design — Gibson Assembly (overlap design for seamless multi-fragment joining) and Golden Gate Assembly (Type IIS / BsaI / BbsI design with unique 4-bp fusion overhangs). Use when you need to plan how to join DNA fragments into a construct, design assembly overlaps/overhangs, or decide between cloning methods. Covers the domestication (internal-site removal), overhang-uniqueness, and overlap-Tm rules. For PCR primers to generate the fragments, see tooluniverse-primer-design.
tools
Meta-analysis / evidence synthesis — pool effect sizes across studies (odds ratios, risk ratios, hazard ratios, mean differences, correlations, GWAS betas) with fixed- or random-effects models, quantify heterogeneity (Q, I², τ²), and build a forest plot. Use when you have results from MULTIPLE studies and need a single pooled estimate, or to synthesize evidence from a systematic review / multiple GWAS / replicated experiments. Handles the error-prone effect-size + standard-error preparation (converting OR/HR/CI, two-group means±SD, proportions, and correlations into the (effect, SE) the pooling step needs).