Short Description: Best practices for annotating cell types in single-cell RNA-seq data using marker-based, automated, and reference-based approaches.
Authors: Distilled from "Single-cell best practices" by Luecken, M.D. et al.
Affiliations: Helmholtz Munich, Wellcome Sanger Institute, Harvard Medical School, and contributors
Version: 1.0
Last Updated: January 2025
License: CC BY 4.0
Commercial Use: ✅ Allowed
Source: https://www.sc-best-practices.org/cellular_structure/annotation.html
Citation: Luecken, M.D. & Theis, F.J. (2019). Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular Systems Biology, 15(6), e8746.
Cell type annotation is the process of assigning cell type labels to clusters or individual cells in single-cell RNA-seq data. This guide covers three main approaches and their practical implementation.
A cell type is a stable identity defined by a developmental trajectory and core marker gene program (e.g., CD4+ T cell, hepatocyte). A cell state is a transient condition (activated, cycling, stressed) overlaid on a cell type. Annotation should target cell types first; states are attributes that may further subdivide a type but should not be conflated with type identity.
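One way to keep type and state from being conflated is to store them in separate fields and join them only for display. The sketch below is illustrative: the `Annotation` class and its field names are hypothetical, not part of any library or of the source guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """Hypothetical container: identity and state live in separate fields."""
    cell_type: str                    # stable identity, e.g. "CD4+ T cell"
    cell_state: Optional[str] = None  # transient overlay, e.g. "activated"

    def display_label(self) -> str:
        # Compose a label for plots without overwriting the type itself.
        if self.cell_state is None:
            return self.cell_type
        return f"{self.cell_type} ({self.cell_state})"


resting = Annotation("CD4+ T cell")
activated = Annotation("CD4+ T cell", "activated")

# Both cells share one type, so per-type counts are unaffected by state.
print(resting.cell_type == activated.cell_type)
print(activated.display_label())
```

In an AnnData object the same idea would correspond to two separate `adata.obs` columns rather than one combined label column.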
Marker genes are genes whose expression is enriched in a specific cell type relative to other cells in the same tissue context. Reliable annotation uses panels of multiple markers (typically 3-5 per type) rather than a single gene, because expression is noisy in droplet-based scRNA-seq and many markers are shared across related types. Markers come in two flavors: canonical (literature-derived, e.g., CD3D for T cells) and data-derived (from differential expression on the dataset).
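The reason for using panels rather than single genes can be illustrated with a simple panel score: average the expression of each panel's genes per cluster and pick the highest-scoring panel. This is a minimal sketch on made-up numbers; `cluster_means`, `panel_score`, and `best_panel` are hypothetical names, and a plain mean is only one of several reasonable scoring choices.

```python
from statistics import mean

# Hypothetical mean log-normalized expression per cluster (toy numbers).
cluster_means = {
    "0": {"CD3D": 2.1, "CD3E": 1.8, "MS4A1": 0.1, "CD79A": 0.0, "LYZ": 0.2},
    "1": {"CD3D": 0.1, "CD3E": 0.0, "MS4A1": 2.4, "CD79A": 2.0, "LYZ": 0.1},
}

panels = {
    "T cells": ["CD3D", "CD3E"],
    "B cells": ["MS4A1", "CD79A"],
}


def panel_score(expr, genes):
    """Average expression over the panel genes that were measured."""
    present = [expr[g] for g in genes if g in expr]
    return mean(present) if present else 0.0


def best_panel(expr, panels):
    """Return (label, score) of the highest-scoring panel for one cluster."""
    scores = {label: panel_score(expr, genes) for label, genes in panels.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]


calls = {c: best_panel(expr, panels)[0] for c, expr in cluster_means.items()}
print(calls)
```

Because the score averages several genes, dropout of any single marker in a cluster shifts the score less than it would for a one-gene rule. Scanpy's `sc.tl.score_genes` computes a related score against a background gene set.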
A reference atlas is a previously annotated dataset (e.g., Human Cell Atlas, Tabula Sapiens) used to project labels onto a new "query" dataset. Most label transfer methods (scArches, scANVI, Azimuth) map query cells into a reference latent space or embedding and assign labels from nearby reference cells or from a classifier trained there; SingleR instead scores each query cell by correlation with reference expression profiles. Quality of transfer depends on tissue match, technology match (e.g., 10x v3 vs. Smart-seq2), and species match.
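The nearest-neighbor idea behind embedding-based label transfer can be sketched in a few lines. This is a simplified stand-in, not how scArches, scANVI, Azimuth, or SingleR are implemented: the 2-D coordinates are made up, and `transfer_label`, `k`, and `min_confidence` are hypothetical names chosen for illustration.

```python
from collections import Counter
from math import dist

# Hypothetical 2-D latent coordinates; real embeddings have more dimensions.
reference = [
    ((0.0, 0.0), "T cell"),
    ((0.1, 0.2), "T cell"),
    ((0.2, 0.1), "T cell"),
    ((5.0, 5.0), "B cell"),
    ((5.1, 4.9), "B cell"),
    ((4.9, 5.2), "B cell"),
]


def transfer_label(query_point, reference, k=3, min_confidence=0.6):
    """Vote among the k nearest reference cells; return (label, confidence)."""
    nearest = sorted(reference, key=lambda item: dist(query_point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    confidence = count / len(nearest)
    if confidence < min_confidence:
        # Neighbors disagree: do not force a label onto the query cell.
        return "Unknown", confidence
    return label, confidence


print(transfer_label((0.05, 0.1), reference))       # deep inside one population
print(transfer_label((2.6, 2.5), reference, k=4))   # between two populations
```

Returning "Unknown" for low-agreement cells keeps poorly matched query cells (wrong tissue, technology, or species) from silently inheriting a reference label.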
Use this tree to choose an annotation approach:
```
       Do you have a well-characterized tissue
        with a high-quality reference atlas?
                          │
            ┌─────────────┴─────────────┐
            │                           │
           YES                          NO
            │                           │
            ▼                           ▼
Is this a standard tissue       Are you studying
(PBMC, lung, gut) with a       novel cell types or
 pre-trained classifier?        exploratory data?
            │                           │
      ┌─────┴─────┐               ┌─────┴─────┐
      │           │               │           │
     YES          NO             YES          NO
      │           │               │           │
      ▼           ▼               ▼           ▼
  Automated  Reference-     Manual marker  Manual +
(CellTypist)    based           based     automated
             (scArches,       (Scanpy,   cross-check
              Azimuth,         Seurat)
              SingleR)
```
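The same decision tree can be written as a small function, which is convenient when the choice has to be recorded in a pipeline config or log. The function and parameter names are hypothetical; the branches simply mirror the tree.

```python
def choose_approach(has_reference_atlas,
                    has_pretrained_classifier=False,
                    novel_or_exploratory=False):
    """Mirror the decision tree; return the suggested annotation approach."""
    if has_reference_atlas:
        # Left branch: well-characterized tissue with a good reference.
        if has_pretrained_classifier:
            return "Automated (CellTypist)"
        return "Reference-based (scArches, Azimuth, SingleR)"
    # Right branch: no suitable reference atlas.
    if novel_or_exploratory:
        return "Manual marker-based (Scanpy, Seurat)"
    return "Manual + automated cross-check"


print(choose_approach(True, has_pretrained_classifier=True))
print(choose_approach(False, novel_or_exploratory=True))
```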
| Scenario | Approach | Primary Tool | Validation |
|----------|----------|--------------|------------|
| Standard human PBMC, large dataset (>100k cells) | Automated | CellTypist | Spot-check with manual markers |
| Well-characterized tissue (lung, kidney, brain) | Reference-based label transfer | scArches / Azimuth | Marker consistency on top clusters |
| Novel/rare tissue, no good reference | Manual marker-based | Scanpy / Seurat | Hierarchical, broad-to-fine |
| Cross-species (e.g., zebrafish) | Manual markers + ortholog mapping | Scanpy + custom panel | Compare to closest reference species |
| Developmental / continuous trajectory | Reference-based with state-aware model | scANVI / scArches | Trajectory coherence + markers |
| Disease tissue with known perturbation | Manual + automated cross-check | CellTypist + Scanpy | Confirm disease-specific states separately |
Identify cell types by examining expression of known marker genes in each cluster.
Tools: Scanpy, Seurat
Best for: Small datasets, novel cell types, high confidence needs
Use pre-trained classifiers to automatically assign cell type labels.
Tools: CellTypist, scAnnotate
Best for: Standard tissues, quick preliminary annotation, large datasets
Transfer labels from annotated reference datasets to your query data.
Tools: scArches, scANVI, Azimuth, SingleR
Best for: Well-characterized tissues, integration with public data
```python
# Scanpy example
# Assumes `adata` is log-normalized and has a 'leiden' column in adata.obs
import scanpy as sc

# Calculate marker genes for clusters
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')

# Visualize top markers
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)

# Plot known markers
markers = {
    'T cells': ['CD3D', 'CD3E', 'CD4', 'CD8A'],
    'B cells': ['CD19', 'MS4A1', 'CD79A'],
    'Monocytes': ['CD14', 'FCGR3A', 'LYZ'],
    'NK cells': ['NCAM1', 'NKG7', 'GNLY']
}
sc.pl.dotplot(adata, markers, groupby='leiden')
```
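Marker plots typically fail when a requested gene is absent from the dataset, for example after highly-variable-gene filtering. A small pre-check that splits each panel into present and missing genes avoids this. The helper name `filter_markers` is hypothetical, and the toy `var_names` list stands in for `adata.var_names`.

```python
def filter_markers(panels, available_genes):
    """Split each panel into genes that are present and genes that are missing."""
    available = set(available_genes)
    kept, missing = {}, {}
    for cell_type, genes in panels.items():
        found = [g for g in genes if g in available]
        absent = [g for g in genes if g not in available]
        if found:
            kept[cell_type] = found
        if absent:
            missing[cell_type] = absent
    return kept, missing


# Toy gene list standing in for adata.var_names (an assumption for illustration).
var_names = ["CD3D", "CD3E", "CD8A", "MS4A1", "CD79A", "NKG7", "GNLY"]
markers = {
    "T cells": ["CD3D", "CD3E", "CD4", "CD8A"],
    "B cells": ["CD19", "MS4A1", "CD79A"],
    "NK cells": ["NCAM1", "NKG7", "GNLY"],
}

kept, missing = filter_markers(markers, var_names)
print("plot these:", kept)
print("not in dataset:", missing)
```

Passing `kept` to the plotting call and reporting `missing` separately makes it visible when a panel has been reduced to one or two genes, which weakens the annotation.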
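```python
# CellTypist example (fast, accurate for immune cells)
import celltypist
from celltypist import models

# Download the immune cell model, then load it
models.download_models(model='Immune_All_Low.pkl')
model = models.Model.load(model='Immune_All_Low.pkl')

# Predict cell types; CellTypist expects log1p-normalized expression
# (10,000 counts per cell) in adata.X
predictions = celltypist.annotate(adata, model=model, majority_voting=True)
adata = predictions.to_adata()
```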
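The `majority_voting=True` option refines per-cell predictions by taking the dominant label within local clusters. The idea can be sketched as follows; this is a simplified illustration with a hypothetical `majority_vote` helper and toy labels, not CellTypist's implementation.

```python
from collections import Counter, defaultdict


def majority_vote(cell_labels, cell_clusters):
    """Replace each cell's label with the most common label in its cluster."""
    by_cluster = defaultdict(list)
    for label, cluster in zip(cell_labels, cell_clusters):
        by_cluster[cluster].append(label)
    winners = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in by_cluster.items()
    }
    return [winners[cluster] for cluster in cell_clusters]


per_cell = ["T cell", "T cell", "NK cell", "B cell", "B cell", "B cell"]
clusters = ["0", "0", "0", "1", "1", "1"]
print(majority_vote(per_cell, clusters))
```

Voting smooths out isolated misclassifications, but it can also erase a genuinely rare population that sits inside a larger cluster, so the per-cell predictions are worth keeping alongside the voted ones.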
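```python
# scArches example for label transfer
import scarches as sca

# Attach the query data to a pre-trained reference model.
# The query needs the reference model's genes and a labels column
# filled with the model's unlabeled category.
model = sca.models.SCANVI.load_query_data(
    adata=adata,  # Your query data
    reference_model="path/to/reference_model"
)

# Fine-tune on the query (weight_decay=0.0 follows the scArches tutorials),
# then transfer labels
model.train(max_epochs=100, plan_kwargs={"weight_decay": 0.0})
adata.obs['transferred_labels'] = model.predict()
```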
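Before running label transfer it helps to check how many of the reference model's genes are actually present in the query. A minimal sketch, assuming the reference gene list is available as a plain Python list; `gene_overlap` is a hypothetical helper and the gene lists are toy data.

```python
def gene_overlap(query_genes, reference_genes):
    """Return shared genes and the fraction of reference genes found in the query."""
    query = set(query_genes)
    shared = [g for g in reference_genes if g in query]
    fraction = len(shared) / len(reference_genes) if reference_genes else 0.0
    return shared, fraction


reference_genes = ["CD3D", "CD3E", "MS4A1", "CD79A", "LYZ"]
query_genes = ["CD3D", "MS4A1", "LYZ", "GNLY"]

shared, fraction = gene_overlap(query_genes, reference_genes)
print(f"{len(shared)}/{len(reference_genes)} reference genes present ({fraction:.0%})")
```

A low fraction often points to a gene identifier mismatch (symbols vs. Ensembl IDs) or a species mismatch rather than a true biological difference.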
| Scenario | Recommended Tool | Why |
|----------|------------------|-----|
| Immune cells (human) | CellTypist | Pre-trained on large immune atlases |
| Mouse tissues | scArches + Mouse Cell Atlas | Comprehensive mouse reference |
| Novel cell types | Manual + Scanpy/Seurat | Need domain expertise |
| Large datasets (>100k cells) | CellTypist | Fast, scalable |
| Cross-species | Manual markers | Limited reference transfer |
| Developmental data | scArches | Handles continuous states |
Issue: All clusters look similar → Increase clustering resolution, check if data is normalized
Issue: Too many small clusters → Decrease resolution, merge similar clusters based on markers
Issue: Automated tool gives inconsistent results → Check input normalization, try multiple tools, fall back to manual
Issue: Can't find clear markers for cluster → May be transitional state, doublet, or low-quality cells
Issue: Reference transfer fails → Check batch correction, ensure overlapping gene sets, verify tissue match
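When two annotation sources disagree, a per-cluster agreement rate shows where to focus manual review. The sketch below assumes both label sets have already been harmonized to the same vocabulary; `agreement_by_cluster`, the 0.7 cutoff, and the toy labels are hypothetical.

```python
from collections import defaultdict


def agreement_by_cluster(clusters, labels_a, labels_b):
    """Fraction of cells per cluster where two annotation sources agree."""
    total = defaultdict(int)
    same = defaultdict(int)
    for cluster, a, b in zip(clusters, labels_a, labels_b):
        total[cluster] += 1
        same[cluster] += int(a == b)
    return {cluster: same[cluster] / total[cluster] for cluster in total}


clusters = ["0", "0", "0", "0", "1", "1"]
manual = ["T cell", "T cell", "T cell", "T cell", "B cell", "B cell"]
automated = ["T cell", "T cell", "NK cell", "T cell", "B cell", "Monocyte"]

rates = agreement_by_cluster(clusters, manual, automated)
flagged = [cluster for cluster, rate in rates.items() if rate < 0.7]
print(rates)
print("review manually:", flagged)
```

Clusters with low agreement are candidates for the causes listed above: transitional states, doublets, low-quality cells, or a reference that does not match the tissue.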