machine-learning/atlas-mapping/SKILL.md
Maps query single-cell data to reference atlases using scArches transfer learning with scVI and scANVI models. Transfers cell type labels without retraining on combined data. Use when annotating new single-cell datasets using pre-trained reference models.
npx skillsauth add GPTomics/bioSkills bio-machine-learning-atlas-mappingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: anndata 0.10+, scanpy 1.10+, scvi-tools 1.1+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturesIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Map my scRNA-seq data onto a reference atlas" -> Transfer cell type labels from a pre-trained reference model to query cells using architectural surgery (scArches) without retraining.
scvi.model.SCVI.load_query_data() -> get_latent_representation() -> scanpy.tl.ingest()Goal: Map query single-cell data onto a pre-trained reference model to obtain a shared latent embedding.
Approach: Load a pre-trained scVI model, prepare query data with matching gene sets, then perform surgical fine-tuning that updates only query-specific parameters.
import scvi
import scanpy as sc
# Load pre-trained reference model
adata_ref = sc.read_h5ad('reference.h5ad')
# Model must have been saved with save_anndata=True
scvi.model.SCVI.setup_anndata(adata_ref, layer='counts', batch_key='batch')
ref_model = scvi.model.SCVI.load('reference_model/', adata=adata_ref)
# Prepare query data
adata_query = sc.read_h5ad('query.h5ad')
# Subset to reference genes
adata_query = adata_query[:, adata_ref.var_names].copy()
# Set up query AnnData using reference setup
scvi.model.SCVI.prepare_query_anndata(adata_query, ref_model)
# Load query into model (creates "surgical" fine-tuned model)
query_model = scvi.model.SCVI.load_query_data(adata_query, ref_model)
# Surgical training: update only query-specific parameters
# weight_decay=0.0: Standard for surgery; prevents reference drift
query_model.train(max_epochs=200, plan_kwargs={'weight_decay': 0.0})
# Get latent representation
adata_query.obsm['X_scVI'] = query_model.get_latent_representation()
Goal: Transfer cell type labels from a labeled reference atlas to unlabeled query data.
Approach: Train a semi-supervised scANVI model on the reference, then map query cells via surgical fine-tuning and predict labels with confidence scores.
import scvi
import scanpy as sc
# Reference with cell type labels
adata_ref = sc.read_h5ad('reference_labeled.h5ad')
scvi.model.SCVI.setup_anndata(adata_ref, layer='counts', batch_key='batch')
ref_vae = scvi.model.SCVI(adata_ref, n_latent=30)
ref_vae.train(max_epochs=100)
# Convert to scANVI (semi-supervised)
scvi.model.SCANVI.setup_anndata(adata_ref, layer='counts', batch_key='batch', labels_key='cell_type', unlabeled_category='Unknown')
ref_scanvi = scvi.model.SCANVI.from_scvi_model(ref_vae, labels_key='cell_type', unlabeled_category='Unknown')
ref_scanvi.train(max_epochs=50)
ref_scanvi.save('reference_scanvi/')
# Map query data
adata_query = sc.read_h5ad('query.h5ad')
adata_query = adata_query[:, adata_ref.var_names].copy()
scvi.model.SCANVI.prepare_query_anndata(adata_query, ref_scanvi)
query_scanvi = scvi.model.SCANVI.load_query_data(adata_query, ref_scanvi)
query_scanvi.train(max_epochs=100, plan_kwargs={'weight_decay': 0.0})
# Transfer labels
adata_query.obs['predicted_cell_type'] = query_scanvi.predict()
adata_query.obsm['X_scANVI'] = query_scanvi.get_latent_representation()
Goal: Assess reliability of transferred labels and flag cells that may represent novel types.
Approach: Extract soft prediction probabilities from the scANVI model and identify low-confidence assignments below a threshold.
# Get prediction probabilities
soft_predictions = query_scanvi.predict(soft=True)
adata_query.obs['prediction_confidence'] = soft_predictions.max(axis=1)
# Flag low-confidence predictions
# confidence < 0.5: May be novel cell type or poor mapping
low_conf = adata_query.obs['prediction_confidence'] < 0.5
print(f'Low confidence predictions: {low_conf.sum()} ({low_conf.mean():.1%})')
Goal: Visualize reference and query cells together to assess integration quality.
Approach: Concatenate reference and query datasets, compute UMAP from the shared latent representation, and color by dataset and cell type.
import scanpy as sc
# Combine reference and query for visualization
adata_combined = adata_ref.concatenate(adata_query, batch_key='dataset', batch_categories=['reference', 'query'])
# Use latent space for neighbors/UMAP
sc.pp.neighbors(adata_combined, use_rep='X_scVI')
sc.tl.umap(adata_combined)
sc.pl.umap(adata_combined, color=['dataset', 'cell_type'], save='_transfer.png')
| Atlas | Model | URL | |-------|-------|-----| | Human Lung Cell Atlas | scANVI | cellxgene.cziscience.com | | Tabula Sapiens | scVI | tabula-sapiens-portal.ds.czbiohub.org | | Mouse Cell Atlas | scVI | bis.zju.edu.cn/MCA |
| Parameter | Surgical | Full Retrain | Notes | |-----------|----------|--------------|-------| | weight_decay | 0.0 | 0.001 | 0.0 preserves reference | | max_epochs | 100-200 | 200-400 | Less for surgery | | early_stopping | True | True | Prevents overfitting |
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.