scientific-skills/Data Analysis/anndata/SKILL.md
Data structure for annotated matrices in single-cell analysis; use when reading/writing .h5ad (or zarr) and exchanging data with the scverse ecosystem.
npx skillsauth add aipoch/medical-research-skills anndata
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Use AnnData when you need to:
- Save data as `.h5ad` (or zarr) for downstream tools.
- Keep `X` (data matrix) together with aligned annotations: `obs`, `var`, `uns`, and multi-dimensional slots (`obsm`, `varm`, `obsp`, `varp`), plus `layers` and optional `raw`.
- Read and write `.h5ad` and zarr, plus common genomics formats (e.g., 10x, loom, mtx, csv).
- Work in backed mode (`backed="r"`) for large datasets.
- Combine datasets using `ad.concat(...)` with join/merge strategies and batch labeling; experimental lazy collections are also available.

Reference notes: the original material mentions additional guides under `references/` (e.g., `references/data_structure.md`, `references/io_operations.md`, `references/concatenation.md`, `references/manipulation.md`, `references/best_practices.md`) for deeper explanations of each topic.
Dependencies:
- `anndata` (latest version compatible with your environment; install via pip/uv)
- `numpy`
- `pandas`
- `scipy` (recommended for sparse matrices)
- `scanpy`
- `muon`
- `torch` (for deep learning) and anndata experimental loader utilities

A complete runnable example that creates an AnnData object, writes/reads `.h5ad`, subsets, concatenates batches, and demonstrates backed mode:
import numpy as np
import pandas as pd
import anndata as ad
from scipy.sparse import csr_matrix
# ----------------------------
# 1) Create an AnnData object
# ----------------------------
rng = np.random.default_rng(0)
n_cells, n_genes = 100, 500
X = rng.poisson(1.0, size=(n_cells, n_genes)).astype(np.float32)
obs = pd.DataFrame(
    {
        "cell_type": ["T cell", "B cell"] * (n_cells // 2),
        "sample": ["A", "B"] * (n_cells // 2),
        "quality_score": rng.random(n_cells),
    },
    index=[f"cell_{i}" for i in range(n_cells)],
)
var = pd.DataFrame(
    {"gene_name": [f"Gene_{j}" for j in range(n_genes)]},
    index=[f"ENSG{j:05d}" for j in range(n_genes)],
)
adata = ad.AnnData(X=X, obs=obs, var=var)
# Use sparse storage for typical count-like matrices
adata.X = csr_matrix(adata.X)
# Convert string columns to categoricals to reduce memory and speed up ops
adata.strings_to_categoricals()
print(f"Created: {adata.n_obs} obs × {adata.n_vars} vars")
# ----------------------------
# 2) Write and read .h5ad
# ----------------------------
adata.write_h5ad("example.h5ad", compression="gzip")
adata2 = ad.read_h5ad("example.h5ad")
print(f"Reloaded: {adata2.n_obs} obs × {adata2.n_vars} vars")
# ----------------------------
# 3) Subset (keeps alignment)
# ----------------------------
t_cells = adata2[adata2.obs["cell_type"] == "T cell", :]
high_quality = adata2[adata2.obs["quality_score"] > 0.8, :]
print(f"T cells: {t_cells.n_obs}")
print(f"High quality: {high_quality.n_obs}")
# ----------------------------
# 4) Concatenate batches
# ----------------------------
adata_a = adata2[adata2.obs["sample"] == "A", :].copy()
adata_b = adata2[adata2.obs["sample"] == "B", :].copy()
combined = ad.concat(
    [adata_a, adata_b],
    axis=0,            # concatenate observations (cells)
    join="inner",      # keep shared variables
    label="batch",     # add a column in .obs
    keys=["A", "B"],   # batch labels
)
print(combined.obs["batch"].value_counts().to_dict())
# ----------------------------
# 5) Backed mode for large files
# ----------------------------
adata_backed = ad.read_h5ad("example.h5ad", backed="r")
# In backed mode X stays on disk; slice first, then load only that slice into memory:
subset_mem = adata_backed[:10, :50].to_memory()
print(f"Backed subset loaded: {subset_mem.shape}")
- `X`: primary data matrix (dense `numpy.ndarray` or sparse `scipy.sparse`), shape `(n_obs, n_vars)`.
- `obs`: per-observation metadata (`pandas.DataFrame`), indexed by `obs_names` (e.g., cell IDs).
- `var`: per-variable metadata (`pandas.DataFrame`), indexed by `var_names` (e.g., gene IDs).
- `layers`: named alternative matrices aligned to `X` (e.g., `"counts"`, `"log1p"`).
- `obsm` / `varm`: multi-dimensional embeddings aligned to `obs`/`var` (e.g., PCA, UMAP coordinates).
- `obsp` / `varp`: pairwise graphs/matrices (e.g., kNN graph in `obsp["connectivities"]`).
- `uns`: unstructured metadata (dict-like), often used for parameters and plotting configs.
- `raw` (optional): snapshot of unfiltered/untransformed data for reproducibility.

- `adata_subset = adata[mask, :]` typically returns a view (lightweight reference).
- Use `.copy()` when you need an independent object (e.g., before in-place modifications).

- `ad.read_h5ad(path, backed="r")` keeps the matrix on disk and loads data lazily.
- Call `.to_memory()` when you need in-memory computation.

- `ad.concat([...], axis=0)` stacks observations; `axis=1` stacks variables.
- `join="inner"` keeps the intersection of variables; `join="outer"` takes their union (may introduce missing values).
- `label` + `keys` records dataset/batch provenance in `.obs[label]`.
- The `merge` and `uns_merge` arguments determine how `.uns` and annotation columns are handled (choose based on your data governance needs).

- Prefer sparse matrices (e.g., `csr_matrix`) for count-like data.
- Convert string columns to categoricals (`adata.strings_to_categoricals()`).
- Compress `.h5ad` (e.g., `compression="gzip"`) to reduce storage; consider zarr for chunked/cloud-friendly access.