skills/data-handling/SKILL.md
Data handling best practices for R and Python data science analysis scripts. Use when writing data manipulation code, analysis pipelines, or .qmd scripts that process scientific/analytical data (e.g., filtering, joining, normalizing datasets). Do NOT load for general Python scripting, infrastructure code, or configuration management.
npx skillsauth add musserlab/lab-claude-skills data-handlingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
When writing data analysis code — in R or Python — follow these practices to ensure transparency, reproducibility, and catch errors early.
Group all data reads at the top of each script (or in dedicated setup/input chunks), with comments distinguishing external data from other scripts' outputs:
R:
# --- Inputs (from other scripts) ---
mdata <- readRDS(here("outs/01_analysis/mdata.rds"))
modules <- read_tsv(here("outs/02_module_lists/modules.tsv"))
# --- Inputs (external data) ---
gene_names <- read_tsv(here("data/gene_naming/spongilla_gene_names_final.tsv"))
Python:
# --- Inputs (from other scripts) ---
mdata = pd.read_parquet(PROJECT_ROOT / "outs/01_analysis/mdata.parquet")
modules = pd.read_csv(PROJECT_ROOT / "outs/02_module_lists/modules.tsv", sep="\t")
# --- Inputs (external data) ---
gene_names = pd.read_csv(PROJECT_ROOT / "data/gene_naming/spongilla_gene_names_final.tsv", sep="\t")
This makes dependencies self-documenting: reading the top of any script shows exactly what it needs and where those files come from. See the script-organization skill for full conventions.
Include summaries whenever datasets are created or substantially altered.
R:
# After loading data
data <- read_csv(here("data/raw_data.csv"))
cat("Loaded", nrow(data), "rows,", ncol(data), "columns\n")
glimpse(data)
# After major transformations
data_filtered <- data %>%
filter(quality_score > 0.8)
cat("After quality filter:", nrow(data_filtered), "of", nrow(data), "rows retained\n")
# After joins
data_merged <- data %>%
left_join(annotations, by = "gene_id")
cat("After annotation join:", nrow(data_merged), "rows,",
sum(!is.na(data_merged$annotation)), "with annotations\n")
Python:
# After loading data
data = pd.read_csv(PROJECT_ROOT / "data/raw_data.csv")
print(f"Loaded {len(data)} rows, {len(data.columns)} columns")
data.info()
# After major transformations
data_filtered = data.query("quality_score > 0.8")
print(f"After quality filter: {len(data_filtered)} of {len(data)} rows retained")
# After joins
data_merged = data.merge(annotations, on="gene_id", how="left")
print(f"After annotation join: {len(data_merged)} rows, "
f"{data_merged['annotation'].notna().sum()} with annotations")
When to show data:
Interpret "analytical decisions" broadly. Any operation that transforms, scales, or interprets data should be annotated with the reasoning. Document the "why" directly in code comments or markdown.
R:
# Normalize by library size using TMM (Robinson & Oshlack 2010)
# TMM chosen over RLE because of high proportion of zeros in sponge data
norm_factors <- calcNormFactors(dge, method = "TMM")
# Filter low-expression genes: require CPM > 1 in at least 3 samples
# Threshold based on smallest group size (n=3 per condition)
keep <- rowSums(cpm(dge) > 1) >= 3
dge <- dge[keep, ]
cat("Genes retained after expression filter:", sum(keep), "of", length(keep), "\n")
Python:
# Z-score normalize per gene across samples
# Chosen over quantile normalization to preserve relative differences
from scipy import stats
data_z = data.apply(stats.zscore, axis=1)
# Filter low-abundance features: require > 0 in at least 3 samples
# Threshold based on smallest group size (n=3 per condition)
keep = (data > 0).sum(axis=1) >= 3
data_filtered = data.loc[keep]
print(f"Features retained: {keep.sum()} of {len(keep)}")
What to annotate:
R:
cat("Before join:", nrow(data), "rows\n")
data <- data %>% inner_join(other_data, by = "key")
cat("After join:", nrow(data), "rows\n")
Python:
print(f"Before join: {len(data)} rows")
data = data.merge(other_data, on="key", how="inner")
print(f"After join: {len(data)} rows")
R:
unmatched <- data %>% anti_join(other_data, by = "key")
if (nrow(unmatched) > 0) {
cat("WARNING:", nrow(unmatched), "rows will not match\n")
cat("Unmatched keys:", head(unique(unmatched$key)), "...\n")
}
Python:
unmatched = data[~data["key"].isin(other_data["key"])]
if len(unmatched) > 0:
print(f"WARNING: {len(unmatched)} rows will not match")
print(f"Unmatched keys: {unmatched['key'].unique()[:5]}...")
R:
required_cols <- c("id", "value", "category")
missing <- setdiff(required_cols, names(data))
if (length(missing) > 0) stop("Missing columns: ", paste(missing, collapse = ", "))
Python:
required_cols = ["id", "value", "category"]
missing = set(required_cols) - set(data.columns)
if missing:
raise ValueError(f"Missing columns: {missing}")
R:
stopifnot("No data after filter" = nrow(data) > 0)
stopifnot("Unexpected NAs in key column" = !any(is.na(data$key)))
Python:
assert len(data) > 0, "No data after filter"
assert data["key"].notna().all(), "Unexpected NAs in key column"
lm(), glm(), lmFit() drop rows with NAs (complete case analysis)cor() with use = "complete.obs" excludes incomplete casesna.action = na.omitas.numeric() on character introduces NAspd.merge() with how="inner" silently drops unmatched rowsdf.dropna() can remove more rows than expected if applied to wide dataframesdf.groupby() excludes NA keys by default (use dropna=False to include)df.astype(float) on non-numeric strings raises errors (use pd.to_numeric(errors="coerce") — but this silently introduces NaN).value_counts() excludes NaN by default (use dropna=False)df[condition]["col"] = val) may fail silently — use .loc[]mean(), sum(), sd()/std() with NA removal (na.rm=TRUE / skipna=True) silently ignore NAs — report how many: cat/print("NAs ignored:", sum(is.na(x)))For rendered output: Put validation in chunks with #| include: false but keep summaries visible:
#| label: validate-join
#| include: false
#| message: true
# Validation (hidden in output)
unmatched <- data %>% anti_join(other, by = "key")
if (nrow(unmatched) > 0) {
message("WARNING: ", nrow(unmatched), " rows unmatched")
}
stopifnot(nrow(result) > 0)
#| label: show-result
# Summary (visible in output)
cat("Final dataset:", nrow(result), "rows\n")
glimpse(result)
Key pattern: cat() vs message()
cat() — verbose diagnostics, hidden with include: falsemessage() — warnings that appear during renderingprint()/glimpse() — data summaries, keep visible#| label: validate-join
#| include: false
# Validation (hidden in output)
unmatched = data[~data["key"].isin(other["key"])]
if len(unmatched) > 0:
import warnings
warnings.warn(f"{len(unmatched)} rows unmatched")
assert len(result) > 0, "No data after join"
#| label: show-result
# Summary (visible in output)
print(f"Final dataset: {len(result)} rows")
result.info()
result.head()
Key pattern: print() vs warnings.warn()
print() — diagnostics, hidden with include: false or kept visiblewarnings.warn() — warnings that surface in rendering outputdf.info() / df.head() / df.describe() — data summaries, keep visibleGuidance for working with .gz, .tar.gz, .zip, and other compressed input files.
Most tools can read compressed files directly — avoid unnecessary decompression.
R:
# readr handles .gz transparently — just use the .gz path
data <- read_tsv(here("data/counts.tsv.gz"))
# Base R also works with gzfile()
data <- read.csv(gzfile(here("data/counts.csv.gz")))
# For .zip files, use unz()
data <- read_tsv(unz(here("data/archive.zip"), "counts.tsv"))
Python:
# pandas handles .gz transparently
data = pd.read_csv(PROJECT_ROOT / "data/counts.tsv.gz", sep="\t")
# For explicit gzip handling
import gzip
with gzip.open(PROJECT_ROOT / "data/sequences.fasta.gz", "rt") as f:
content = f.read()
# BioPython handles .gz FASTA/FASTQ
from Bio import SeqIO
import gzip
with gzip.open("data/sequences.fasta.gz", "rt") as f:
records = list(SeqIO.parse(f, "fasta"))
Some tools require uncompressed files (e.g., certain bioinformatics tools that need random access). In that case:
data/ — data/ is read-onlyouts/<script>/ — the script's output directoryimport gzip
import shutil
gz_path = PROJECT_ROOT / "data/reference.fasta.gz"
decompressed = out_dir / "reference.fasta"
if not decompressed.exists():
with gzip.open(gz_path, "rb") as f_in:
with open(decompressed, "wb") as f_out:
shutil.copyfileobj(f_in, f_out)
print(f"Decompressed {gz_path.name} -> {decompressed.name}")
For multi-file archives, extract to outs/:
import tarfile
archive = PROJECT_ROOT / "data/reference_files.tar.gz"
extract_dir = out_dir / "reference_files"
if not extract_dir.exists():
with tarfile.open(archive, "r:gz") as tar:
tar.extractall(path=extract_dir)
print(f"Extracted {len(list(extract_dir.rglob('*')))} files to {extract_dir.name}/")
readr and Python's pandas handle .gz nativelydata/ — always decompress to outs/<script>/When writing analysis code interactively, do not just write code and move on. The user is a scientist who needs to stay informed about what's happening to the data. Treat coding as a conversation, not a monologue.
After every significant data operation, report to the user:
Before writing a join, verify key format on both sides:
head(), sample(), nchar(), grepl()head() alone can be misleading if early rows are unrepresentative (e.g., unannotated genes first). Use sample_n() or check the distribution.Geneid has format 'c12345-g1' for 30k genes but 'c12345-g1 Annotation...' for 10k — these won't match Trinity_geneID"When something doesn't match expectations, stop and say so:
seurat_name column has mostly bare Trinity IDs (25k) and only 4k with annotation — does that seem right?"This is not optional. Silently writing code that produces plausible-looking output is how bugs like column collisions and wrong join keys survive into production.
Column collisions from multiple joins:
When joining the same annotation table at multiple points in a script, check whether earlier joins already added columns that will collide. Example: joining gene_names[, c("id", "name", "label")] in Section 3 and again in Section 8 creates label.x/label.y. Fix: only join the columns you need at each point, or join once and carry the result forward.
Namespace masking (R/Bioconductor):
Loading Bioconductor packages (DESeq2, edgeR, AnnotationDbi) imports S4 generics that mask dplyr functions. Common victims: rename(), select(), filter(), count(), slice(). Always use dplyr::rename(), dplyr::select() etc. when Bioconductor packages are loaded in the same session.
Join key format verification: Before any join, verify the key column values (not just names) match on both sides:
# R: Check format on both sides before joining
head(table_a$key, 5)
head(table_b$key, 5)
sum(table_a$key %in% table_b$key) # how many will match?
# Python: Check format on both sides before joining
print(table_a["key"].head())
print(table_b["key"].head())
print(table_a["key"].isin(table_b["key"]).sum()) # how many will match?
The user is a scientist who needs to understand and approve the analytical choices being made. Do not silently resolve ambiguities or edge cases.
Always surface to the user before proceeding:
Concretely:
"unmatched" or NA, print them, and ask the user before assigning a defaultcase_when, if/elif/else), show the user what falls into the catch-all/else caseDo NOT assume the "obvious" answer is correct — what seems like a safe default may mask a real data issue. Never use a silent catch-all default that hides unmatched cases.
Before making important analytical decisions, stop and ask the user. Explain the available options, trade-offs, and your recommendation. Interpret "important" broadly — when in doubt, ask.
Always ask about:
How to ask:
Do not silently proceed if validation suggests problems or if an important analytical choice hasn't been discussed with the user.
development
Phylogenetic tree visualization and formatting with ggtree (R) or iTOL (web). Use when rendering a phylogenetic tree as a figure, choosing tree layout, coloring branches or labels by taxonomy, collapsing clades, displaying support values, or adding overlays to a tree. Do NOT load for tree inference (use protein-phylogeny skill) or domain annotation (future separate skill).
development
Configure and manage Claude Code security protections for sensitive files, credentials, and data. Use when the user invokes /security-setup to set up or modify protections against unauthorized file access, credential exposure, or sensitive data leaks.
development
Script organization for data science analysis projects with numbered scripts, data/outs/ directories, and reproducibility conventions. Use when creating new analysis scripts in projects that follow data science conventions (numbered XX_ prefix scripts, outs/ directories, BUILD_INFO.txt). Do NOT load for documentation projects (Quarto books), infrastructure repos, or projects without data/outs/ directory structure.
testing
R renv package management for data science projects. Use when working with renv (renv.lock, renv::restore, renv::snapshot) in R analysis projects. Do NOT load for projects that do not use R or renv.