skills/script-organization/SKILL.md
Script organization for data science analysis projects with numbered scripts, data/outs/ directories, and reproducibility conventions. Use when creating new analysis scripts in projects that follow data science conventions (numbered XX_ prefix scripts, outs/ directories, BUILD_INFO.txt). Do NOT load for documentation projects (Quarto books), infrastructure repos, or projects without data/outs/ directory structure.
Conventions for script numbering, input/output tracking, directory structure, and build provenance. The goal is to make data flow between scripts self-documenting through directory structure and path references, without requiring separate manifest files or pipeline tools.
Numbered analysis scripts in scripts/ follow a different default depending on environment:
| Environment | Detected by | Default | Override |
|---|---|---|---|
| Local (macOS) | absence of /nfs/roberts/ | .qmd | # allow-py: <reason> comment in the first 20 lines of a numbered .py |
| Cluster (Bouchet) | presence of /nfs/roberts/ | .py | None needed — .py is the default |
The enforce-qmd-scripts.sh hook enforces this on local; it auto-skips on the cluster. Helpers in R/ or python/ and exploratory/scratch files are unaffected by the rule.
Why .py on the cluster: Quarto has NFS cleanup issues and requires extra Jupyter dependencies (nbformat, nbclient, ipykernel, pyyaml) that may not be in every conda env. .py scripts run anywhere with a Python interpreter and produce the same outputs (plots saved to files, BUILD_INFO.txt, summary stats printed to stdout).
When to override on local: Cases where .py is genuinely better suited, e.g., a cluster-bound pipeline being prototyped locally, a long-running headless job, or a script that will only ever be invoked from the command line. Always ask the user before applying the override marker, since .qmd is the strong local default. Helpers belong in python/ or R/ and don't need an override.
.qmd for locally rendered reports is always available regardless of environment — interactive exploration, publication figures with narrative, or when inline HTML output is valuable.
The override marker is a single comment line in the first 20 lines of the file:
#!/usr/bin/env python3
# allow-py: needs to run as a long SLURM job; .qmd Jupyter deps not in target env
"""..."""
The marker is informational — it documents the justification and silences the hook. Removing it later will cause the hook to block edits, which is intentional (forces re-justification).
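The environment rule and the override marker can be illustrated with a short Python sketch. This is not the logic of enforce-qmd-scripts.sh (a shell hook whose contents are not shown here); the function names `default_script_format` and `find_allow_py_reason` are hypothetical, and only the /nfs/roberts/ path, the `# allow-py:` prefix, and the 20-line window come from the text above.

```python
from pathlib import Path

CLUSTER_MARKER = Path("/nfs/roberts")  # presence of this path means "cluster"
MARKER_PREFIX = "# allow-py:"
MAX_MARKER_LINE = 20  # the marker must appear within the first 20 lines


def default_script_format(cluster_marker: Path = CLUSTER_MARKER) -> str:
    """Return ".py" when the cluster path exists, ".qmd" otherwise."""
    return ".py" if cluster_marker.exists() else ".qmd"


def find_allow_py_reason(source: str):
    """Return the justification after '# allow-py:' if found in the first 20 lines, else None."""
    for line in source.splitlines()[:MAX_MARKER_LINE]:
        stripped = line.strip()
        if stripped.startswith(MARKER_PREFIX):
            return stripped[len(MARKER_PREFIX):].strip()
    return None


example = (
    "#!/usr/bin/env python3\n"
    "# allow-py: needs to run as a long SLURM job\n"
    '"""..."""\n'
)
print(default_script_format())
print(find_allow_py_reason(example))
```

A marker that appears after line 20 is treated as absent, which matches the "first 20 lines" wording; how the real hook treats an empty reason is not specified in this document.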
For small projects with <10 scripts on a single topic:
project/
  R/                      # Shared R helpers
  python/                 # Shared Python helpers
  scripts/
    01_analysis.qmd
    02_plots.qmd
    exploratory/          # One-off analyses
  data/                   # External/immutable inputs only
  outs/
    01_analysis/          # Outputs from script 01
      mdata.rds
      01_analysis.html    # Rendered HTML
      BUILD_INFO.txt
    02_plots/
      volcano.pdf
      02_plots.html
      BUILD_INFO.txt
    exploratory/
For larger projects with multiple analytical threads:
project/
  R/                      # Shared R helpers (project-level)
  python/                 # Shared Python helpers (project-level)
  scripts/
    phosphoproteomics/
      01_analysis.qmd
      02_volcano_plots.qmd
    transcriptomics/
      01_heatmaps.qmd
    exploratory/
  data/
    gene_naming/          # Shared external data
    phosphoproteomics/    # Section-specific external data
    transcriptomics/
  outs/
    phosphoproteomics/
      01_analysis/
      02_volcano_plots/
    transcriptomics/
      01_heatmaps/
    exploratory/
  .claude/
    PHOSPHOPROTEOMICS_PLAN.md
    TRANSCRIPTOMICS_PLAN.md
Each section has its own script numbering (starting at 01_). Sections may have one or more planning documents in .claude/.
Projects that submit SLURM jobs on the HPC cluster add two directories:
project/
  batch/                  # SLURM batch scripts (.sh), tracked in git
  logs/                   # SLURM output files (slurm-*.out), tracked in git
  scripts/
  data/
  outs/
Both batch/ and logs/ are tracked in git: batch scripts are code, and logs are the
reproducibility record (they capture provenance, tool versions, and runtime diagnostics).
See the hpc skill for batch script conventions and job resource templates.
Script format on the cluster: .py is the default — see the "Script Format by Environment" section above for the full rule (and the local .qmd default + override marker).
.py + .sh Pairing Convention

Analysis logic and SLURM job configuration are always separate files:

- The .py script in scripts/<section>/ contains all analysis logic, reads inputs, writes outputs, and produces plots. It is self-contained: it can be run interactively, locally, or via SLURM. It uses PROJECT_ROOT from git, not hardcoded paths.
- The .sh batch script in batch/ is a thin SLURM wrapper that sets resource requests, activates the conda environment, and calls python scripts/<section>/XX_script.py. It contains no analysis logic.

This separation means:

- The .py script can be run directly (python scripts/annotation/05_mapping.py) for debugging, interactive development, or local execution.
- The .sh script is short and templated (see the hpc skill for the template).

Numbering: .py scripts are numbered per-section as usual (05a_, 05b_, etc.).
Batch scripts in batch/ use their own numbering sequence (e.g., 11a_, 11b_), since
batch/ is flat and shared across all sections. The batch script name should make the
connection clear (e.g., batch/11a_trinity_genome_mapping.sh calls
scripts/annotation/05a_trinity_genome_mapping.py).
Commit before execute: Always commit scripts before executing them — whether submitting
a batch job on the cluster (sbatch) or rendering a .qmd locally (quarto render). The
git hash recorded in BUILD_INFO.txt must reflect the code that actually ran. If the script
was modified but not committed, the hash points to stale code and the provenance record is
broken. See also the hpc skill for the cluster-specific convention.
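The commit-before-execute rule can be checked mechanically. The sketch below is one possible guard, not part of this skill's tooling: `dirty_paths` and `require_clean_script` are hypothetical names, and the check relies on `git status --porcelain`, whose version 1 output puts a two-character status and a space before each path.

```python
import subprocess


def dirty_paths(porcelain_output: str) -> list:
    """Extract paths from `git status --porcelain` output (two status chars, a space, then the path)."""
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]


def require_clean_script(script_path: str) -> None:
    """Exit with an error if the given script has uncommitted changes or is untracked."""
    output = subprocess.check_output(
        ["git", "status", "--porcelain", "--", script_path], text=True
    )
    changed = dirty_paths(output)
    if changed:
        raise SystemExit(f"Commit before executing; uncommitted changes in: {changed}")


# Example on captured output (no git repository needed for this part):
sample = " M scripts/01_analysis.qmd\n?? batch/11a_new_job.sh\n"
print(dirty_paths(sample))
```

Calling `require_clean_script("scripts/annotation/05_mapping.py")` before sbatch or quarto render would stop the run while the recorded hash could still point to stale code.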
When creating a new script in a sectioned project:
data/ vs outs/

| Folder | Contains | Written by |
|--------|----------|------------|
| data/ | External/immutable inputs: raw data, collaborator files, annotations, database exports | Nothing in this project — files arrive from outside |
| outs/<script_name>/ | All outputs produced by a script (data files, plots, rendered HTML, BUILD_INFO.txt) | That script only |
Rule: If your code produced it, it goes in outs/. If it came from anywhere else, it goes in data/. Scripts never write to data/.
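The "scripts never write to data/" rule can be enforced with a small path check. This is a minimal sketch under stated assumptions: `check_output_path` is a hypothetical helper, and `/tmp/project` stands in for the real project root (which the templates in this document obtain from git).

```python
from pathlib import Path


def check_output_path(path: Path, project_root: Path) -> Path:
    """Return the resolved path if it lies inside outs/, otherwise raise ValueError."""
    resolved = (project_root / path).resolve()
    outs_dir = (project_root / "outs").resolve()
    # Path.parents lists every ancestor directory of the resolved path
    if outs_dir not in resolved.parents:
        raise ValueError(f"Scripts may only write under outs/, got: {path}")
    return resolved


root = Path("/tmp/project")  # hypothetical project root for illustration
print(check_output_path(Path("outs/01_analysis/mdata.tsv"), root))
```

Resolving the path first means a sneaky target such as outs/../data/raw.tsv is rejected as well.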
Scripts are numbered per-section (01_, 02_, etc.) so ls shows them in a sensible order. Numbers are labels, not dependency order. Dependencies are encoded entirely by input paths within each script.
- Numbering starts at 01_ in each section

Use letter suffixes when a single topic requires multiple scripts. Common reasons:

- .qmd + plotting companion .R)
- .qmd (see Cross-Language rule below)

Rules for lettered scripts:

- Lettered scripts share one output directory, outs/XX_topic_name/, NOT outs/XXa_name/ or outs/XXb_name/. The output dir uses the number without a letter.
- The a script runs first. Letters imply execution order within the set.
- Example: 15a_wgcna_threshold.qmd, 15b_wgcna_modules.qmd, 15c_wgcna_plots.R all share outs/15_wgcna_platynereis/.
- Mixed file types (.R or .py alongside .qmd) are acceptable for lightweight tasks (plotting, utilities).

When to use a new number vs a letter:
Every analysis script declares its status:
- .qmd: status field in YAML frontmatter
- .py: Status: development line in the module docstring

| Status | Meaning | Location |
|--------|---------|----------|
| development | In active development, outputs are provisional | scripts/ |
| finalized | Outputs are publication-ready; modify only with deliberate re-validation | scripts/ |
| deprecated | Superseded; kept for reference | scripts/old/ or scripts/<section>/old/ |
When deprecating, note the replacement in the frontmatter (.qmd) or docstring (.py).
Planning documents remain the authoritative tracker of script status across the project.
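For .py scripts, the status declaration can be read back programmatically. The following is a sketch only: `script_status` is a hypothetical helper, it assumes the `Status:` line sits on its own line in the module docstring as in the template in this document, and it does not cover the YAML frontmatter of .qmd files.

```python
import ast
import re

VALID_STATUSES = {"development", "finalized", "deprecated"}


def script_status(source: str):
    """Return the status declared in a module docstring, or None if no Status line exists."""
    docstring = ast.get_docstring(ast.parse(source)) or ""
    match = re.search(r"^Status:\s*(\w+)\s*$", docstring, flags=re.MULTILINE)
    if match is None:
        return None
    status = match.group(1)
    if status not in VALID_STATUSES:
        raise ValueError(f"Unknown status: {status}")
    return status


example = (
    "#!/usr/bin/env python3\n"
    '"""Short description of what this script does.\n'
    "\n"
    "Output: outs/section/XX_script_name/\n"
    "Status: development\n"
    '"""\n'
    "import sys\n"
)
print(script_status(example))
```

Because planning documents remain the authoritative tracker, a check like this is best used to flag disagreements, not to replace the plan.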
scripts/exploratory/ (or scripts/<section>/exploratory/) is for one-off analyses, quick tests, and feasibility checks:
- outs/)

Dependencies between scripts are self-documenting through paths. Group all input reads at the top of each script (or in the setup chunk), with comments distinguishing external data from other scripts' outputs.
R example:
# --- Inputs (from other scripts) ---
mdata <- readRDS(here("outs/phosphoproteomics/01_analysis/mdata.rds"))
modules <- read_tsv(here("outs/phosphoproteomics/02_module_lists/modules.tsv"))
# --- Inputs (external data) ---
gene_names <- read_tsv(here("data/gene_naming/spongilla_gene_names_final.tsv"))
Python example:
# --- Inputs (from other scripts) ---
modules = pd.read_csv(PROJECT_ROOT / "outs/phosphoproteomics/02_module_lists/modules.tsv", sep="\t")
# --- Inputs (external data) ---
gene_names = pd.read_csv(PROJECT_ROOT / "data/gene_naming/spongilla_gene_names_final.tsv", sep="\t")
Reading the top of any script shows exactly what it depends on and which upstream scripts produced those files. No separate DAG documentation needed.
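Because dependencies live in path strings, they can also be listed with a simple text scan. This is an illustrative sketch, not an existing tool: `upstream_dirs` is a hypothetical name, and it assumes each input is written as one quoted string that starts with outs/ and ends in a file name, as in the examples above. Paths assembled piece by piece (for example `PROJECT_ROOT / "outs" / "section"`) are not detected.

```python
import re

# A quoted string beginning with outs/, in single or double quotes
OUTS_PATH = re.compile(r"""["'](outs/[^"']+)["']""")


def upstream_dirs(source: str) -> list:
    """Return the sorted outs/ folders referenced by quoted paths in a script's text."""
    folders = set()
    for path in OUTS_PATH.findall(source):
        # Drop the file name; the parent folder belongs to the producing script
        folders.add("/".join(path.split("/")[:-1]))
    return sorted(folders)


example = '''
mdata <- readRDS(here("outs/phosphoproteomics/01_analysis/mdata.rds"))
modules <- read_tsv(here("outs/phosphoproteomics/02_module_lists/modules.tsv"))
gene_names <- read_tsv(here("data/gene_naming/spongilla_gene_names_final.tsv"))
'''
for folder in upstream_dirs(example):
    print(folder)
```

The scan works on R and Python sources alike since it only looks at string literals; the data/ path is ignored because it is external input, not another script's output.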
When a script re-renders, it should archive existing outputs before writing new ones.
This prevents stale files from previous runs from lingering in outs/ and provides
a history of previous outputs.
Convention: At the start of each script (after creating out_dir), move all
existing files into out_dir/_archive/<timestamp>/. The timestamp comes from the
previous BUILD_INFO.txt mtime (reflecting when those outputs were actually produced),
falling back to the newest file mtime if BUILD_INFO.txt doesn't exist.
Python:
import shutil
from datetime import datetime

existing_items = [f for f in out_dir.iterdir() if f.name != "_archive"]
if existing_items:
    build_info = out_dir / "BUILD_INFO.txt"
    if build_info.exists():
        orig_time = datetime.fromtimestamp(build_info.stat().st_mtime)
    else:
        all_files = [f for f in out_dir.rglob("*") if f.is_file() and "_archive" not in str(f)]
        orig_time = datetime.fromtimestamp(max(f.stat().st_mtime for f in all_files)) if all_files else datetime.now()
    archive_dir = out_dir / "_archive" / orig_time.strftime("%Y-%m-%d_%H%M%S")
    archive_dir.mkdir(parents=True, exist_ok=True)
    for item in existing_items:
        shutil.move(str(item), str(archive_dir / item.name))
    print(f"Archived {len(existing_items)} items → {archive_dir.name}")
R:
existing_files <- list.files(out_dir, full.names = TRUE)
existing_files <- existing_files[!file.info(existing_files)$isdir]
if (length(existing_files) > 0) {
  build_info <- file.path(out_dir, "BUILD_INFO.txt")
  if (file.exists(build_info)) {
    orig_time <- file.info(build_info)$mtime
  } else {
    orig_time <- max(file.info(existing_files)$mtime)
  }
  archive_dir <- file.path(out_dir, "_archive", format(orig_time, "%Y-%m-%d_%H%M%S"))
  dir.create(archive_dir, recursive = TRUE, showWarnings = FALSE)
  file.rename(existing_files, file.path(archive_dir, basename(existing_files)))
  message("Archived ", length(existing_files), " previous outputs → ", basename(archive_dir))
}
Notes:
- _archive/ itself is never moved)
- The _archive/ directory accumulates over time; periodically clean old archives
- .qmd, .py, .R)

Every script captures the current git commit hash in its setup chunk and prints it into the rendered output. Six months later, you can check out that exact commit to see the state of all code at the time the output was produced.
Every script writes a BUILD_INFO.txt to its output folder as its last action:
script: 01_analysis.qmd
commit: a1b2c3d
date: 2026-02-14 15:30:00
slurm_job_id: 6380027
The slurm_job_id line is written only when the script runs via SLURM (i.e., $SLURM_JOB_ID
is set). This links the output folder to its log file (logs/slurm-*-<job_id>.out), which
is essential when reruns produce multiple log files. In Python:
import os

slurm_job_id = os.environ.get("SLURM_JOB_ID", "")
# ... in BUILD_INFO write block:
if slurm_job_id:
    f.write(f"slurm_job_id: {slurm_job_id}\n")
This answers: "When was this output folder last regenerated, from what code, and which log file has the details?" If downstream plots look wrong, check the upstream folder's BUILD_INFO.txt to see whether it was generated from current code or something stale.
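The staleness check described above can be sketched in a few lines. The helpers `parse_build_info` and `is_stale` are hypothetical, and the comparison assumes both hashes come from `git rev-parse --short HEAD`, so they have the same abbreviated length.

```python
def parse_build_info(text: str) -> dict:
    """Parse 'key: value' lines from BUILD_INFO.txt into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so the time in 'date:' stays intact
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def is_stale(info: dict, current_hash: str) -> bool:
    """True if the folder was built from a commit other than current_hash."""
    return info.get("commit") != current_hash


example = (
    "script: 01_analysis.qmd\n"
    "commit: a1b2c3d\n"
    "date: 2026-02-14 15:30:00\n"
    "slurm_job_id: 6380027\n"
)
info = parse_build_info(example)
print(info["commit"], info["date"], info.get("slurm_job_id"))
```

A downstream script could call `is_stale` on each upstream folder before reading from it and print a warning. Whether a stale input should stop the run is a project decision this document does not specify.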
BUILD_INFO.txt lives in outs/ and is not tracked by git (since outs/ is in .gitignore).
Rendered .html output goes into outs/<script_name>/ alongside data outputs, keeping scripts/ clean.
See the quarto-docs skill for complete QMD templates with git hash and BUILD_INFO.txt chunks.
.py Analysis Script Template

For cluster projects, .py is the default analysis script format. The template carries
over the same reproducibility features as .qmd (git hash, BUILD_INFO.txt, structured
inputs, archive-before-overwrite) without requiring Quarto.
#!/usr/bin/env python3
"""Short description of what this script does.
Input: data/... (external), outs/.../file.tsv (from script XX)
Output: outs/section/XX_script_name/
Status: development
"""
import os
import subprocess
import sys
from datetime import datetime
from pathlib import Path
import matplotlib
matplotlib.use("Agg") # headless — saves to files, no display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# ── Setup ─────────────────────────────────────────────────────────────────────
PROJECT_ROOT = Path(
    subprocess.check_output(["git", "rev-parse", "--show-toplevel"], text=True).strip()
)
sys.path.insert(0, str(PROJECT_ROOT / "python"))
GIT_HASH = subprocess.check_output(
    ["git", "rev-parse", "--short", "HEAD"], text=True
).strip()
print(f"Git hash: {GIT_HASH}")
OUT_DIR = PROJECT_ROOT / "outs" / "section" / "XX_script_name"
OUT_DIR.mkdir(parents=True, exist_ok=True)
# ── Archive previous outputs ─────────────────────────────────────────────────
import shutil

existing_items = [f for f in OUT_DIR.iterdir() if f.name != "_archive"]
if existing_items:
    build_info = OUT_DIR / "BUILD_INFO.txt"
    if build_info.exists():
        orig_time = datetime.fromtimestamp(build_info.stat().st_mtime)
    else:
        all_files = [
            f for f in OUT_DIR.rglob("*")
            if f.is_file() and "_archive" not in str(f)
        ]
        orig_time = (
            datetime.fromtimestamp(max(f.stat().st_mtime for f in all_files))
            if all_files else datetime.now()
        )
    archive_dir = OUT_DIR / "_archive" / orig_time.strftime("%Y-%m-%d_%H%M%S")
    archive_dir.mkdir(parents=True, exist_ok=True)
    for item in existing_items:
        shutil.move(str(item), str(archive_dir / item.name))
    print(f"Archived {len(existing_items)} items -> {archive_dir.name}")
# ── Inputs ────────────────────────────────────────────────────────────────────
# --- Inputs (from other scripts) ---
# upstream = pd.read_csv(PROJECT_ROOT / "outs/.../file.tsv", sep="\t")
# --- Inputs (external data) ---
# raw_data = pd.read_csv(PROJECT_ROOT / "data/.../file.tsv", sep="\t")
# ── Analysis step 1 ───────────────────────────────────────────────────────────
#
# Describe WHAT this step does and WHY — the analytical reasoning, not just
# code mechanics. What question does this step answer? What should the reader
# look for in the output? This replaces the markdown narrative from .qmd files.
#
# Each major section should have a block comment like this. Not every line
# needs a comment, but every analytical step needs context. Also annotate:
# - Critical lines (thresholds, assumptions, non-obvious logic)
# - Tricky or surprising code that would confuse a reader
# ... analysis code, plots saved to OUT_DIR ...
# ── BUILD_INFO ────────────────────────────────────────────────────────────────
slurm_job_id = os.environ.get("SLURM_JOB_ID", "")
with open(OUT_DIR / "BUILD_INFO.txt", "w") as f:
    f.write(f"script: scripts/section/XX_script_name.py\n")
    f.write(f"commit: {GIT_HASH}\n")
    f.write(f"date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    if slurm_job_id:
        f.write(f"slurm_job_id: {slurm_job_id}\n")
print("BUILD_INFO.txt written")
Key features:
- matplotlib.use("Agg") for headless rendering (no display server needed)
- All writes go to outs/, all reads from data/ or upstream outs/
- PROJECT_ROOT from git, not hardcoded paths
- python script.py | tee log.txt)

Shared helper functions live in R/ and python/ at the project root:
# R scripts load helpers with:
source(here("R/gene_name_helpers.R"))
# Python scripts load helpers with:
import sys
sys.path.insert(0, str(PROJECT_ROOT / "python"))
from gene_name_helpers import normalize_name
Rules:
- make_gene_short, not make_gene_short_v2). Fix functions in place; git tracks the history. If a function's interface genuinely changes (different inputs/outputs/purpose), give it a descriptive name reflecting what it does, not when it was written.

Functions shared across multiple projects live in ~/lib/R/ and ~/lib/python/:
source("~/lib/R/plotting_helpers.R")
When a project-level function proves useful across 2+ projects, promote it to ~/lib/. The ~/lib/ directory should be a git repository for version tracking.
Future: When distributing functions to collaborators, graduate shared functions into installable R/Python packages.
When data produced by an R script will be read by a Python script (or vice versa), use Parquet:
- R: arrow::write_parquet() / arrow::read_parquet()
- Python: DataFrame.to_parquet() / pd.read_parquet()

Avoid .rds (R-only) or .pkl (Python-only) for data that crosses the language boundary. Within a single language, native formats (.rds for R) are fine.
Prefer single-language .qmd files. When a script needs both R and Python, split into lettered scripts (e.g., XXa_ in Python, XXb_ in R) that communicate through files in outs/XX_topic/.
Exception: A single mixed-language .qmd is acceptable when both languages operate on the same data in a tight pipeline (e.g., Python reads h5ad → saves TSV → R builds a tree in the next chunk). In this case, data passes via files on disk, not shared memory — do not rely on reticulate object passing.