musserlab/script-organization

Name: script-organization
Author: musserlab

skills/script-organization/SKILL.md

npx skillsauth add musserlab/lab-claude-skills script-organization

| Scanner | Purpose | Status |
|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean |
| Semgrep | Static code analysis for vulnerabilities | Clean |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean |
| Snyk (dep) | Open source security scanning | Skipped |
| Socket.dev | Supply chain security analysis | Skipped |
| VirusTotal | Multi-engine malware detection | Skipped |
| CrowdStrike | Advanced threat intelligence | Skipped |
| OSV-Scanner | Open Source Vulnerability database check | Skipped |
| OWASP Dep-Check | | Skipped |

Script Organization and Reproducibility

Conventions for script numbering, input/output tracking, directory structure, and build provenance. The goal is to make data flow between scripts self-documenting through directory structure and path references, without requiring separate manifest files or pipeline tools.

Script Format by Environment

Numbered analysis scripts in scripts/ follow a different default depending on environment:

| Environment | Detected by | Default | Override |
|---|---|---|---|
| Local (macOS) | absence of /nfs/roberts/ | .qmd | # allow-py: <reason> comment in the first 20 lines of a numbered .py |
| Cluster (Bouchet) | presence of /nfs/roberts/ | .py | None needed; .py is the default |

The enforce-qmd-scripts.sh hook enforces this locally; it auto-skips on the cluster. Helpers in R/ and python/, as well as exploratory or scratch files, are unaffected by the rule.
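
The detection rule in the table above can be sketched in a few lines of Python. This is only an illustration of the rule, not the hook itself (which is a shell script); the function name `default_script_format` and the `marker` parameter are assumptions introduced here.

```python
from pathlib import Path

# Assumption: the cluster is recognised purely by this directory existing,
# as the "Detected by" column describes.
CLUSTER_MARKER = Path("/nfs/roberts")


def default_script_format(marker: Path = CLUSTER_MARKER) -> str:
    """Return the default extension for numbered analysis scripts."""
    return ".py" if marker.is_dir() else ".qmd"


print(default_script_format())
```

Passing the marker as a parameter keeps the rule testable on machines where /nfs/roberts/ does not exist.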

Why .py on the cluster: Quarto has NFS cleanup issues and requires extra Jupyter dependencies (nbformat, nbclient, ipykernel, pyyaml) that may not be in every conda env. .py scripts run anywhere with a Python interpreter and produce the same outputs (plots saved to files, BUILD_INFO.txt, summary stats printed to stdout).

When to override on local: Cases where .py is genuinely better suited, e.g., a cluster-bound pipeline being prototyped locally, a long-running headless job, or a script that will only ever be invoked from the command line. Always ask the user before applying the override marker, since .qmd is the strong local default. Helpers belong in python/ or R/ and don't need an override.

.qmd for locally rendered reports is always available regardless of environment — interactive exploration, publication figures with narrative, or when inline HTML output is valuable.

The override marker is a single comment line in the first 20 lines of the file:

#!/usr/bin/env python3
# allow-py: needs to run as a long SLURM job; .qmd Jupyter deps not in target env
"""..."""

The marker is informational — it documents the justification and silences the hook. Removing it later will cause the hook to block edits, which is intentional (forces re-justification).
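
To make the "first 20 lines" rule concrete, here is a minimal Python sketch of a marker check. The real hook's matching rules are not shown in this document, so the regular expression (a comment line starting with `# allow-py:` followed by a non-empty reason) and the helper name `has_allow_py_marker` are assumptions.

```python
import re

# Assumption: the marker must start the line and carry a non-empty reason.
MARKER = re.compile(r"^#\s*allow-py:\s*\S")


def has_allow_py_marker(text: str, max_lines: int = 20) -> bool:
    """Return True if an allow-py marker appears in the first max_lines lines."""
    return any(MARKER.match(line) for line in text.splitlines()[:max_lines])


example = '#!/usr/bin/env python3\n# allow-py: long SLURM job\n"""..."""\n'
print(has_allow_py_marker(example))
```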

Directory Structure

Flat Layout

For small projects with <10 scripts on a single topic:

project/
  R/                          # Shared R helpers
  python/                     # Shared Python helpers
  scripts/
    01_analysis.qmd
    02_plots.qmd
    exploratory/              # One-off analyses
  data/                       # External/immutable inputs only
  outs/
    01_analysis/              # Outputs from script 01
      mdata.rds
      01_analysis.html        # Rendered HTML
      BUILD_INFO.txt
    02_plots/
      volcano.pdf
      02_plots.html
      BUILD_INFO.txt
    exploratory/

Sectioned Layout

For larger projects with multiple analytical threads:

project/
  R/                          # Shared R helpers (project-level)
  python/                     # Shared Python helpers (project-level)
  scripts/
    phosphoproteomics/
      01_analysis.qmd
      02_volcano_plots.qmd
    transcriptomics/
      01_heatmaps.qmd
    exploratory/
  data/
    gene_naming/              # Shared external data
    phosphoproteomics/        # Section-specific external data
    transcriptomics/
  outs/
    phosphoproteomics/
      01_analysis/
      02_volcano_plots/
    transcriptomics/
      01_heatmaps/
    exploratory/
  .claude/
    PHOSPHOPROTEOMICS_PLAN.md
    TRANSCRIPTOMICS_PLAN.md

Each section has its own script numbering (starting at 01_). Sections may have one or more planning documents in .claude/.

Cluster Projects

Projects that submit SLURM jobs on the HPC cluster add two directories:

project/
  batch/                        # SLURM batch scripts (.sh) — tracked in git
  logs/                         # SLURM output files (slurm-*.out) — tracked in git
  scripts/
  data/
  outs/

Both batch/ and logs/ are tracked in git: batch scripts are code, and logs are the reproducibility record (they capture provenance, tool versions, and runtime diagnostics). See the hpc skill for batch script conventions and job resource templates.

Script format on the cluster: .py is the default — see the "Script Format by Environment" section above for the full rule (and the local .qmd default + override marker).

The `.py` + `.sh` Pairing Convention

Analysis logic and SLURM job configuration are always separate files:

.py script in scripts/<section>/ — contains all analysis logic, reads inputs, writes outputs, produces plots. Self-contained: can be run interactively, locally, or via SLURM. Uses PROJECT_ROOT from git, not hardcoded paths.
.sh batch script in batch/ — thin SLURM wrapper that sets resource requests, activates the conda environment, and calls python scripts/<section>/XX_script.py. Contains no analysis logic.

This separation means:

The .py script can be run directly (python scripts/annotation/05_mapping.py) for debugging, interactive development, or local execution
SLURM resources can be adjusted without touching analysis code
The .sh script is short and templated (see hpc skill for the template)

Numbering: .py scripts are numbered per-section as usual (05a_, 05b_, etc.). Batch scripts in batch/ use their own numbering sequence (e.g., 11a_, 11b_), since batch/ is flat and shared across all sections. The batch script name should make the connection clear (e.g., batch/11a_trinity_genome_mapping.sh calls scripts/annotation/05a_trinity_genome_mapping.py).

Commit before execute: Always commit scripts before executing them — whether submitting a batch job on the cluster (sbatch) or rendering a .qmd locally (quarto render). The git hash recorded in BUILD_INFO.txt must reflect the code that actually ran. If the script was modified but not committed, the hash points to stale code and the provenance record is broken. See also the hpc skill for the cluster-specific convention.
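
One way to guard the commit-before-execute rule is to ask git whether the script differs from HEAD before running it. The sketch below is an assumption about how such a guard could look, not part of the existing conventions; `assert_committed` and `has_uncommitted_changes` are hypothetical names. It relies only on `git status --porcelain`, which prints nothing for a clean path.

```python
import subprocess
from pathlib import Path


def has_uncommitted_changes(porcelain_output: str) -> bool:
    """True if `git status --porcelain` reported any change."""
    return bool(porcelain_output.strip())


def assert_committed(script: Path) -> None:
    """Raise if the given script is modified or untracked relative to HEAD."""
    out = subprocess.check_output(
        ["git", "status", "--porcelain", "--", str(script)], text=True
    )
    if has_uncommitted_changes(out):
        raise RuntimeError(f"Commit {script} before running it:\n{out}")
```

Note that this checks only the script itself; uncommitted changes in shared helpers under R/ or python/ would need the same check on those paths.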

Choosing a Subdirectory

When creating a new script in a sectioned project:

Check the project's CLAUDE.md for a Script Subdirectories table listing each subdirectory and its scope
If the task clearly fits one subdirectory, use it
If ambiguous, ask the user which subdirectory to use before creating the script
If none of the existing subdirectories fit, propose creating a new one

When to Use Each Layout

Flat: Single topic, small scope, fewer than ~10 scripts
Sectioned: Multiple distinct analytical threads, especially when different people work on different sections

`data/` vs `outs/`

| Folder | Contains | Written by |
|--------|----------|------------|
| data/ | External/immutable inputs: raw data, collaborator files, annotations, database exports | Nothing in this project; files arrive from outside |
| outs/<script_name>/ | All outputs produced by a script (data files, plots, rendered HTML, BUILD_INFO.txt) | That script only |

Rule: If your code produced it, it goes in outs/. If it came from anywhere else, it goes in data/. Scripts never write to data/.
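
The "scripts never write to data/" rule can be enforced with a small path check before any write. The following is a sketch under the assumption that every output path is resolved against the project root; `check_output_path` is a hypothetical helper, not something the conventions require.

```python
from pathlib import Path


def check_output_path(path: Path, project_root: Path) -> Path:
    """Return the resolved path if it lies under outs/, else raise ValueError."""
    resolved = (project_root / path).resolve()
    outs = (project_root / "outs").resolve()
    if resolved != outs and outs not in resolved.parents:
        raise ValueError(f"Scripts may only write under {outs}, got {resolved}")
    return resolved
```

Because the path is resolved first, something like outs/../data/file.tsv is rejected as well.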

Script Numbering

Scripts are numbered per-section (01_, 02_, etc.) so ls shows them in a sensible order. Numbers are labels, not dependency order. Dependencies are encoded entirely by input paths within each script.

Assign the next available number when adding a script
Never renumber existing scripts when one is archived or deleted
In sectioned projects, numbering restarts at 01_ in each section
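
The numbering rules above can be sketched as a helper that proposes the next prefix for a section. It assumes that "next available" means one above the highest number in use, so gaps left by archived or deleted scripts are never refilled; that reading, and the name `next_script_number`, are assumptions. To keep retired numbers reserved, pass the file names from old/ as well.

```python
import re

# Two or more digits, an optional letter suffix, then an underscore.
NUMBER_PREFIX = re.compile(r"^(\d{2,})[a-z]?_")


def next_script_number(script_names: list[str]) -> str:
    """Return the next two-digit prefix, one above the highest number in use."""
    used = [
        int(m.group(1))
        for name in script_names
        if (m := NUMBER_PREFIX.match(name))
    ]
    return f"{max(used, default=0) + 1:02d}"


print(next_script_number(["01_analysis.qmd", "03_plots.qmd", "exploratory"]))
```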

Letter Suffixes (a, b, c)

Use letter suffixes when a single topic requires multiple scripts. Common reasons:

User review needed between steps (e.g., threshold selection → module detection)
Different output types (e.g., main analysis .qmd + plotting companion .R)
Language split when R and Python steps cannot share a .qmd (see Cross-Language rule below)

Rules for lettered scripts:

Same topic, same number. A new topic gets a new number, not a new letter.
Shared output directory. All scripts in a lettered set write to outs/XX_topic_name/ — NOT outs/XXa_name/, outs/XXb_name/. The output dir uses the number without a letter.
The a script runs first. Letters imply execution order within the set.
Name the set consistently. 15a_wgcna_threshold.qmd, 15b_wgcna_modules.qmd, 15c_wgcna_plots.R — all share outs/15_wgcna_platynereis/.
Companion scripts (.R or .py alongside .qmd) are acceptable for lightweight tasks (plotting, utilities).

When to use a new number vs a letter:

New number: different analytical question, different input data, different topic
Letter suffix: same topic split across steps, same conceptual analysis
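
The shared-output-directory rule for lettered scripts can be illustrated with a helper that strips the letter and combines the number with a topic name. The topic cannot be derived from the script name (15a_wgcna_threshold.qmd writes to outs/15_wgcna_platynereis/), so it is passed explicitly; `shared_out_dir` is a hypothetical name used only for this sketch.

```python
import re
from pathlib import Path

SCRIPT_NAME = re.compile(r"^(?P<number>\d+)(?P<letter>[a-z]?)_(?P<rest>.+)$")


def shared_out_dir(script: str, topic: str, outs: Path = Path("outs")) -> Path:
    """Map a (possibly lettered) script name to its set's output directory."""
    m = SCRIPT_NAME.match(Path(script).stem)
    if m is None:
        raise ValueError(f"{script!r} has no numeric prefix")
    # The letter is deliberately dropped: 15a, 15b and 15c all share outs/15_<topic>/.
    return outs / f"{m['number']}_{topic}"


print(shared_out_dir("15a_wgcna_threshold.qmd", "wgcna_platynereis"))
```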

Script Lifecycle

Every analysis script declares its status:

.qmd — status field in YAML frontmatter
.py — Status: development line in the module docstring

| Status | Meaning | Location |
|--------|---------|----------|
| development | In active development, outputs are provisional | scripts/ |
| finalized | Outputs are publication-ready; modify only with deliberate re-validation | scripts/ |
| deprecated | Superseded; kept for reference | scripts/old/ or scripts/<section>/old/ |

When deprecating, note the replacement in the frontmatter (.qmd) or docstring (.py).

Planning documents remain the authoritative tracker of script status across the project.
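
For .py scripts, the status declaration can be read back programmatically from the module docstring. This is a minimal sketch assuming the `Status: <value>` line sits on its own line, as in the template; the function name `script_status` is hypothetical.

```python
import ast
import re
from typing import Optional

VALID_STATUSES = {"development", "finalized", "deprecated"}
STATUS_LINE = re.compile(r"^Status:\s*(\w+)\s*$", re.MULTILINE)


def script_status(source: str) -> Optional[str]:
    """Return the status declared in a .py module docstring, or None if absent."""
    docstring = ast.get_docstring(ast.parse(source)) or ""
    m = STATUS_LINE.search(docstring)
    if m is None:
        return None
    status = m.group(1)
    if status not in VALID_STATUSES:
        raise ValueError(f"Unknown status {status!r}")
    return status


source = '#!/usr/bin/env python3\n"""Short description.\n\nStatus: development\n"""\n'
print(script_status(source))
```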

Exploratory Directory

scripts/exploratory/ (or scripts/<section>/exploratory/) is for one-off analyses, quick tests, and feasibility checks:

No number prefixes or BUILD_INFO.txt required
Other scripts must never depend on exploratory outputs (one-way dependency: exploratory scripts can read from any section's outs/)
Can be cleaned out periodically without breaking anything
No planning document needed
Good candidates for promotion: if an exploratory script proves useful, promote it to a numbered script in the appropriate section

Input/Output Tracking

Dependencies between scripts are self-documenting through paths. Group all input reads at the top of each script (or in the setup chunk), with comments distinguishing external data from other scripts' outputs.

R example:

# --- Inputs (from other scripts) ---
mdata <- readRDS(here("outs/phosphoproteomics/01_analysis/mdata.rds"))
modules <- read_tsv(here("outs/phosphoproteomics/02_module_lists/modules.tsv"))

# --- Inputs (external data) ---
gene_names <- read_tsv(here("data/gene_naming/spongilla_gene_names_final.tsv"))

Python example:

# --- Inputs (from other scripts) ---
modules = pd.read_csv(PROJECT_ROOT / "outs/phosphoproteomics/02_module_lists/modules.tsv", sep="\t")

# --- Inputs (external data) ---
gene_names = pd.read_csv(PROJECT_ROOT / "data/gene_naming/spongilla_gene_names_final.tsv", sep="\t")

Reading the top of any script shows exactly what it depends on and which upstream scripts produced those files. No separate DAG documentation needed.
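
Because dependencies are plain path strings, they can also be listed mechanically. The sketch below scans a script's text for quoted paths starting with outs/ or data/ and flags any that point into an exploratory directory. It is a rough illustration with known limits: it cannot tell reads from writes, and it misses paths assembled piece by piece (e.g. PROJECT_ROOT / "outs" / "section"). Both function names are hypothetical.

```python
import re

# Quoted string literals that begin with outs/ or data/.
PATH_LITERAL = re.compile(r"""["']((?:outs|data)/[^"']+)["']""")


def declared_inputs(source: str) -> dict[str, list[str]]:
    """Split quoted project paths into upstream outputs and external data."""
    paths = PATH_LITERAL.findall(source)
    return {
        "upstream": [p for p in paths if p.startswith("outs/")],
        "external": [p for p in paths if p.startswith("data/")],
    }


def exploratory_dependencies(source: str) -> list[str]:
    """Return upstream paths that point into an exploratory directory."""
    return [
        p for p in declared_inputs(source)["upstream"]
        if "exploratory" in p.split("/")
    ]
```

The same scan works on R and Python sources, since both examples above use quoted relative paths.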

Provenance

Archive Before Overwrite

When a script re-renders, it should archive existing outputs before writing new ones. This prevents stale files from previous runs from lingering in outs/ and provides a history of previous outputs.

Convention: At the start of each script (after creating out_dir), move all existing files into out_dir/_archive/<timestamp>/. The timestamp comes from the previous BUILD_INFO.txt mtime (reflecting when those outputs were actually produced), falling back to the newest file mtime if BUILD_INFO.txt doesn't exist.

Python:

import shutil
from datetime import datetime

existing_items = [f for f in out_dir.iterdir() if f.name != "_archive"]
if existing_items:
    build_info = out_dir / "BUILD_INFO.txt"
    if build_info.exists():
        orig_time = datetime.fromtimestamp(build_info.stat().st_mtime)
    else:
        all_files = [f for f in out_dir.rglob("*") if f.is_file() and "_archive" not in str(f)]
        orig_time = datetime.fromtimestamp(max(f.stat().st_mtime for f in all_files)) if all_files else datetime.now()

    archive_dir = out_dir / "_archive" / orig_time.strftime("%Y-%m-%d_%H%M%S")
    archive_dir.mkdir(parents=True, exist_ok=True)
    for item in existing_items:
        shutil.move(str(item), str(archive_dir / item.name))
    print(f"Archived {len(existing_items)} items → {archive_dir.name}")

R:

existing_files <- list.files(out_dir, full.names = TRUE)
existing_files <- existing_files[!file.info(existing_files)$isdir]
if (length(existing_files) > 0) {
  build_info <- file.path(out_dir, "BUILD_INFO.txt")
  if (file.exists(build_info)) {
    orig_time <- file.info(build_info)$mtime
  } else {
    orig_time <- max(file.info(existing_files)$mtime)
  }
  archive_dir <- file.path(out_dir, "_archive", format(orig_time, "%Y-%m-%d_%H%M%S"))
  dir.create(archive_dir, recursive = TRUE, showWarnings = FALSE)
  file.rename(existing_files, file.path(archive_dir, basename(existing_files)))
  message("Archived ", length(existing_files), " previous outputs → ", basename(archive_dir))
}

Notes:

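The R version archives only top-level files, not subdirectories, so _archive/ itself is never moved; the Python version moves every top-level item except _archive/, including subdirectories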
The _archive/ directory accumulates over time; periodically clean old archives
This pattern applies to all script types (.qmd, .py, .R)

Git Hash

Every script captures the current git commit hash in its setup chunk and prints it into the rendered output. Six months later, you can check out that exact commit to see the state of all code at the time the output was produced.

BUILD_INFO.txt

Every script writes a BUILD_INFO.txt to its output folder as its last action:

script: 01_analysis.qmd
commit: a1b2c3d
date: 2026-02-14 15:30:00
slurm_job_id: 6380027

The slurm_job_id line is written only when the script runs via SLURM (i.e., $SLURM_JOB_ID is set). This links the output folder to its log file (logs/slurm-*-<job_id>.out), which is essential when reruns produce multiple log files. In Python:

import os

slurm_job_id = os.environ.get("SLURM_JOB_ID", "")
# ... in BUILD_INFO write block:
if slurm_job_id:
    f.write(f"slurm_job_id: {slurm_job_id}\n")

This answers: "When was this output folder last regenerated, from what code, and which log file has the details?" If downstream plots look wrong, check the upstream folder's BUILD_INFO.txt to see whether it was generated from current code or something stale.
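
Checking an upstream folder's provenance can be scripted by parsing its BUILD_INFO.txt. The sketch below assumes the simple `key: value` layout shown above and the log naming pattern logs/slurm-*-<job_id>.out; `parse_build_info` and `log_glob` are hypothetical helper names.

```python
from typing import Optional


def parse_build_info(text: str) -> dict[str, str]:
    """Parse `key: value` lines; only the first colon on a line separates them."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def log_glob(info: dict[str, str]) -> Optional[str]:
    """Return a glob for the matching SLURM log, or None for non-SLURM runs."""
    job_id = info.get("slurm_job_id")
    return f"logs/slurm-*-{job_id}.out" if job_id else None


example = (
    "script: 01_analysis.qmd\n"
    "commit: a1b2c3d\n"
    "date: 2026-02-14 15:30:00\n"
    "slurm_job_id: 6380027\n"
)
print(parse_build_info(example))
```

Splitting on the first colon only keeps the time portion of the date line intact.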

BUILD_INFO.txt lives in outs/ and is not tracked by git (since outs/ is in .gitignore).

Rendered HTML

Rendered .html output goes into outs/<script_name>/ alongside data outputs, keeping scripts/ clean.

See the quarto-docs skill for complete QMD templates with git hash and BUILD_INFO.txt chunks.

`.py` Analysis Script Template

For cluster projects, .py is the default analysis script format. The template carries over the same reproducibility features as .qmd (git hash, BUILD_INFO.txt, structured inputs, archive-before-overwrite) without requiring Quarto.

#!/usr/bin/env python3
"""Short description of what this script does.

Input:  data/... (external), outs/.../file.tsv (from script XX)
Output: outs/section/XX_script_name/

Status: development
"""

import os  # needed for os.environ in the BUILD_INFO block below
import subprocess
import sys
from datetime import datetime
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless — saves to files, no display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# ── Setup ─────────────────────────────────────────────────────────────────────

PROJECT_ROOT = Path(
    subprocess.check_output(["git", "rev-parse", "--show-toplevel"], text=True).strip()
)
sys.path.insert(0, str(PROJECT_ROOT / "python"))

GIT_HASH = subprocess.check_output(
    ["git", "rev-parse", "--short", "HEAD"], text=True
).strip()
print(f"Git hash: {GIT_HASH}")

OUT_DIR = PROJECT_ROOT / "outs" / "section" / "XX_script_name"
OUT_DIR.mkdir(parents=True, exist_ok=True)

# ── Archive previous outputs ─────────────────────────────────────────────────

import shutil

existing_items = [f for f in OUT_DIR.iterdir() if f.name != "_archive"]
if existing_items:
    build_info = OUT_DIR / "BUILD_INFO.txt"
    if build_info.exists():
        orig_time = datetime.fromtimestamp(build_info.stat().st_mtime)
    else:
        all_files = [
            f for f in OUT_DIR.rglob("*")
            if f.is_file() and "_archive" not in str(f)
        ]
        orig_time = (
            datetime.fromtimestamp(max(f.stat().st_mtime for f in all_files))
            if all_files else datetime.now()
        )
    archive_dir = OUT_DIR / "_archive" / orig_time.strftime("%Y-%m-%d_%H%M%S")
    archive_dir.mkdir(parents=True, exist_ok=True)
    for item in existing_items:
        shutil.move(str(item), str(archive_dir / item.name))
    print(f"Archived {len(existing_items)} items -> {archive_dir.name}")

# ── Inputs ────────────────────────────────────────────────────────────────────

# --- Inputs (from other scripts) ---
# upstream = pd.read_csv(PROJECT_ROOT / "outs/.../file.tsv", sep="\t")

# --- Inputs (external data) ---
# raw_data = pd.read_csv(PROJECT_ROOT / "data/.../file.tsv", sep="\t")

# ── Analysis step 1 ───────────────────────────────────────────────────────────
#
# Describe WHAT this step does and WHY — the analytical reasoning, not just
# code mechanics. What question does this step answer? What should the reader
# look for in the output? This replaces the markdown narrative from .qmd files.
#
# Each major section should have a block comment like this. Not every line
# needs a comment, but every analytical step needs context. Also annotate:
# - Critical lines (thresholds, assumptions, non-obvious logic)
# - Tricky or surprising code that would confuse a reader

# ... analysis code, plots saved to OUT_DIR ...

# ── BUILD_INFO ────────────────────────────────────────────────────────────────

slurm_job_id = os.environ.get("SLURM_JOB_ID", "")
with open(OUT_DIR / "BUILD_INFO.txt", "w") as f:
    f.write(f"script: scripts/section/XX_script_name.py\n")
    f.write(f"commit: {GIT_HASH}\n")
    f.write(f"date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    if slurm_job_id:
        f.write(f"slurm_job_id: {slurm_job_id}\n")
print("BUILD_INFO.txt written")

Key features:

matplotlib.use("Agg") for headless rendering (no display server needed)
Git hash captured at start, printed to stdout, written to BUILD_INFO.txt
Archive-before-overwrite preserves previous outputs
Docstring with status, inputs, outputs serves as the script's documentation
All output goes to outs/, all reads from data/ or upstream outs/
PROJECT_ROOT from git, not hardcoded paths
Stdout serves as the execution log (capture it with python script.py | tee log.txt)

Helper Functions

Project-Level

Shared helper functions live in R/ and python/ at the project root:

# R scripts load helpers with:
source(here("R/gene_name_helpers.R"))

# Python scripts load helpers with:
import sys
sys.path.insert(0, str(PROJECT_ROOT / "python"))
from gene_name_helpers import normalize_name

Rules:

Do not version function names (make_gene_short, not make_gene_short_v2). Fix functions in place; git tracks the history. If a function's interface genuinely changes (different inputs/outputs/purpose), give it a descriptive name reflecting what it does, not when it was written.
Do not duplicate the same function in both R and Python within a project. Each function lives in one language. If a script in the other language needs that logic, rewrite it once in the new language and retire the old one.

Cross-Project

Functions shared across multiple projects live in ~/lib/R/ and ~/lib/python/:

source("~/lib/R/plotting_helpers.R")

When a project-level function proves useful across 2+ projects, promote it to ~/lib/. The ~/lib/ directory should be a git repository for version tracking.

Future: When distributing functions to collaborators, graduate shared functions into installable R/Python packages.

Cross-Language Data Interchange

When data produced by an R script will be read by a Python script (or vice versa), use Parquet:

Smaller than TSV, preserves column types, fast in both languages
R: arrow::write_parquet() / arrow::read_parquet()
Python: df.to_parquet() (a DataFrame method) / pd.read_parquet()

Avoid .rds (R-only) or .pkl (Python-only) for data that crosses the language boundary. Within a single language, native formats (.rds for R) are fine.

Cross-Language Scripts

Prefer single-language .qmd files. When a script needs both R and Python, split into lettered scripts (e.g., XXa_ in Python, XXb_ in R) that communicate through files in outs/XX_topic/.

Exception: A single mixed-language .qmd is acceptable when both languages operate on the same data in a tight pipeline (e.g., Python reads h5ad → saves TSV → R builds a tree in the next chunk). In this case, data passes via files on disk, not shared memory — do not rely on reticulate object passing.

Script organization for data science analysis projects with numbered scripts, data/outs/ directories, and reproducibility conventions. Use when creating new analysis scripts in projects that follow data science conventions (numbered XX_ prefix scripts, outs/ directories, BUILD_INFO.txt). Do NOT load for documentation projects (Quarto books), infrastructure repos, or projects without data/outs/ directory structure.

1 star

development

Updated May 9, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Last scanned: May 9, 2026, 7:32 AM (157.0 s, 1 file scanned)

SKILL.md

name: script-organization
description: >
user-invocable: false

Script Organization and Reproducibility

Script Format by Environment

Numbered analysis scripts in scripts/ follow a different default depending on environment:

The enforce-qmd-scripts.sh hook enforces this on local; it auto-skips on the cluster. Helpers in R//python/ and exploratory/scratch files are unaffected by the rule.

.qmd for locally rendered reports is always available regardless of environment — interactive exploration, publication figures with narrative, or when inline HTML output is valuable.

The override marker is a single comment line in the first 20 lines of the file:

#!/usr/bin/env python3
# allow-py: needs to run as a long SLURM job; .qmd Jupyter deps not in target env
"""..."""

The marker is informational — it documents the justification and silences the hook. Removing it later will cause the hook to block edits, which is intentional (forces re-justification).

Directory Structure

Flat Layout

For small projects with <10 scripts on a single topic:

project/
  R/                          # Shared R helpers
  python/                     # Shared Python helpers
  scripts/
    01_analysis.qmd
    02_plots.qmd
    exploratory/              # One-off analyses
  data/                       # External/immutable inputs only
  outs/
    01_analysis/              # Outputs from script 01
      mdata.rds
      01_analysis.html        # Rendered HTML
      BUILD_INFO.txt
    02_plots/
      volcano.pdf
      02_plots.html
      BUILD_INFO.txt
    exploratory/

Sectioned Layout

For larger projects with multiple analytical threads:

project/
  R/                          # Shared R helpers (project-level)
  python/                     # Shared Python helpers (project-level)
  scripts/
    phosphoproteomics/
      01_analysis.qmd
      02_volcano_plots.qmd
    transcriptomics/
      01_heatmaps.qmd
    exploratory/
  data/
    gene_naming/              # Shared external data
    phosphoproteomics/        # Section-specific external data
    transcriptomics/
  outs/
    phosphoproteomics/
      01_analysis/
      02_volcano_plots/
    transcriptomics/
      01_heatmaps/
    exploratory/
  .claude/
    PHOSPHOPROTEOMICS_PLAN.md
    TRANSCRIPTOMICS_PLAN.md

Each section has its own script numbering (starting at 01_). Sections may have one or more planning documents in .claude/.

Cluster Projects

Projects that submit SLURM jobs on the HPC cluster add two directories:

project/
  batch/                        # SLURM batch scripts (.sh) — tracked in git
  logs/                         # SLURM output files (slurm-*.out) — tracked in git
  scripts/
  data/
  outs/

Script format on the cluster: .py is the default — see the "Script Format by Environment" section above for the full rule (and the local .qmd default + override marker).

The `.py` + `.sh` Pairing Convention

Analysis logic and SLURM job configuration are always separate files:

.py script in scripts/<section>/ — contains all analysis logic, reads inputs, writes outputs, produces plots. Self-contained: can be run interactively, locally, or via SLURM. Uses PROJECT_ROOT from git, not hardcoded paths.
.sh batch script in batch/ — thin SLURM wrapper that sets resource requests, activates the conda environment, and calls python scripts/<section>/XX_script.py. Contains no analysis logic.

This separation means:

The .py script can be run directly (python scripts/annotation/05_mapping.py) for debugging, interactive development, or local execution
SLURM resources can be adjusted without touching analysis code
The .sh script is short and templated (see hpc skill for the template)

Choosing a Subdirectory

When creating a new script in a sectioned project:

Check the project's CLAUDE.md for a Script Subdirectories table listing each subdirectory and its scope
If the task clearly fits one subdirectory, use it
If ambiguous, ask the user which subdirectory to use before creating the script
If none of the existing subdirectories fit, propose creating a new one

When to Use Each Layout

Flat: Single topic, small scope, fewer than ~10 scripts
Sectioned: Multiple distinct analytical threads, especially when different people work on different sections

`data/` vs `outs/`

Rule: If your code produced it, it goes in outs/. If it came from anywhere else, it goes in data/. Scripts never write to data/.

Script Numbering

Assign the next available number when adding a script
Never renumber existing scripts when one is archived or deleted
In sectioned projects, numbering restarts at 01_ in each section

Letter Suffixes (a, b, c)

Use letter suffixes when a single topic requires multiple scripts. Common reasons:

User review needed between steps (e.g., threshold selection → module detection)
Different output types (e.g., main analysis .qmd + plotting companion .R)
Language split when R and Python steps cannot share a .qmd (see Cross-Language rule below)

Rules for lettered scripts:

Same topic, same number. A new topic gets a new number, not a new letter.
Shared output directory. All scripts in a lettered set write to outs/XX_topic_name/ — NOT outs/XXa_name/, outs/XXb_name/. The output dir uses the number without a letter.
The a script runs first. Letters imply execution order within the set.
Name the set consistently. 15a_wgcna_threshold.qmd, 15b_wgcna_modules.qmd, 15c_wgcna_plots.R — all share outs/15_wgcna_platynereis/.
Companion scripts (.R or .py alongside .qmd) are acceptable for lightweight tasks (plotting, utilities).

When to use a new number vs a letter:

New number: different analytical question, different input data, different topic
Letter suffix: same topic split across steps, same conceptual analysis

Script Lifecycle

Every analysis script declares its status:

.qmd — status field in YAML frontmatter
.py — Status: development line in the module docstring

When deprecating, note the replacement in the frontmatter (.qmd) or docstring (.py).

Planning documents remain the authoritative tracker of script status across the project.

Exploratory Directory

scripts/exploratory/ (or scripts/<section>/exploratory/) is for one-off analyses, quick tests, and feasibility checks:

No number prefixes or BUILD_INFO.txt required
Other scripts must never depend on exploratory outputs (one-way dependency: exploratory scripts can read from any section's outs/)
Can be cleaned out periodically without breaking anything
No planning document needed
Good candidates for promotion: if an exploratory script proves useful, promote it to a numbered script in the appropriate section

Input/Output Tracking

R example:

# --- Inputs (from other scripts) ---
mdata <- readRDS(here("outs/phosphoproteomics/01_analysis/mdata.rds"))
modules <- read_tsv(here("outs/phosphoproteomics/02_module_lists/modules.tsv"))

# --- Inputs (external data) ---
gene_names <- read_tsv(here("data/gene_naming/spongilla_gene_names_final.tsv"))

Python example:

# --- Inputs (from other scripts) ---
modules = pd.read_csv(PROJECT_ROOT / "outs/phosphoproteomics/02_module_lists/modules.tsv", sep="\t")

# --- Inputs (external data) ---
gene_names = pd.read_csv(PROJECT_ROOT / "data/gene_naming/spongilla_gene_names_final.tsv", sep="\t")

Reading the top of any script shows exactly what it depends on and which upstream scripts produced those files. No separate DAG documentation needed.

Provenance

Archive Before Overwrite

Python:

import shutil
from datetime import datetime

existing_items = [f for f in out_dir.iterdir() if f.name != "_archive"]
if existing_items:
    build_info = out_dir / "BUILD_INFO.txt"
    if build_info.exists():
        orig_time = datetime.fromtimestamp(build_info.stat().st_mtime)
    else:
        all_files = [f for f in out_dir.rglob("*") if f.is_file() and "_archive" not in str(f)]
        orig_time = datetime.fromtimestamp(max(f.stat().st_mtime for f in all_files)) if all_files else datetime.now()

    archive_dir = out_dir / "_archive" / orig_time.strftime("%Y-%m-%d_%H%M%S")
    archive_dir.mkdir(parents=True, exist_ok=True)
    for item in existing_items:
        shutil.move(str(item), str(archive_dir / item.name))
    print(f"Archived {len(existing_items)} items → {archive_dir.name}")

existing_files <- list.files(out_dir, full.names = TRUE)
existing_files <- existing_files[!file.info(existing_files)$isdir]
if (length(existing_files) > 0) {
  build_info <- file.path(out_dir, "BUILD_INFO.txt")
  if (file.exists(build_info)) {
    orig_time <- file.info(build_info)$mtime
  } else {
    orig_time <- max(file.info(existing_files)$mtime)
  }
  archive_dir <- file.path(out_dir, "_archive", format(orig_time, "%Y-%m-%d_%H%M%S"))
  dir.create(archive_dir, recursive = TRUE, showWarnings = FALSE)
  file.rename(existing_files, file.path(archive_dir, basename(existing_files)))
  message("Archived ", length(existing_files), " previous outputs → ", basename(archive_dir))
}

Notes:

Only files are archived, not subdirectories (so _archive/ itself is never moved)
The _archive/ directory accumulates over time; periodically clean old archives
This pattern applies to all script types (.qmd, .py, .R)
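The periodic cleanup mentioned above could look something like the following sketch. The function name prune_archives and the keep parameter are hypothetical; the only thing taken from the convention is that snapshot folders under _archive/ are named with the %Y-%m-%d_%H%M%S pattern, so sorting by name sorts by time.

```python
import shutil
import tempfile
from pathlib import Path


def prune_archives(out_dir: Path, keep: int = 3) -> list[str]:
    """Delete all but the `keep` newest snapshots under out_dir/_archive."""
    archive_root = out_dir / "_archive"
    if not archive_root.is_dir():
        return []
    # Folder names follow %Y-%m-%d_%H%M%S, so name order is chronological.
    snapshots = sorted(p for p in archive_root.iterdir() if p.is_dir())
    stale = snapshots[:-keep] if keep > 0 else snapshots
    for snapshot in stale:
        shutil.rmtree(snapshot)
    return [p.name for p in stale]


# Demonstration in a throwaway directory (made-up timestamps).
with tempfile.TemporaryDirectory() as tmp:
    demo_out = Path(tmp) / "01_analysis"
    for stamp in ["2026-01-05_090000", "2026-01-20_101500",
                  "2026-02-01_120000", "2026-02-14_153000"]:
        (demo_out / "_archive" / stamp).mkdir(parents=True)
    removed = prune_archives(demo_out, keep=2)
    remaining = sorted(p.name for p in (demo_out / "_archive").iterdir())

print("removed:", removed)
print("remaining:", remaining)
```

Because this deletes data, it is probably best run by hand rather than from inside an analysis script.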

Git Hash

BUILD_INFO.txt

Every script writes a BUILD_INFO.txt to its output folder as its last action:

script: 01_analysis.qmd
commit: a1b2c3d
date: 2026-02-14 15:30:00
slurm_job_id: 6380027

When the script runs under SLURM, the job ID is recorded as well (Python):

import os

slurm_job_id = os.environ.get("SLURM_JOB_ID", "")
# ... in BUILD_INFO write block:
if slurm_job_id:
    f.write(f"slurm_job_id: {slurm_job_id}\n")

BUILD_INFO.txt lives in outs/ and is not tracked by git (since outs/ is in .gitignore).
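Since the file is plain "key: value" lines, it is easy to read back when checking which commit produced an output folder. The following is a minimal sketch under that assumption; parse_build_info is a hypothetical helper, and the sample text is the example shown above.

```python
def parse_build_info(text: str) -> dict[str, str]:
    """Parse 'key: value' lines into a dict, splitting on the first colon only."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition(":")
        info[key.strip()] = value.strip()
    return info


sample = """script: 01_analysis.qmd
commit: a1b2c3d
date: 2026-02-14 15:30:00
slurm_job_id: 6380027
"""
info = parse_build_info(sample)
print(info["script"], info["commit"], info["date"])
```

Splitting on the first colon only keeps the time portion of the date field intact. In practice the text would come from something like (out_dir / "BUILD_INFO.txt").read_text().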

Rendered HTML

Rendered .html output goes into outs/<script_name>/ alongside data outputs, keeping scripts/ clean.

See the quarto-docs skill for complete QMD templates with git hash and BUILD_INFO.txt chunks.

`.py` Analysis Script Template

#!/usr/bin/env python3
"""Short description of what this script does.

Input:  data/... (external), outs/.../file.tsv (from script XX)
Output: outs/section/XX_script_name/

Status: development
"""

import os  # needed for os.environ in the BUILD_INFO block below
import subprocess
import sys
from datetime import datetime
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless — saves to files, no display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# ── Setup ─────────────────────────────────────────────────────────────────────

PROJECT_ROOT = Path(
    subprocess.check_output(["git", "rev-parse", "--show-toplevel"], text=True).strip()
)
sys.path.insert(0, str(PROJECT_ROOT / "python"))

GIT_HASH = subprocess.check_output(
    ["git", "rev-parse", "--short", "HEAD"], text=True
).strip()
print(f"Git hash: {GIT_HASH}")

OUT_DIR = PROJECT_ROOT / "outs" / "section" / "XX_script_name"
OUT_DIR.mkdir(parents=True, exist_ok=True)

# ── Archive previous outputs ─────────────────────────────────────────────────

import shutil

existing_items = [f for f in OUT_DIR.iterdir() if f.name != "_archive"]
if existing_items:
    build_info = OUT_DIR / "BUILD_INFO.txt"
    if build_info.exists():
        orig_time = datetime.fromtimestamp(build_info.stat().st_mtime)
    else:
        all_files = [
            f for f in OUT_DIR.rglob("*")
            if f.is_file() and "_archive" not in str(f)
        ]
        orig_time = (
            datetime.fromtimestamp(max(f.stat().st_mtime for f in all_files))
            if all_files else datetime.now()
        )
    archive_dir = OUT_DIR / "_archive" / orig_time.strftime("%Y-%m-%d_%H%M%S")
    archive_dir.mkdir(parents=True, exist_ok=True)
    for item in existing_items:
        shutil.move(str(item), str(archive_dir / item.name))
    print(f"Archived {len(existing_items)} items -> {archive_dir.name}")

# ── Inputs ────────────────────────────────────────────────────────────────────

# --- Inputs (from other scripts) ---
# upstream = pd.read_csv(PROJECT_ROOT / "outs/.../file.tsv", sep="\t")

# --- Inputs (external data) ---
# raw_data = pd.read_csv(PROJECT_ROOT / "data/.../file.tsv", sep="\t")

# ── Analysis step 1 ───────────────────────────────────────────────────────────
#
# Describe WHAT this step does and WHY — the analytical reasoning, not just
# code mechanics. What question does this step answer? What should the reader
# look for in the output? This replaces the markdown narrative from .qmd files.
#
# Each major section should have a block comment like this. Not every line
# needs a comment, but every analytical step needs context. Also annotate:
# - Critical lines (thresholds, assumptions, non-obvious logic)
# - Tricky or surprising code that would confuse a reader

# ... analysis code, plots saved to OUT_DIR ...

# ── BUILD_INFO ────────────────────────────────────────────────────────────────

slurm_job_id = os.environ.get("SLURM_JOB_ID", "")
with open(OUT_DIR / "BUILD_INFO.txt", "w") as f:
    f.write("script: scripts/section/XX_script_name.py\n")
    f.write(f"commit: {GIT_HASH}\n")
    f.write(f"date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    if slurm_job_id:
        f.write(f"slurm_job_id: {slurm_job_id}\n")
print("BUILD_INFO.txt written")

Key features:

matplotlib.use("Agg") for headless rendering (no display server needed)
Git hash captured at start, printed to stdout, written to BUILD_INFO.txt
Archive-before-overwrite preserves previous outputs
Docstring with status, inputs, outputs serves as the script's documentation
All output goes to outs/, all reads from data/ or upstream outs/
PROJECT_ROOT from git, not hardcoded paths
Stdout serves as the execution log (redirect with python script.py | tee log.txt)

Helper Functions

Project-Level

Shared helper functions live in R/ and python/ at the project root:

# R scripts load helpers with:
source(here("R/gene_name_helpers.R"))

# Python scripts load helpers with:
import sys
sys.path.insert(0, str(PROJECT_ROOT / "python"))
from gene_name_helpers import normalize_name

Rules:

Do not version function names (make_gene_short, not make_gene_short_v2). Fix functions in place; git tracks the history. If a function's interface genuinely changes (different inputs/outputs/purpose), give it a descriptive name reflecting what it does, not when it was written.
Do not duplicate the same function in both R and Python within a project. Each function lives in one language. If a script in the other language needs that logic, rewrite it once in the new language and retire the old one.

Cross-Project

Functions shared across multiple projects live in ~/lib/R/ and ~/lib/python/:

source("~/lib/R/plotting_helpers.R")

When a project-level function proves useful across 2+ projects, promote it to ~/lib/. The ~/lib/ directory should be a git repository for version tracking.
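For Python, the cross-project directory can be put on the import path the same way the project-level python/ folder is. This is a sketch only: add_to_import_path is a hypothetical helper and plotting_helpers is an assumed module name mirroring the R example.

```python
import sys
from pathlib import Path

# ~/lib/python is the cross-project location named above.
SHARED_LIB = Path.home() / "lib" / "python"


def add_to_import_path(directory: Path) -> None:
    """Prepend a directory to sys.path once, so repeated calls do not add duplicates."""
    entry = str(directory)
    if entry not in sys.path:
        sys.path.insert(0, entry)


add_to_import_path(SHARED_LIB)
add_to_import_path(SHARED_LIB)  # second call is a no-op

# After this, shared modules import like any other, for example:
# from plotting_helpers import some_function  # hypothetical module and function
print(sys.path[0])
```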

Future: When distributing functions to collaborators, graduate shared functions into installable R/Python packages.

Cross-Language Data Interchange

When data produced by an R script will be read by a Python script (or vice versa), use Parquet:

Smaller than TSV, preserves column types, fast in both languages
R: arrow::write_parquet() / arrow::read_parquet()
Python: df.to_parquet() (a DataFrame method) / pd.read_parquet()

Avoid .rds (R-only) or .pkl (Python-only) for data that crosses the language boundary. Within a single language, native formats (.rds for R) are fine.

Cross-Language Scripts

Prefer single-language .qmd files. When a script needs both R and Python, split into lettered scripts (e.g., XXa_ in Python, XXb_ in R) that communicate through files in outs/XX_topic/.

Install manually

# Clone the repo
git clone https://github.com/musserlab/lab-claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r lab-claude-skills/skills/script-organization ~/.claude/skills/
