ericmjl/ml-experimentation

Name: ml-experimentation
Author: ericmjl

skills/ml-experimentation/SKILL.md

npx skillsauth add ericmjl/skills ml-experimentation


ML Experimentation

This skill guides a hypothesis-driven ML experiment life cycle: planning, fast iteration, script execution, targeted logging, journaling, diagnostic visualization, and scientific report writing.

Usage

Use this skill when the user wants to run an ML experiment, test a model or idea, or write up experiment results. First decide: new experiment (different question → new experiment directory) or new run (same question, tweaks → new run under runs/). See references/experiment-setup.md for that disambiguation, hypothesis scoping, and the fast-iteration checklist.

Requirements

Python 3.11+ with uv or pixi for running scripts: uv run script.py or, when pixi is the environment manager, pixi run python script.py (pixi reads pyproject.toml or pixi.toml).
Dependencies are declared via PEP 723 inline script metadata in each script (or, with pixi, in pyproject.toml / pixi.toml).
Respect the user's training framework (PyTorch, JAX, TensorFlow, etc.). Run scripts in a GPU-enabled environment wherever possible: with uv use GPU-enabled deps (e.g. JAX GPU extras, PyTorch via [[tool.uv.index]] CUDA index in the script block); with pixi use a GPU-enabled environment defined in pyproject.toml or pixi.toml. Fall back to CPU only when GPU is unavailable. See references/script-patterns.md.

What It Does

Planning – Decide new experiment (different question) vs new run (same question, tweaks). Extract one testable hypothesis, define success criteria, identify metrics to log; create experiment directory and JOURNAL.md (new experiment) or add a run under runs/ (new run).
De-risking – Keep runs under ~60 seconds; scale down data, epochs, or model size before longer runs
Scripts – Disposable scripts with PEP 723 metadata, run with uv run or pixi run (GPU-enabled environment preferred)
Logging – Log only what the hypothesis needs; avoid verbose or redundant logs
Journal – Read JOURNAL.md before each action; record observations, anomalies, hunches
Plots and report – Generate plots from logged data only; write a scientific report (abstract, intro, methods, results, discussion, conclusion) with no hallucination or editorialization

How It Works

Phase 1: Experiment Planning

New experiment vs new run: If the user is answering a different question → start a new experiment (new directory, JOURNAL.md, canonical tree). If they are tweaking to answer the same question → add a new run under the existing experiment, named with a full ISO-style datetime followed by a descriptive string, in the form YYYY-MM-DDTHH-MM-SS-<descriptive-string> (e.g. runs/2025-02-03T09-00-00-retry). See references/experiment-setup.md.
Extract a single, testable hypothesis from the user’s goal. Reject vague or multi-part goals; narrow to one claim that can be verified.
Define success and failure criteria before running anything.
List metrics to log that map directly to the hypothesis and criteria.
For a new experiment: Create an experiment directory and add JOURNAL.md (see Phase 5). Use the canonical tree: experiment → runs/YYYY-MM-DDTHH-MM-SS-<descriptive-string>/ → logs/, plots/, checkpoints/, data/. For a new run: Create only the new run directory under runs/ (e.g. runs/2025-02-03T09-00-00-retry/) with logs/, plots/, checkpoints/, data/.
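
The canonical tree described in this phase can be sketched as follows. This is a minimal illustration, not the scaffold from references/experiment-setup.md: the function names and the JOURNAL.md stub contents are assumptions, while the directory names and the run-name format come from the text above.

```python
from datetime import datetime
from pathlib import Path

RUN_SUBDIRS = ("logs", "plots", "checkpoints", "data")


def run_dir_name(descriptive: str, now: datetime) -> str:
    """Format a run name as YYYY-MM-DDTHH-MM-SS-<descriptive-string>."""
    return f"{now.strftime('%Y-%m-%dT%H-%M-%S')}-{descriptive}"


def create_experiment(root: Path, hypothesis: str) -> Path:
    """New experiment: create the directory, runs/, and a JOURNAL.md stub."""
    (root / "runs").mkdir(parents=True, exist_ok=True)
    journal = root / "JOURNAL.md"
    if not journal.exists():
        # Stub layout is an assumption; the skill only requires that the file exists.
        journal.write_text(f"# Journal\n\nHypothesis: {hypothesis}\n", encoding="utf-8")
    return root


def add_run(root: Path, descriptive: str, now: datetime) -> Path:
    """New run: create only the run directory and its four subdirectories."""
    run_dir = root / "runs" / run_dir_name(descriptive, now)
    run_dir.mkdir(parents=True, exist_ok=False)  # fails rather than overwrite a run
    for sub in RUN_SUBDIRS:
        (run_dir / sub).mkdir()
    return run_dir
```

Passing the timestamp in explicitly keeps the naming function deterministic; a real script would typically supply datetime.now().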

Phase 2: De-risking Loop (Fast Iteration)

Target: single run completes in under 60 seconds. If a run would take longer, scale down first.
Scale down by: smaller data subset (representative, not just “first N”), fewer epochs, simpler or smaller model, or fewer evaluation steps.
Sanity-check data loading, training loop, and evaluation on the scaled-down setup before committing to longer runs.
If something would take > 2 minutes, find a proxy that finishes in under 1 minute.
Before each run, verify the fast-iteration checklist in references/experiment-setup.md.
Each run is a directory under runs/ named with a full ISO-style datetime followed by a descriptive string, in the form YYYY-MM-DDTHH-MM-SS-<descriptive-string> (e.g. runs/2025-02-02T14-30-00-de-risk, runs/2025-02-02T15-00-00-full, runs/2025-02-03T09-00-00-retry). Keep a running log; never overwrite an existing run. See references/experiment-setup.md for the canonical tree.
To ignore failed or irrelevant runs without deleting: list them in IGNORED_RUNS.md (or JOURNAL.md “Ignored runs”). See references/experiment-setup.md.
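
One way to scale data down while keeping the subset representative, rather than taking the first N rows, is a seeded uniform random sample of indices. The sketch below assumes the dataset can be indexed by integer position; the function name and the default seed are illustrative. For heavily imbalanced labels a stratified sample may be a better fit than the uniform one shown here.

```python
import random


def representative_subset(n_items: int, fraction: float, seed: int = 0) -> list[int]:
    """Return sorted indices of a uniform random sample (not the first N)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(n_items * fraction))
    rng = random.Random(seed)  # fixed seed so de-risking runs are repeatable
    return sorted(rng.sample(range(n_items), k))


# Example: keep 1% of a 10,000-item dataset for a fast de-risking run.
subset = representative_subset(10_000, 0.01)
```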

Phase 3: Script Execution

All Python scripts use PEP 723 inline script metadata and are run with uv run script.py or, when pixi is the environment manager, pixi run python script.py (pixi uses pyproject.toml or pixi.toml). Always run train/eval in a GPU-enabled environment when possible (uv: CUDA index or jax[cuda*] in script block; pixi: GPU-enabled env in pyproject.toml or pixi.toml).
Run scripts with CWD = experiment directory so paths like runs/2025-02-02T14-30-00-de-risk, runs/2025-02-02T15-00-00-full are relative to the experiment. Scripts accept only the descriptive name (e.g. uv run train.py de-risk or pixi run python train.py de-risk); datetime is auto-calculated. The training script (train.py) creates the run directory (logs/, plots/, checkpoints/, data/) so the experiment is self-contained—no external scaffold; see references/script-patterns.md for the Typer-based train scaffold.
Longer runs: When executing a run that will take more than a minute or two, pass a custom timeout to the shell/bash tool used to run the script (e.g. the tool’s timeout parameter). Otherwise the tool may hit its default execution timeout and kill the run before completion.
Scripts are disposable: they are experiment artifacts, not production code. Include any synthetic data generation scripts in <experiment>/ (e.g. generate_data.py); run them with CWD = experiment directory.
Train and eval scripts: Use the user's chosen framework and prefer GPU-enabled dependencies (JAX GPU extras, PyTorch via [[tool.uv.index]] CUDA index when using uv; GPU deps in pyproject.toml/pixi.toml when using pixi) so runs are performant; fall back to CPU only when GPU is unavailable.
Use the patterns in references/script-patterns.md for data loading, training, and evaluation.
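
The two execution rules above (CWD = experiment directory, explicit timeout for longer runs) have a direct analogue when a script is launched from Python instead of an agent's shell tool. The sketch below is only that analogue, using the standard library subprocess module; the helper name run_in_experiment is hypothetical, and the demo command is a trivial stand-in for a real training command.

```python
import subprocess
import sys
from pathlib import Path


def run_in_experiment(
    cmd: list[str], experiment_dir: Path, timeout_s: float
) -> subprocess.CompletedProcess:
    """Run cmd with the experiment directory as CWD and an explicit timeout.

    Raises subprocess.TimeoutExpired if the run exceeds timeout_s seconds.
    """
    return subprocess.run(
        cmd,
        cwd=experiment_dir,   # relative paths such as runs/... resolve here
        timeout=timeout_s,    # explicit budget instead of a tool default
        capture_output=True,
        text=True,
        check=False,
    )


# Stand-in command; a real call might pass ["uv", "run", "train.py", "de-risk"].
result = run_in_experiment([sys.executable, "-c", "print('ok')"], Path.cwd(), timeout_s=30)
```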

Phase 4: Logging (Targeted)

Log only what the hypothesis and success criteria need.
Required: metrics that determine success/failure (e.g. loss, accuracy, F1).
Optional: diagnostics that could explain unexpected results (e.g. per-batch stats, timing).
Avoid: verbose debug logs, full model state, gradients (unless the hypothesis is about them).
Use loguru for logging; write to the run’s logs/ directory (e.g. runs/2025-02-02T14-30-00-de-risk/logs/train.log, runs/2025-02-02T15-00-00-full/logs/eval.log). See references/logging-guide.md.
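
A minimal sketch of targeted logging into a run's logs/ directory follows. It deliberately uses the standard library logging module rather than loguru so that it runs without extra dependencies; the key=value line format and both function names are assumptions, not the format prescribed in references/logging-guide.md.

```python
import logging
from pathlib import Path


def make_run_logger(run_dir: Path, filename: str = "train.log") -> logging.Logger:
    """Return a logger that writes to <run_dir>/logs/<filename>."""
    log_dir = run_dir / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"run.{run_dir.name}.{filename}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep metrics out of the root logger's output
    if not logger.handlers:
        handler = logging.FileHandler(log_dir / filename, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s | %(levelname)s | %(message)s")
        )
        logger.addHandler(handler)
    return logger


def log_metrics(logger: logging.Logger, **metrics: float) -> None:
    """Log only the named metrics, as space-separated key=value pairs."""
    logger.info(" ".join(f"{key}={value:.6g}" for key, value in metrics.items()))
```

A training loop would call log_metrics(logger, epoch=epoch, loss=loss) once per epoch, logging only the quantities the success criteria depend on.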

Phase 5: JOURNAL.md Protocol

Before each action (next run, next script, next analysis): read JOURNAL.md.
Record: observations, anomalies, unexpected behavior, hunches, follow-ups.
Add a timestamp to each entry.
Use tags: [WEIRD], [HUNCH], [TODO], [RESOLVED] so entries are scannable.
The journal is the primary memory for the experiment; use it to decide the next step and to inform the Discussion section of the final report.
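
The journal protocol above can be sketched as an append-only helper that timestamps and tags each entry. The one-line entry layout and the function names are assumptions; only the four tags and the timestamp requirement come from the text.

```python
from datetime import datetime
from pathlib import Path

TAGS = {"WEIRD", "HUNCH", "TODO", "RESOLVED"}


def append_journal_entry(journal: Path, tag: str, text: str, now: datetime) -> str:
    """Append one timestamped, tagged line to JOURNAL.md and return it."""
    if tag not in TAGS:
        raise ValueError(f"unknown tag: {tag}")
    entry = f"- {now.isoformat(timespec='seconds')} [{tag}] {text}\n"
    with journal.open("a", encoding="utf-8") as fh:  # append, never rewrite
        fh.write(entry)
    return entry


def entries_with_tag(journal: Path, tag: str) -> list[str]:
    """Return the journal lines carrying the given tag, for quick scanning."""
    lines = journal.read_text(encoding="utf-8").splitlines()
    return [line for line in lines if f"[{tag}]" in line]
```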

Phase 6: Diagnostic Plots and Reporting

Plots

Before plotting: Read IGNORED_RUNS.md (and JOURNAL.md’s “Ignored runs” section if present); exclude any listed runs from plots.
Generate only plots that correspond to logged data. Do not invent or assume data.
Examples: training curves (loss/accuracy vs step/epoch), metric distributions, comparison bars.
Save plots in the run’s plots/ directory (e.g. runs/2025-02-02T15-00-00-full/plots/loss_curve.webp), generated from that run’s logs/.
If you log epoch and loss to e.g. runs/2025-02-02T15-00-00-full/logs/train.log, generate runs/2025-02-02T15-00-00-full/plots/loss_curve.webp from that log; do not plot quantities that were not logged.
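
The data-gathering step behind these plotting rules might look like the sketch below: skip ignored runs, then read only the points that were actually logged. It assumes IGNORED_RUNS.md lists one run directory name per line (optionally as a bullet) and that train.log lines contain epoch=<int> loss=<float>; both formats and all function names are assumptions. Rendering the curve to plots/loss_curve.webp is left to whichever plotting library the experiment uses.

```python
import re
from pathlib import Path

# Assumed log line format: "... epoch=<int> loss=<float>".
METRIC = re.compile(r"epoch=(?P<epoch>\d+)\s+loss=(?P<loss>[-+0-9.eE]+)")


def load_ignored(experiment_dir: Path) -> set[str]:
    """Read run names from IGNORED_RUNS.md; return an empty set if it is absent."""
    path = experiment_dir / "IGNORED_RUNS.md"
    if not path.exists():
        return set()
    names = set()
    for line in path.read_text(encoding="utf-8").splitlines():
        name = line.strip().lstrip("-* ").strip().strip("/")
        if name and not name.startswith("#"):  # skip blanks and headings
            names.add(name.removeprefix("runs/"))
    return names


def plottable_runs(experiment_dir: Path) -> list[Path]:
    """List run directories in name order, excluding ignored runs."""
    ignored = load_ignored(experiment_dir)
    runs = sorted(p for p in (experiment_dir / "runs").iterdir() if p.is_dir())
    return [p for p in runs if p.name not in ignored]


def read_loss_curve(run_dir: Path) -> list[tuple[int, float]]:
    """Return the (epoch, loss) pairs found in the run's train.log, if any."""
    log = run_dir / "logs" / "train.log"
    if not log.exists():
        return []  # nothing logged means nothing to plot
    points = []
    for line in log.read_text(encoding="utf-8").splitlines():
        match = METRIC.search(line)
        if match:
            points.append((int(match.group("epoch")), float(match.group("loss"))))
    return points
```

An empty list from read_loss_curve is the signal to skip the plot entirely rather than draw anything inferred.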

Scientific Report

Before writing: Exclude runs listed in IGNORED_RUNS.md or JOURNAL.md “Ignored runs” from the report narrative and figures; do not delete those runs from disk.
Structure: Abstract, Introduction, Methods, Results, Discussion, Conclusion.
No hallucination: only refer to data that was actually collected (cite log files, tables, figures).
No editorialization: state what happened and what the data show; do not state what you wish had happened.
Include highlights from JOURNAL.md (e.g. anomalies, resolved issues) in Discussion.
Use the template in references/report-template.md.

Guardrails

No long runs without justification – If a script would take > 2 minutes, either get explicit confirmation or propose a scaled-down run that finishes in under 1 minute. When running a longer run, pass a custom timeout to the shell/bash tool so it does not hit the default execution timeout.
Journal-first – Always read JOURNAL.md before suggesting or taking the next action.
Data-backed plots only – Never generate a plot without corresponding logged data; every curve or point must come from a specified log or file.
Report factuality – Every claim in the report must be tied to a specific log file, table, or figure; no unsupported claims.

34 stars

testing

Updated May 31, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add ericmjl/skills ml-experimentation

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: May 31, 2026, 2:55 AM (52.1s, 5 files scanned)

SKILL.md

name: ml-experimentation
description: Conduct machine learning experiments from planning through evaluation and report writing. Use when running ML experiments, testing hypotheses, training models, or writing up results. Covers single-hypothesis scoping, fast iteration loops, targeted logging, JOURNAL.md protocol, data-backed diagnostic plots, and scientific report writing.
license: MIT


Install manually

# Clone the repo
git clone https://github.com/ericmjl/skills.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r skills/skills/ml-experimentation ~/.claude/skills/

Repository: ericmjl/skills (34 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT