skills/experiment-runbook-discipline/exports/openai/SKILL.md
Plan, launch, monitor, and document long-running experiments or validation sweeps with smallest-real-data smoke scopes, fresh experiment prefixes, live status artifacts, rolling logs, and promotion to full runs only after explicit pass criteria are met.
npx skillsauth add balandongiv/agent-skillbook experiment-runbook-discipline
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Use this skill to keep long-running experiments observable, reproducible, and easy to audit later. Treat every substantial run as an investigation with durable artifacts, not as a terminal session that disappears once the process ends. The goal is not only to start the job, but to make it easy for another person or agent to answer four questions at any time: what is running, what logic version it uses, whether it is progressing, and how the final metrics turned out. When the user cares about true pipeline readiness, prefer the smallest real-data smoke scope first and promote only after explicit pass criteria are satisfied.
If the user wants to validate a pipeline, experiment stack, or result-producing workflow, use real data first whenever it is available and safe to use. Synthetic data is excellent for unit tests and regression coverage, but it is not enough to claim that the real file layout, annotations, caches, or dependency boundaries work correctly.
Create a markdown investigation note before the run starts. Record the dataset path, subject list or scope, experiment goal, chosen prefix, and whether existing outputs may be reused or must be force-rerun.
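A minimal sketch of writing such a note before launch, assuming a plain markdown file named after the prefix. The function names, the notes directory, and the note layout are illustrative assumptions, not part of the skill itself.

```python
from datetime import datetime, timezone
from pathlib import Path


def render_investigation_note(dataset_path, scope, goal, prefix, reuse_outputs):
    """Return the markdown text of a pre-run investigation note."""
    created = datetime.now(timezone.utc).isoformat(timespec="seconds")
    reuse = "may be reused" if reuse_outputs else "must be force-rerun"
    lines = [
        f"# Investigation: {prefix}",
        "",
        f"- Created (UTC): {created}",
        f"- Dataset path: {dataset_path}",
        f"- Scope: {', '.join(scope)}",
        f"- Goal: {goal}",
        f"- Prefix: {prefix}",
        f"- Existing outputs: {reuse}",
        "",
        "## Run log",
        "",
    ]
    return "\n".join(lines)


def write_investigation_note(notes_dir, **fields):
    """Write the note to <notes_dir>/<prefix>.md and return its path."""
    notes_dir = Path(notes_dir)
    notes_dir.mkdir(parents=True, exist_ok=True)
    path = notes_dir / f"{fields['prefix']}.md"
    path.write_text(render_investigation_note(**fields), encoding="utf-8")
    return path
```

The empty "Run log" heading leaves a place to append observations during and after the run.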
Use a fresh experiment prefix whenever logic changes in a way that could affect results. Do not mix outputs from different logic versions under the same prefix.
If the environment uses editable local packages, record that fact in the investigation note and include the relevant local repo path or package name in the run scope. If a failure originates inside that dependency, patch the dependency, then rerun the appropriate smoke scope before promoting the broader run.
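One way to detect an editable install for the note is to read the `direct_url.json` metadata that PEP 610 defines. This is a sketch under that assumption: the helper names are hypothetical, and older egg-link style editable installs that do not record this file would not be detected.

```python
import json
from importlib import metadata


def parse_editable_source(direct_url_text):
    """Return the source URL if the PEP 610 record marks the install editable."""
    if not direct_url_text:
        return None
    info = json.loads(direct_url_text)
    if info.get("dir_info", {}).get("editable"):
        return info.get("url")
    return None


def editable_source(package_name):
    """Return the local source URL of an editable package, or None."""
    try:
        dist = metadata.distribution(package_name)
    except metadata.PackageNotFoundError:
        return None
    # read_text returns None when the distribution has no such file.
    return parse_editable_source(dist.read_text("direct_url.json"))
```

A non-None result can be copied into the investigation note alongside the package name.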
For any run that may take meaningful time, create or reuse a runner that writes:
If the run is long enough to justify background execution, launch it in the background and immediately verify that it is actually progressing.
Do not assume a run is healthy because a process exists. Confirm that completed counts, timestamps, or other progress indicators are advancing.
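As a sketch, the check can compare two snapshots of a status file taken some time apart. The field names `completed` and `updated_utc` are assumptions about what the runner writes, not a format defined by this skill.

```python
import json
from datetime import datetime
from pathlib import Path


def read_status(status_path):
    """Load one snapshot of the (hypothetical) JSON status file."""
    return json.loads(Path(status_path).read_text(encoding="utf-8"))


def is_progressing(previous, current):
    """True if the completed count or the update timestamp advanced."""
    if current["completed"] > previous["completed"]:
        return True
    before = datetime.fromisoformat(previous["updated_utc"])
    after = datetime.fromisoformat(current["updated_utc"])
    return after > before
```

Take one snapshot right after launch and another after a short wait. A live process with two identical snapshots is a reason to inspect the log, not evidence of health.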
A run is not complete just because files were written. Confirm the expected summary files exist and report the actual metrics from them.
Confirm the dataset path and the exact scope of the run. Decide whether the task is:
If the scope is unclear, start with the smallest safe real-data scope and promote to a larger run only after the smaller scope is clean.
Write the investigation note before launching anything. Capture:
Update this note during the run and after completion.
Pick a prefix that clearly separates this run from older artifacts. If logic changed, increment the prefix even if the dataset and scope are unchanged. Treat old outputs as historical evidence, not as inputs to the new conclusion.
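A small sketch of incrementing the prefix, assuming each run writes to a directory named `<base>_vN` under a common output root. That naming scheme is an illustrative assumption.

```python
import re
from pathlib import Path


def next_prefix(output_root, base):
    """Return '<base>_vN', one past the highest version already on disk."""
    pattern = re.compile(rf"^{re.escape(base)}_v(\d+)$")
    versions = []
    root = Path(output_root)
    if root.is_dir():
        for child in root.iterdir():
            match = pattern.match(child.name)
            if child.is_dir() and match:
                versions.append(int(match.group(1)))
    # With no earlier runs, start at v1.
    return f"{base}_v{max(versions, default=0) + 1}"
```

Older directories are left untouched, so they remain available as historical evidence.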
Make sure the run emits at least:
Tell the user exactly which files can be watched live.
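The sketch below shows one possible shape for watchable artifacts: a size-capped rolling log and a JSON status file that is replaced atomically. The file names, status fields, and logger name are all assumptions chosen for illustration.

```python
import json
import logging
import os
from datetime import datetime, timezone
from logging.handlers import RotatingFileHandler
from pathlib import Path


def make_rolling_logger(log_path, name="experiment_run", max_bytes=1_000_000, backups=3):
    """Return a logger that writes to a size-capped, rotating log file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = RotatingFileHandler(
            log_path, maxBytes=max_bytes, backupCount=backups, encoding="utf-8"
        )
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def write_status(status_path, prefix, completed, total):
    """Write a live status snapshot; readers never see a half-written file."""
    status_path = Path(status_path)
    payload = {
        "prefix": prefix,
        "completed": completed,
        "total": total,
        "updated_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
    tmp = status_path.with_suffix(status_path.suffix + ".tmp")
    tmp.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    os.replace(tmp, status_path)  # atomic rename on the same filesystem
    return payload
```

With this layout, the two paths to hand to the user are the log file and the status file.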
Launch the run in the foreground or background as appropriate. Then verify all of the following:
If the run goes quiet for too long, inspect the log and process state before restarting.
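A sketch of the inspection step, assuming the run writes a text log whose modification time reflects recent activity. The quiet threshold is a hypothetical parameter to tune per workload.

```python
import time
from collections import deque
from pathlib import Path


def seconds_since_update(log_path):
    """Seconds elapsed since the log file was last modified."""
    return time.time() - Path(log_path).stat().st_mtime


def looks_stalled(log_path, quiet_limit_s):
    """True if the log has been silent for longer than the limit."""
    return seconds_since_update(log_path) > quiet_limit_s


def tail(log_path, n=20):
    """Return the last n lines of the log for a quick look."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]
```

If `looks_stalled` returns true, read the tail and check the process state before deciding to restart.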
When the run finishes, check for the expected outputs such as:
Then extract and report the key metrics from those outputs. Do not report success without citing the final artifacts that justify it.
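The completion check can be sketched as below, assuming the run writes a JSON metrics summary. The file name `summary.json` and the return shape are illustrative assumptions.

```python
import json
from pathlib import Path


def collect_final_metrics(output_dir, expected_files, metrics_file="summary.json"):
    """Verify the expected outputs exist, then load the metrics summary."""
    output_dir = Path(output_dir)
    required = list(dict.fromkeys([*expected_files, metrics_file]))
    missing = [name for name in required if not (output_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing expected outputs: {missing}")
    metrics_path = output_dir / metrics_file
    metrics = json.loads(metrics_path.read_text(encoding="utf-8"))
    # Return the source path so the report can cite the artifact it came from.
    return {"metrics": metrics, "source": str(metrics_path)}
```

Reporting the `source` path next to each number keeps the success claim tied to the artifact that justifies it.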
If you changed scoring, preprocessing, evaluation logic, or any other result-affecting code, do all of the following: