codex/skills/saddle-up/SKILL.md
Continuously evaluate and improve AGENTS.md-style harness instructions through explicit-trigger OpenCode loops with an explicit model. Use when you want recurring harness reliability runs, especially for Gemini 2.5 Pro/OpenCode harness tuning, clean-repo eval cycles, curated exact-output probes, automatic eval-branch commits and PR updates for passing harness/doc changes, and external-blocker detection or regression auto-revert without scheduler/cron automation.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add tkersey/dotfiles saddle-up
Run an explicit-trigger continuous loop that updates a target harness, evaluates it against a fixed suite, and promotes passing changes through a dedicated branch + PR flow. Recent session-mining updates bias the loop toward Gemini 2.5 Pro failure modes: exact-output envelopes, proof honesty, failure-recovery wording, workdir discipline, and immediate external-blocker reporting.
Use an explicit model on every run, and prefer a clean target repo or a docs-only worktree.
uv run --with pyyaml codex/skills/saddle-up/scripts/saddle_up.py run \
--repo /path/to/target-repo \
--harness-path AGENTS.md \
--model google/gemini-2.5-pro
For a bounded debugging pass that cannot hang forever:
uv run --with pyyaml codex/skills/saddle-up/scripts/saddle_up.py run \
--repo /path/to/target-repo \
--harness-path AGENTS.md \
--model google/gemini-2.5-pro \
--no-commit \
--max-cycles 1 \
--opencode-timeout-seconds 180
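If the bounded pass needs to be driven from a script, the command above can be assembled and wrapped in an outer wall-clock limit. The following is a minimal sketch, not part of the skill: `build_run_argv` and `run_bounded` are hypothetical helper names, and only the flags shown in the commands in this document are used.

```python
import subprocess

# Script path as it appears in the commands in this document.
SCRIPT = "codex/skills/saddle-up/scripts/saddle_up.py"


def build_run_argv(repo, model, harness_path="AGENTS.md", *, no_commit=True,
                   max_cycles=1, opencode_timeout_seconds=180):
    """Assemble the documented `run` invocation as an argv list (hypothetical helper)."""
    argv = [
        "uv", "run", "--with", "pyyaml", SCRIPT, "run",
        "--repo", str(repo),
        "--harness-path", harness_path,
        "--model", model,
    ]
    if no_commit:
        argv.append("--no-commit")
    if max_cycles is not None:
        argv += ["--max-cycles", str(max_cycles)]
    if opencode_timeout_seconds is not None:
        argv += ["--opencode-timeout-seconds", str(opencode_timeout_seconds)]
    return argv


def run_bounded(argv, wall_clock_seconds):
    """Run argv; return its exit code, or None if the outer timeout expired."""
    try:
        return subprocess.run(argv, timeout=wall_clock_seconds, check=False).returncode
    except subprocess.TimeoutExpired:
        return None
```

Calling `run_bounded(build_run_argv("/path/to/target-repo", "google/gemini-2.5-pro"), 900)` would start the same bounded pass with an extra 15-minute ceiling; the 900-second figure is only an illustration.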
If the improver path is the problem and you want to evaluate the current harness as-is:
uv run --with pyyaml codex/skills/saddle-up/scripts/saddle_up.py run \
--repo /path/to/target-repo \
--harness-path AGENTS.md \
--model google/gemini-2.5-pro \
--skip-improve \
--no-commit \
--max-cycles 1 \
--opencode-timeout-seconds 180
If you want a fast proof of the curated harness only, without replay cases:
uv run --with pyyaml codex/skills/saddle-up/scripts/saddle_up.py run \
--repo /path/to/target-repo \
--harness-path AGENTS.md \
--model google/gemini-2.5-pro \
--skip-improve \
--case-source curated \
--case-parallelism 4 \
--no-commit \
--max-cycles 1 \
--opencode-timeout-seconds 600
Refresh a Gemini-tuned suite before the next run:
uv run --with pyyaml codex/skills/saddle-up/scripts/saddle_up.py replay-refresh \
--repo /path/to/target-repo \
--harness-path AGENTS.md \
--model google/gemini-2.5-pro \
--refresh-curated
Stop gracefully from another shell:
touch /path/to/target-repo/.saddle-up/STOP
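The stop-file convention can be pictured as a check made before each cycle. This is a sketch under assumptions, not the skill's actual implementation: `stop_requested` and `run_cycles` are hypothetical names, and the assumption that the file is polled between cycles (rather than mid-cycle) is mine.

```python
from pathlib import Path

# Default stop-file location relative to the target repo, as documented.
DEFAULT_STOP_FILE = ".saddle-up/STOP"


def stop_requested(repo, stop_file=DEFAULT_STOP_FILE):
    """Return True if the stop file exists (relative paths resolve under repo)."""
    path = Path(stop_file)
    if not path.is_absolute():
        path = Path(repo) / path
    return path.exists()


def run_cycles(repo, run_one_cycle, max_cycles=None, stop_file=DEFAULT_STOP_FILE):
    """Run cycles until the stop file appears or max_cycles is reached.

    Assumption: the stop file is checked once before each cycle, so a cycle
    that is already running finishes before the loop exits.
    """
    completed = 0
    while max_cycles is None or completed < max_cycles:
        if stop_requested(repo, stop_file):
            break
        run_one_cycle()
        completed += 1
    return completed
```

With this shape, `touch`-ing the stop file from another shell lets the current cycle finish and prevents the next one from starting.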
Inspect state:
uv run --with pyyaml codex/skills/saddle-up/scripts/saddle_up.py status \
--repo /path/to/target-repo
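For ad hoc inspection outside the `status` command, the run log at `.saddle-up/runs.jsonl` can be read directly. The sketch below assumes only that the file holds one JSON object per line, as the `.jsonl` extension suggests; the record fields are not documented here, so none are assumed, and `tail_runs` is a hypothetical helper name.

```python
import json
from pathlib import Path


def tail_runs(repo, limit=5):
    """Return the last `limit` parsed records from .saddle-up/runs.jsonl."""
    path = Path(repo) / ".saddle-up" / "runs.jsonl"
    if limit <= 0 or not path.exists():
        return []
    records = []
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            # Skip lines that are not valid JSON, e.g. a partially written record.
            continue
    return records[-limit:]
```

For example, `for record in tail_runs("/path/to/target-repo"): print(record)` prints the most recent entries without depending on any particular schema.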
Refresh replay cases from OpenCode prompt history (seq opencode-prompts):
uv run --with pyyaml codex/skills/saddle-up/scripts/saddle_up.py replay-refresh \
--repo /path/to/target-repo \
--model google/gemini-2.5-pro
- [...] opencode availability, and no pre-existing non-doc changes that would poison the docs-only gate.
- [...] .saddle-up/ files if missing, using model-aware defaults when the target is Gemini 2.5 Pro.
- [...] not run honesty, retry-path wording, workdir discipline, anti-drift, and external hard stops.
- [...] AGENTS.md churn: rerun all curated cases immediately after the improver, keep the diff only if that curated gate stays green, and otherwise revert the protected retry/workdir/external-blocker/not run rule drift before mixed eval can count it.
- [...] (>=80% by default) and docs-scope write policy.
- [...] saddle-up/eval and open/update PR.

- threshold: 0.80
- stability_window: 3 consecutive passes
- opencode_timeout_seconds default:
  - 600 for google/gemini-2.5-pro
  - 180 for other profiles
- [...] 80% curated / 20% replay
- [...] 60% curated / 40% replay
- [...] .saddle-up/STOP (override with --stop-file)
- max_cycles: unbounded unless set
- [...] saddle-up/eval
- replay-refresh --refresh-curated reseeds the curated suite from the current model profile

run and status read/write these files under the target repo:
- .saddle-up/suite.yaml
- .saddle-up/scoring.yaml
- .saddle-up/state.yaml
- .saddle-up/runs.jsonl

Schema details: [...]

- [...] .saddle-up/* state files.
- AGENTS.md edits must clear a dedicated curated gate before mixed eval can count them.
- [...] run.

- If yaml import fails, run with uv run --with pyyaml ....
- [...] opencode run, first verify whether the model-aware default is simply too low for the current model; for Gemini 2.5 Pro the default is 600 seconds, and you can still override it explicitly with --opencode-timeout-seconds.
- [...] --skip-improve to evaluate the current harness without another rewrite attempt.
- [...] AGENTS.md needs more literal rewrites; inspect the reverted rule IDs in status/runs.jsonl first.
- [...] --case-source curated to prove the curated harness independently before spending more cycles on replay prompts.
- [...] --case-parallelism so the exact-output checks can run concurrently on the same model.
- If openrouter/google/gemini-2.5-pro hits credit or max_tokens failures, switch to direct google/gemini-2.5-pro before spending more harness cycles.
- [...] --no-commit --max-cycles 1.
- [...] external_blocker, clear the provider/auth/network issue first; do not keep cycling a blocked harness.
- [...] replay-refresh --model google/gemini-2.5-pro --refresh-curated to restore the Gemini-focused suite.
- [...] (.saddle-up/STOP) or interrupt with Ctrl+C.
- If gh auth fails, run gh auth login before enabling PR automation.
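The documented defaults (threshold 0.80, a stability window of 3 consecutive passes, and a model-aware OpenCode timeout) can be pictured with a few small functions. This is a sketch under assumptions rather than the skill's code: the function names are hypothetical, the exact-string match on the model name is my simplification of "profile", and scoring a cycle as passed cases over total cases is assumed.

```python
DEFAULT_THRESHOLD = 0.80       # documented default: threshold: 0.80
DEFAULT_STABILITY_WINDOW = 3   # documented default: 3 consecutive passes


def default_opencode_timeout(model):
    """Documented defaults: 600 s for google/gemini-2.5-pro, 180 s otherwise.

    Assumption: the profile is selected by an exact match on the model string.
    """
    return 600 if model == "google/gemini-2.5-pro" else 180


def passes_gate(passed, total, threshold=DEFAULT_THRESHOLD):
    """Return True when the pass rate meets the threshold (assumed scoring rule)."""
    if total <= 0:
        return False
    return passed / total >= threshold


def is_stable(cycle_results, window=DEFAULT_STABILITY_WINDOW):
    """Return True when the most recent `window` cycles all passed."""
    if len(cycle_results) < window:
        return False
    return all(cycle_results[-window:])
```

Under these assumptions, a cycle with 8 of 10 cases passing clears the default gate, and a harness counts as stable only after three such cycles in a row.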