Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

SethGammon/experiment

Name: experiment
Author: SethGammon

skills/experiment/SKILL.md

npx skillsauth add SethGammon/Citadel experiment


/experiment — Metric-Driven Optimization Loop

Inputs

The user provides three things:

scope: Files to modify (glob pattern, e.g., "src/api/**/*.ts")
metric: Shell command that outputs a single number (e.g., npm run build 2>&1 | tail -1 | grep -oP '\d+')
budget: Iteration cap (default: 5) or time cap (e.g., "10 minutes")

If any input is missing, ask for it. The metric MUST output a single number to stdout.
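
The single-number requirement can be checked before any iteration runs. The following is a minimal Python sketch, assuming the metric command's stdout has already been captured as a string; `parse_metric` is a hypothetical helper name and not part of the skill.

```python
import re
from typing import Optional

# Matches exactly one integer or decimal number, with optional sign and exponent.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def parse_metric(stdout: str) -> Optional[float]:
    """Return the metric value, or None if stdout is not a single number."""
    text = stdout.strip()
    if _NUMBER.fullmatch(text) is None:
        return None  # empty output, extra words, or several numbers
    return float(text)
```

Output such as `12 passing` or two numbers on separate lines is rejected instead of being silently truncated, which matches the rule that the metric must print one number.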

Protocol

Step 1: BASELINE

Stash any uncommitted changes (restore on exit)
Run the metric command. Record the baseline value.
Determine direction: does lower = better (bundle size, error count) or higher = better (FPS, test count)? Ask the user if ambiguous.
Log: Baseline: {value} ({metric command})
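
One way to guarantee the "restore on exit" behavior is a context manager around the whole run. This is a sketch under assumptions: the helper names `git` and `stashed_changes` and the stash message are hypothetical, and it assumes nothing else pushes to the stash while the experiment runs.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator


def git(*args: str) -> str:
    """Run a git command and return its stdout (raises on non-zero exit)."""
    result = subprocess.run(["git", *args], check=True, capture_output=True, text=True)
    return result.stdout


@contextmanager
def stashed_changes(run: Callable[..., str] = git) -> Iterator[bool]:
    """Stash uncommitted changes and restore them on exit, even if the body raises."""
    # Only stash when the tree is dirty; otherwise a later pop could
    # restore an unrelated, older stash entry.
    dirty = run("status", "--porcelain").strip() != ""
    if dirty:
        run("stash", "push", "--include-untracked", "-m", "experiment-autostash")
    try:
        yield dirty
    finally:
        if dirty:
            run("stash", "pop")
```

The `run` parameter exists only so the git calls can be replaced by a fake in tests; in normal use, `with stashed_changes(): ...` wraps the baseline measurement and every iteration.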

Step 2: ITERATE

For each iteration (up to budget):

Create isolation: Spawn a sub-agent in a worktree (isolation: "worktree")
Propose change: The agent modifies files within scope to improve the metric. Provide context: baseline value, metric direction, scope, what previous iterations tried.
Measure: Run the metric command in the worktree (via node scripts/run-with-timeout.js 300)
Gate: Run typecheck (also via timeout wrapper). If it fails, discard immediately.
Evaluate:
- Improved? → KEEP. Merge the worktree branch. New baseline = new value.
- Same or worse? → DISCARD. Delete the worktree.

Log iteration:

Iteration {N}: {value} ({delta from baseline}) → {KEEP|DISCARD}
Change: {one-line description of what was tried}
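
The gate and evaluate steps reduce to a small decision function. The sketch below is illustrative only: the function names are hypothetical, the direction strings follow the report's "lower|higher" field, and a failed metric is represented as `None`.

```python
from typing import Optional


def is_improvement(value: float, baseline: float, direction: str) -> bool:
    """direction is "lower" or "higher", matching the report's Direction field."""
    if direction == "lower":
        return value < baseline
    if direction == "higher":
        return value > baseline
    raise ValueError(f"unknown direction: {direction!r}")


def verdict(value: Optional[float], baseline: float, direction: str,
            typecheck_ok: bool) -> str:
    """Return "KEEP" or "DISCARD" for one iteration."""
    if value is None:       # metric command failed or printed non-numeric output
        return "DISCARD"
    if not typecheck_ok:    # gate: a failed typecheck discards immediately
        return "DISCARD"
    # "Same or worse" is a discard, so the comparison is strict.
    return "KEEP" if is_improvement(value, baseline, direction) else "DISCARD"


def log_line(n: int, value: float, baseline: float, outcome: str, change: str) -> str:
    """Format the two-line iteration log entry."""
    delta = value - baseline
    return f"Iteration {n}: {value:g} ({delta:+g}) → {outcome}\nChange: {change}"
```

For example, `verdict(90, 100, "lower", True)` yields `"KEEP"`, while an unchanged value of 100 is discarded.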

Step 3: CONVERGENCE CHECK

After each iteration, check:

Local optimum: Last 3 iterations all discarded → stop ("no more improvements found")
Diminishing returns: Last kept improvement was < 0.5% → stop ("diminishing returns")
Budget exhausted: Iteration count or time exceeded → stop
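
The three stop conditions can be sketched as one function evaluated after each iteration. Several details here are assumptions rather than part of the skill: the history record shape, measuring the 0.5% gain relative to the previous baseline, checking the conditions in the order listed above, and treating the budget as an iteration count only.

```python
from typing import List, Optional, Tuple

# Each entry is (verdict, relative_gain); relative_gain is only meaningful
# for KEEP entries, e.g. 0.02 for a 2% improvement over the previous baseline.
History = List[Tuple[str, float]]


def stop_reason(history: History, budget: int,
                patience: int = 3, min_gain: float = 0.005) -> Optional[str]:
    """Return "convergence", "diminishing", "budget", or None to keep iterating."""
    # Local optimum: the last `patience` iterations were all discarded.
    if len(history) >= patience and all(v == "DISCARD" for v, _ in history[-patience:]):
        return "convergence"
    # Diminishing returns: the most recent kept change gained less than 0.5%.
    kept = [gain for v, gain in history if v == "KEEP"]
    if kept and kept[-1] < min_gain:
        return "diminishing"
    # Budget exhausted: iteration cap reached.
    if len(history) >= budget:
        return "budget"
    return None
```

The returned strings match the `{convergence|diminishing|budget}` values used for the stop reason in the report.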

Step 4: REPORT

Write results to .planning/research/experiment-{slug}.md:

# Experiment: {Description}

> Metric: `{command}`
> Direction: {lower|higher} is better
> Scope: {glob pattern}
> Budget: {N iterations}
> Date: {ISO date}

## Results

| Iteration | Value | Delta | Verdict | Change |
|-----------|-------|-------|---------|--------|
| baseline  | {N}   | —     | —       | —      |
| 1         | {N}   | {+/-} | KEEP    | {desc} |
| 2         | {N}   | {+/-} | DISCARD | {desc} |

## Outcome
- **Start**: {baseline}
- **End**: {final value}
- **Improvement**: {percentage}
- **Iterations**: {kept}/{total}
- **Stop reason**: {convergence|diminishing|budget}

## Kept Changes
{List of changes that were kept, with commit hashes}

Also log to .planning/telemetry/agent-runs.jsonl:

{"event":"experiment-complete","slug":"{slug}","baseline":0,"final":0,"improvement":"0%","kept":0,"total":0,"timestamp":"ISO"}
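
A sketch of producing that telemetry line in Python. The field names come from the record above; the helper names, the one-decimal percentage format, and the choice to report 0.0 for a zero baseline are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def improvement_pct(baseline: float, final: float, direction: str) -> float:
    """Percentage gain in the "better" direction; 0.0 when the baseline is 0."""
    if baseline == 0:
        return 0.0
    change = (baseline - final) if direction == "lower" else (final - baseline)
    return 100.0 * change / abs(baseline)


def telemetry_record(slug: str, baseline: float, final: float,
                     direction: str, kept: int, total: int) -> dict:
    """Build one experiment-complete event with an ISO 8601 UTC timestamp."""
    return {
        "event": "experiment-complete",
        "slug": slug,
        "baseline": baseline,
        "final": final,
        "improvement": f"{improvement_pct(baseline, final, direction):.1f}%",
        "kept": kept,
        "total": total,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def append_jsonl(path: Path, record: dict) -> None:
    """Append one JSON object per line, creating parent directories if missing."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Appending rather than rewriting keeps earlier runs in `agent-runs.jsonl` intact, and creating the parent directory covers the case where `.planning/` does not exist yet.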

Common Metrics

| Goal | Metric Command |
|------|----------------|
| Reduce bundle size | `npm run build 2>&1 \| grep -oP 'Total size: \K\d+'` |
| Reduce type errors | `npx tsc --noEmit 2>&1 \| grep -c 'error TS'` |
| Increase test pass rate | `npm test 2>&1 \| grep -oP '\d+(?= passing)'` |
| Reduce file count | `find src -name '*.ts' \| wc -l` |
| Reduce line count | `wc -l src/**/*.ts \| tail -1 \| awk '{print $1}'` |

When to Use

When you want to optimize a measurable metric (bundle size, error count, test coverage, FPS)
When you have a clear hypothesis but aren't sure which of several approaches wins
When manual A/B testing would be too slow or error-prone
NOT when the goal is subjective ("make it feel better") — the metric must be a number

Safety Rules

NEVER modify files outside scope
ALWAYS use worktree isolation for changes
ALWAYS run typecheck before keeping a change
Restore stashed changes on exit (even on error)
If the metric command fails, treat as DISCARD (not crash)
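
The first rule can be enforced mechanically by comparing the files an iteration changed against the scope glob. The sketch below is an assumption-laden illustration: `glob_to_regex` and `out_of_scope` are hypothetical helpers, paths are assumed to be POSIX-style and relative to the repository root, and only `*`, `?`, and `**` are supported.

```python
import re
from typing import Iterable, List


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a glob with *, ? and ** into a regex over POSIX-style paths."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(r".*")           # trailing **: anything below this point
            i += 2
        elif pattern[i] == "*":
            out.append(r"[^/]*")        # stays within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append(r"[^/]")         # exactly one non-separator character
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def out_of_scope(changed_files: Iterable[str], scope: str) -> List[str]:
    """Return the changed paths that do NOT match the scope glob."""
    rx = glob_to_regex(scope)
    return [p for p in changed_files if rx.fullmatch(p) is None]
```

The list of changed files could come from `git diff --name-only` in the worktree; any non-empty result from `out_of_scope` would be grounds to discard the iteration.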

Contextual Gates

Disclosure: "Running experiment loop on [target] with fitness: [function]. Each iteration commits. Budget: [N iterations]."

Reversibility: amber. Modifies source files across iterations; each iteration is committed; undo with git revert on kept commits.

Trust gates:

- Familiar (5+ sessions): iterates and commits autonomously; novices should use /improve with manual review between steps.

Quality Gates

Baseline was measured before any iterations ran
Every kept iteration improved the metric AND passed typecheck
Every discarded iteration has a logged reason
The stop reason is one of: convergence, diminishing returns, or budget exhausted
The experiment report exists at .planning/research/experiment-{slug}.md with all iteration rows filled

Fringe Cases

Metric command outputs nothing or non-numeric text: Treat as a metric failure. Ask the user to provide a command that outputs a single number to stdout before starting iterations.

No worktree support (e.g., shallow clone): Fall back to branch isolation. Create a branch, run changes there, measure, then delete or merge the branch. Never modify the working tree directly.

If .planning/research/ does not exist: Create it before writing the experiment report. If .planning/ itself doesn't exist, create the full path or output the report inline.

Budget exhausted with zero kept iterations: Report outcome as "no improvement found". This is a valid result — do not continue past the budget.

Exit Protocol

---HANDOFF---
- Experiment: {description}
- Result: {baseline} → {final} ({improvement}%)
- Kept: {N}/{total} iterations
- Stop reason: {reason}
- Report: .planning/research/experiment-{slug}.md
- Reversibility: amber — undo kept iterations with `git revert` on each kept commit
---


Automated optimization loop with scalar fitness function. Proposes changes in isolated worktrees, measures with a metric command, keeps improvements, discards failures. Supports convergence detection and diminishing returns.

546 stars

data-ai

Updated May 8, 2026


npx skillsauth add SethGammon/Citadel experiment

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: May 8, 2026, 2:44 AM · 196.8s · 3 files scanned

SKILL.md

name: experiment
description: >-
user-invocable: true
auto-trigger: false
last-updated: 2026-03-21


Related Skills

SethGammon/triage

development

Verified · Trusted · Community

GitHub issue and PR investigator. Pulls open issues/PRs, classifies them, searches the codebase for root cause or reviews contributed code, proposes fixes with file:line references, and optionally implements fixes. Use for investigating GitHub issues and reviewing PRs; do NOT use for general code review unrelated to GitHub issues.

586 · SKILL.md · Updated Apr 21, 2026

SethGammon/telemetry

development

Verified · Trusted · Community

Unified telemetry hub. Shows current session cost, today's spend, all-time totals, hook activity, trust level, and a directory of every telemetry command available. Also the control surface to toggle telemetry on/off and tune thresholds. Single entry point for anyone asking "what does this cost" or "what telemetry does Citadel have".

586 · SKILL.md · Updated Apr 21, 2026

SethGammon/schedule

devops

Verified · Trusted · Community

Manages recurring and one-off scheduled tasks. Session-scoped scheduling via CronCreate/CronDelete/CronList. Documents the cloud path for tasks that need to survive machine sleep or network drops.

586 · SKILL.md · Updated Apr 21, 2026

SethGammon/qa

tools

Verified · Trusted · Community

Browser-based QA verification. Launches a real browser, navigates the app, clicks buttons, fills forms, and tests user flows. Works as a standalone skill or as a phase end condition in campaigns. Requires Playwright (optional dependency, graceful skip if not installed).

586 · SKILL.md · Updated Apr 21, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.


Install manually


# Clone the repo
git clone https://github.com/SethGammon/Citadel.git

# Make sure the global Claude Code skills folder exists
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r Citadel/skills/experiment ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

SethGammon/Citadel

546 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT