skills/circuit-surface-test/SKILL.md
Comprehensively manually test the Circuit plugin's user-facing surface in either Claude Code or Codex. Use this skill whenever the user asks to "manually test Circuit", "QA the Circuit plugin", "exercise the Circuit surface", "run the Circuit checklist", "smoke test Circuit", "find regressions in Circuit", "test the Claude Circuit plugin", "test the Codex Circuit plugin", or when preparing a Circuit release for marketplace publication. Argument is the host package to test — `claude` or `codex`. Produces a Markdown report with per-command pass/fail, exploratory findings ranked by severity, run-folder evidence links, and a concise terminal summary. Use even if the user does not say the word "test" — phrases like "go through every Circuit command" or "make sure Circuit still works end-to-end" should also trigger.
You are about to act as a QA tester for the Circuit plugin. The user has asked you to walk the entire user-facing surface of the plugin in a specific host (Claude Code or Codex), looking for regressions, broken progress rendering, missing summaries, and other defects a real user would hit.
This skill is a protocol, not a script. It tells you what to test, in what
order, what counts as a finding, and what to put in the report. The source
inventory and two host-specific checklists in references/ carry the
per-command details.
The user's argument tells you which plugin package to test:
- `claude`: the Claude Code plugin at `plugins/claude/`. The published slash commands are `/circuit:run` and `/circuit:handoff` only. Built-in flows (build, explore, fix, prototype, pursue, review) are routed through Run; they ship as compiled flow mirrors under `plugins/claude/skills/<flow>/`, not as separate slash commands.
- `codex`: the Codex plugin at `plugins/codex/`. The published Codex commands/skills are `run` and `handoff` only; flows are routed through Run and ship as compiled mirrors under `plugins/codex/flows/<flow>/`.

If the argument is missing or ambiguous, ask one short question before proceeding. Do not guess.
Circuit's surface is meant to be triggered from the host the plugin targets. There are four argument × executing-host combinations and they are not equivalent:
| You are running in | User asked to test | Execution mode |
|---|---|---|
| Claude Code | claude | Native. Use real /circuit:* slash commands. |
| Codex | codex | Native. Use real Codex Circuit skills. |
| Claude Code | codex | Cross-host. Slash commands won't work for the Codex package. Drop to CLI-only and inspect the Codex skill manifests for surface coverage. Note this clearly in the report. |
| Codex | claude | Cross-host. Same situation, reversed. |
When you can't natively invoke a host's surface, you are testing a strict
subset (CLI behavior + manifest correctness). State that limitation in the
report's Environment block so the operator isn't misled.
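The table above can be sketched as a small lookup. This is only an illustration: `execution_mode` is a hypothetical helper, not part of Circuit, and how the running host is detected is left to the caller.

```bash
# Hypothetical helper: map (running host, package under test) to an execution mode.
# Both arguments must be "claude" or "codex"; anything else is rejected so the
# caller asks the user instead of guessing.
execution_mode() {
  local running=$1 target=$2
  case "$running:$target" in
    claude:claude|codex:codex) echo "native" ;;
    claude:codex|codex:claude) echo "cross-host" ;;
    *) echo "unknown host or argument: '$running' / '$target'" >&2; return 1 ;;
  esac
}

execution_mode claude codex   # prints: cross-host
```

A `cross-host` result is the cue to drop to CLI-only coverage and record the limitation in the Environment block.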
Circuit's automated tests cover runtime contracts, generated surfaces, and focused connector paths. They do not prove every installed host package path, native rendering path, wrapper injection, and real worker subprocess shape in one user-facing pass. The whole point of this protocol is exercising the installed or packaged host surface against strict schemas.
Concrete consequence: before declaring a clean run, you must have executed at least one default-axis invocation per public flow against the real connector, reached either through Run (host recommendation or the deterministic router) or the explicit CLI flow start, since no flow ships its own host command. If a default-axis flow aborts on its first relay step, that is the headline finding and the rest of the surface coverage is moot until it's fixed. Section A0 of the checklist exists for exactly this reason — run it first, stop on structural failure.
Circuit is a tool with many entry points, modes, and outcomes. Skipping straight to exploratory testing tends to find one or two showy bugs and miss the boring regressions that matter for a release. Inverting the order — run a fixed checklist first, then circle back to anything that looked off and probe deeper — gives the operator both a baseline pass-rate signal and a ranked issues list.
Run the checklist for the chosen host. Mark each item pass / pass-with-finding / fail / skipped / partial-skip with a one-line note and a pointer to its run folder. Then, for any failures or anything that smelled wrong (slow, ugly progress output, vague error, missing field in the summary), do a focused exploratory pass.
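A minimal sketch of turning those marks into per-status counts for the report. It assumes a hypothetical tab-separated results file with `row-id`, `status`, and `note` columns; neither that file nor the `tally_statuses` name comes from Circuit.

```bash
# Hypothetical helper: count checklist rows per status.
# Input (stdin): tab-separated lines of <row-id> <status> <note>.
# Rows with an unknown status are reported on stderr and make the function fail.
tally_statuses() {
  awk -F '\t' '
    BEGIN {
      n = split("pass pass-with-finding fail skipped partial-skip", order, " ")
      for (i = 1; i <= n; i++) known[order[i]] = 1
    }
    !($2 in known) {
      printf "unknown status in row %s: %s\n", $1, $2 > "/dev/stderr"
      bad = 1
      next
    }
    { count[$2]++ }
    END {
      # Always print all five statuses, in a fixed order, including zeros.
      for (i = 1; i <= n; i++) printf "%s=%d\n", order[i], count[order[i]]
      exit (bad ? 1 : 0)
    }
  '
}
```

Feeding it the recorded rows, for example `tally_statuses < results.tsv`, yields one `status=count` line per status.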
Code-changing Circuit flows mutate files. Running them against the user's working repo would damage the actual project. Before starting, set up an isolated scratch repo unless the user has told you to test against the live repo.
```bash
# macOS `mktemp -t` treats its argument as a literal prefix, so pass a full
# template path (ending in XXXXXX) instead of using -t.
SCRATCH=$(mktemp -d "${TMPDIR:-/tmp}/circuit-surface-test.XXXXXX")
cd "$SCRATCH"
git init -q
echo "# scratch fixture for circuit-surface-test" > README.md
echo "function add(a, b) { return a + b; }" > add.js
echo "function buggyAdd(a, b) { return a - b; }" > bug.js
# Stub package.json so flows that emit a verify step have something to run.
# Without this, Build's verify step has nothing to invoke and may abort the
# run even when the implementation step landed correct edits.
cat > package.json <<'JSON'
{
  "name": "scratch",
  "private": true,
  "scripts": {
    "verify": "echo ok",
    "test": "echo ok",
    "check": "echo ok",
    "lint": "echo ok"
  }
}
JSON
# Throwaway commit identity so the commit works without global git config;
# the email address is a placeholder.
git add -A && git -c user.email=qa@example.com -c user.name=qa commit -q -m "fixture"
echo "Scratch: $SCRATCH"
```
The bug.js file is a planted bug for exercising Fix. The add.js file
gives Build something innocuous to act on. The stub package.json ensures
verify-style steps have an executable target.
Add more fixture files only if a checklist item asks for one.
For Explore and Review you can stay inside the scratch repo (or, if the user explicitly asks, the live Circuit repo) — those flows do not mutate.
Most surface tests should use the installed plugin root or the repo plugin root.
The wrapper uses its bundled runtime by default. If you explicitly need to test
a source checkout while operating on the scratch repo, pass the source CLI as a
development override and keep the shell cwd at $SCRATCH:
```bash
cd "$SCRATCH"
CIRCUIT_CLI="$WORKTREE/bin/circuit" \
  node "$WORKTREE/plugins/claude/scripts/circuit.ts" run <flow> ...
```
Do not set project-root environment variables just to make the wrapper find a runtime. The project root the flow operates on should stay the scratch repo unless the operator explicitly asked for live-repo coverage.
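One way to enforce that rule is a guard that refuses to proceed unless the shell is inside the scratch repo. This is a sketch under assumptions: `assert_in_scratch` is a hypothetical helper, and it compares physical paths because temporary directories are often reached through symlinks.

```bash
# Hypothetical guard: succeed only when the current directory is the scratch
# repo or one of its subdirectories. Pass the scratch path as the first argument.
assert_in_scratch() {
  local scratch here
  [ -n "${1:-}" ] || { echo "scratch path is empty" >&2; return 2; }
  # Resolve both paths physically so symlinked temp dirs compare equal.
  scratch=$(cd "$1" 2>/dev/null && pwd -P) || { echo "no such directory: $1" >&2; return 2; }
  here=$(pwd -P)
  case "$here/" in
    "$scratch"/*) return 0 ;;
    *) echo "refusing to run: $here is outside $scratch" >&2; return 1 ;;
  esac
}
```

Calling `assert_in_scratch "$SCRATCH" || exit 1` before each code-changing flow keeps a stray `cd` from pointing the flow at a real project.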
The report and per-run-folder evidence must outlive the scratch repo, since
operators sometimes want to re-read findings days later and mktemp -d
output gets cleaned. Default location:
```bash
REPORT_ROOT="$HOME/circuit-surface-test/$(date -u +%Y-%m-%dT%H-%M-%SZ)"
mkdir -p "$REPORT_ROOT"
```
If the user passed an explicit --keep-report-at <path> (or set
CIRCUIT_SURFACE_TEST_REPORT_ROOT), use that instead. Pass
--run-folder "$REPORT_ROOT/<flow-id>" to every Circuit invocation so all
evidence lands inside the report tree, not inside the scratch repo. Write
the final report to $REPORT_ROOT/report.md.
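The selection logic described above might look like the following sketch. `resolve_report_root` is a hypothetical helper, and letting the explicit path win over the environment variable is an assumption; the text only says either one replaces the default.

```bash
# Hypothetical helper: choose the report root.
# Assumed precedence: explicit --keep-report-at path (passed as $1), then
# CIRCUIT_SURFACE_TEST_REPORT_ROOT, then a timestamped folder under $HOME.
resolve_report_root() {
  local explicit=${1:-}
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
  elif [ -n "${CIRCUIT_SURFACE_TEST_REPORT_ROOT:-}" ]; then
    printf '%s\n' "$CIRCUIT_SURFACE_TEST_REPORT_ROOT"
  else
    printf '%s\n' "$HOME/circuit-surface-test/$(date -u +%Y-%m-%dT%H-%M-%SZ)"
  fi
}
```

Resolve it once at the start of a session, for example `REPORT_ROOT=$(resolve_report_root "$keep_report_at")` followed by `mkdir -p "$REPORT_ROOT"`, so every run folder shares the same timestamp. Here `keep_report_at` is a placeholder for wherever the user's path was captured.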
Follow these steps in order. Do not improvise the order of phases — the phases exist to keep the report honest.
Read references/current-surface-inventory.md in full, then run its evidence
commands from the current Circuit checkout. Record the observed commit,
dirty state, host packages, CLI help, wrapper versions, and wrapper doctor
summaries in the report. When testing a release or installed host package, also
record npm run doctor:plugins:installed so the report proves the local host
caches match the source package being tested.
The generated source map, generated host packages, local CLI/help output, runtime bundles, contracts, and specs are authority. If the checklist and the inventory disagree, trust the observable source, mark the checklist row stale, and record a finding against this skill rather than inventing expected behavior.
Read either references/checklist-claude.md or references/checklist-codex.md
in full. Each checklist enumerates every command/skill, the axis flags to
test, the expected outcomes, and what counts as a finding. Do not paraphrase
items from memory — read them.
Create the scratch repo as described above. Resolve the plugin root for the chosen host:
```bash
CIRCUIT_REPO="${CIRCUIT_REPO:-/Users/petepetrash/Code/circuit}"
# Set only the line for the host under test.
# Claude Code
PLUGIN_ROOT="${CLAUDE_PLUGIN_ROOT:-$CIRCUIT_REPO/plugins/claude}"
# Codex
PLUGIN_ROOT="${CODEX_PLUGIN_ROOT:-$CIRCUIT_REPO/plugins/codex}"
```
Sanity-check the plugin actually exists and the CLI runs:
```bash
node "$PLUGIN_ROOT/scripts/circuit.ts" --help
node "$PLUGIN_ROOT/scripts/circuit.ts" version --json
node "$PLUGIN_ROOT/scripts/circuit.ts" doctor --json
```
If any of those fail, stop. Record the failure in the report's Environment
block and surface it to the user — the rest of the checklist is moot.
Then snapshot the version under test so the report names exactly what got exercised:
```bash
git -C "$CIRCUIT_REPO" log -1 --format='%H %s'
git -C "$CIRCUIT_REPO" status --short
node "$PLUGIN_ROOT/scripts/circuit.ts" version --json
```
Record the commit, plugin version, and any uncommitted-files list in
the report's Environment block. If git status shows uncommitted
changes that affect plugin behavior (anything in plugins/, src/,
commands/, scripts/), call them out explicitly — the test only
exercises committed state if you ran npm run build after switching
worktrees, so divergence between working-tree intent and what's
actually being tested is itself a finding worth surfacing.
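To pick out the uncommitted paths that matter, the status output can be filtered by the four directories named above. A minimal sketch, assuming plain `git status --short` output; `behavior_affecting_changes` is a hypothetical name, and paths that git quotes because of unusual characters are not handled.

```bash
# Hypothetical helper: keep only uncommitted paths that can change plugin behavior.
# Input (stdin): `git status --short` output, i.e. two status characters, a
# space, then the path. Exit status is 0 when at least one such path exists.
behavior_affecting_changes() {
  cut -c4- | grep -E '^(plugins|src|commands|scripts)/'
}
```

Piping `git -C "$CIRCUIT_REPO" status --short` into it gives the list to call out in the Environment block; a non-zero exit means nothing behavior-affecting is dirty.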
The plugin ships a wrapper at $PLUGIN_ROOT/scripts/circuit.ts which
injects packaged-flow paths and other host-specific arguments before
forwarding to the underlying CLI. The wrapper has its own injection logic
and is a separate failure surface from the bare bin.
Whenever a checklist row is exercised through the host, you are testing the
wrapper. CLI fallback rows in the checklists already use the wrapper path,
so most coverage is automatic. But if you spot a wrapper-specific bug
(e.g. wrapper injects a flag the underlying subcommand rejects), confirm it
also fails through the host and reproduces against the wrapper from a
plain shell — not just against bin/circuit directly. Wrapper-only
bugs are easy to miss if the bare bin is your default debugging path.
Walk the checklist top to bottom. For each item:
outcome: aborted
even though the implementation step succeeded). Record the row as
pass-with-finding and file a separate exploratory finding for the
broken criterion. Without this tier, every "edits landed but verify
cycle aborted the run" turns into a Fail and obscures the real
regression count.

After the checklist is complete, look at every fail and every "smelled wrong" pass (slow, weird progress text, surprising summary, ambiguous error, sluggish render between events). Pick the most suspicious 3-5 and probe deeper. The probes that tend to find real issues:
- `$(cmd)`-shaped substrings, embedded newlines. The shell-escape rule in every command requires single-quote wrapping with `'\''` for literal apostrophes; verify it actually holds.
- `--rigor deep --tournament --tournament-n 2` and supported autonomous runs. Also test fail-closed combinations such as Build `--tournament`, Review `--autonomous`, and `--tournament-n 2` without `--tournament`.
- `--checkpoint-choice` from a fresh shell.
- `operator_summary_markdown_path` verbatim. Verify the rendered summary matches the file. When `operator_summary_status_text` or `operator_summary_html_path` is present, verify host behavior against `docs/contracts/host-rendering.md`.
- `--include-untracked-content` changes behavior (default omits content; flag includes it).

For each probe, write down what you tried, what happened, and what the expected behavior was if the result looks wrong.
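The single-quote rule mentioned in the probes can be checked with a round trip through `eval`. This is a sketch, not Circuit's own escaping code: `sq_quote` is a hypothetical helper that applies the `'\''` convention, and the probe string is an arbitrary example.

```bash
# Hypothetical helper: wrap a string in single quotes, replacing each
# apostrophe with '\'' so the result is safe to paste into a shell command.
sq_quote() {
  local rest=$1 q="'" out="'"
  while [[ $rest == *"$q"* ]]; do
    out+=${rest%%"$q"*}"'\\''"   # text before the apostrophe, then '\''
    rest=${rest#*"$q"}            # continue after that apostrophe
  done
  printf '%s\n' "$out$rest'"
}

# Probe string with an apostrophe, a $(cmd)-shaped substring, and a newline.
tricky=$'it\'s $(echo pwned)\nsecond line'
quoted=$(sq_quote "$tricky")
eval "roundtrip=$quoted"
[ "$roundtrip" = "$tricky" ] && echo "round-trip ok"
```

If the quoted form survives `eval` unchanged, the same string is a reasonable goal argument for probing whether a host command's escaping holds.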
Before writing the final report, run at least one local smoke that proves the protocol can be followed without depending on a live worker connector. Good minimum smokes:
```bash
node "$PLUGIN_ROOT/scripts/circuit.ts" version --json
node "$PLUGIN_ROOT/scripts/circuit.ts" doctor --json
node "$PLUGIN_ROOT/scripts/circuit.ts" handoff save \
  --goal 'surface-test protocol smoke' \
  --next 'confirm wrapper accepts utility subcommand args' \
  --state-markdown 'scratch session' \
  --debt-markdown 'none' \
  --progress jsonl
```
Record the smoke transcript and result in the report. This smoke is not a replacement for the checklist; it only proves the revised protocol's local setup and wrapper path can be followed.
Use assets/report-template.md as the structure. Fill every section. The
report must be self-contained — someone reading it without your terminal
should be able to assess plugin health. In particular:
27/31 pass, 3 fail, 1 pass-with-finding, 1 skipped, 2 partial-skip) at the top.

Save the report to the path described in "Report path" above.
After saving the report, print to the terminal — verbatim, no preamble:
Circuit surface test — host: <claude|codex>
Pass rate: <P>/<T> (<F> failed, <PWF> pass-with-finding, <S> skipped, <PS> partial-skip)
Critical: <count> | High: <count> | Medium: <count> | Low: <count>
Top findings:
1. <highest-severity finding> — <one-line>
2. <next highest> — <one-line>
3. <next highest> — <one-line>
Report: <absolute report path>
Scratch repo: <absolute scratch path>
Top findings are the three highest-severity findings overall — Critical first, then High, then Medium — not one-per-bucket. If there are fewer than three findings, list however many there are. The operator wants to glance and know the health number without opening the file.
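The ranking rule can be sketched as a severity sort that keeps input order within a bucket. The tab-separated input, the bracketed output style, and the `top_findings` name are all assumptions made for illustration.

```bash
# Hypothetical helper: print the three highest-severity findings.
# Input (stdin): tab-separated lines of <severity> <one-line title>, where
# severity is Critical, High, Medium, or Low. Ties keep their input order;
# lines with any other severity are dropped.
top_findings() {
  awk -F '\t' '
    BEGIN { rank["Critical"] = 0; rank["High"] = 1; rank["Medium"] = 2; rank["Low"] = 3 }
    $1 in rank { printf "%d\t%d\t%s\t%s\n", rank[$1], NR, $1, $2 }
  ' | sort -t "$(printf '\t')" -k1,1n -k2,2n | head -n 3 | cut -f3- |
    awk -F '\t' '{ printf "%d. [%s] %s\n", NR, $1, $2 }'
}
```

Because the sort is by severity across all findings, two High findings outrank a Medium one, which matches the "not one-per-bucket" rule.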
A finding is anything a real user would notice and complain about. Concretely:
- `presentation.status_text`, or when `presentation` is absent, a visible `display.text` event.
- `--mode` or `--depth` appears in a generated host surface or checklist row.

A finding is NOT:
You are testing on the operator's behalf. Be terse and concrete in the report. Do not soften findings to be polite. If something is broken, say it is broken and show the repro. If everything passed, say so plainly — a clean report is a real result.
- `references/checklist-claude.md`: Claude Code surface checklist
- `references/checklist-codex.md`: Codex surface checklist
- `references/current-surface-inventory.md`: current source-backed surface map
- `assets/report-template.md`: report structure
- `plugins/claude/` and `plugins/codex/`
- `src/cli/circuit.ts`