skills/gen-eval/SKILL.md
Run generator-evaluator testing against live services
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add jankneumann/agentic-coding-tools gen-eval
Run the generator-evaluator testing framework against live or local services. Generates test scenarios from interface descriptors, executes them, and evaluates results against expected behavior.
$ARGUMENTS - Optional flags:
- --descriptor <path>: Path to interface descriptor YAML (auto-detected if omitted)
- --mode <mode> (default: template-only): template-only, cli-augmented, or sdk-only
- --cli-command <cmd> (default: claude): CLI tool for cli-augmented mode
- --time-budget <minutes> (default: 60): Time budget for CLI mode
- --sdk-budget <usd>: USD budget cap for SDK mode
- --max-iterations <n> (default: 1): Feedback loop iterations
- --parallel <n> (default: 5): Concurrent scenario execution
- --changed-features-ref <git-ref>: Git ref for change detection
- --categories <cat1> [cat2 ...]: Filter to specific categories
- --report-format <format> (default: both): markdown, json, or both
- --output-dir <path> (default: .): Report output directory
- --no-services: Skip service startup/teardown
- --verbose: Enable verbose output
- --openspec-change <change-id>: OpenSpec change-id whose ### Requirement: and #### Scenario: blocks augment the cli-augmented prompt as additional constraints. Effective only with --mode cli-augmented. The change-id MUST match ^[a-zA-Z0-9_-]+$ (alphanumeric, underscore, hyphen only; no path separators or shell metacharacters); invalid values cause exit status 64 before any filesystem walk. When the change directory or its specs/ subdirectory is missing, gen-eval logs a warning and falls back to descriptor-only generation. Generated Scenario objects emitted under this flag include a source.openspec_scenario field of the form openspec/changes/<id>/specs/<rel>.md:<line-start>-<line-end> so failures can be traced to the originating Requirement.

If --descriptor is not provided, find the nearest descriptor YAML:
DESCRIPTOR=$(find . -path "*/evaluation/gen_eval/descriptors/*.yaml" -type f 2>/dev/null | head -1)
if [ -z "$DESCRIPTOR" ]; then
echo "ERROR: No gen-eval descriptor found. Provide --descriptor <path> or create one with /gen-eval-scenario."
exit 1
fi
echo "Auto-detected descriptor: $DESCRIPTOR"
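# Resolve the descriptor to an absolute path so it stays valid after the later cd
DESCRIPTOR="$(cd "$(dirname "$DESCRIPTOR")" && pwd)/$(basename "$DESCRIPTOR")"
# Find the project root (directory containing the descriptor's evaluation/ parent)
PROJECT_ROOT=$(dirname "$(dirname "$(dirname "$(dirname "$DESCRIPTOR")")")")
echo "Project root: $PROJECT_ROOT"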
# Use the project venv's Python if it exists; otherwise fall back to python3
if [ -f "$PROJECT_ROOT/.venv/bin/python" ]; then
PYTHON="$PROJECT_ROOT/.venv/bin/python"
else
PYTHON="python3"
fi
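The auto-detection above takes whichever descriptor `find` happens to print first. The sketch below is one possible way to make that choice deterministic and to warn when several descriptors exist. The helper names `list_descriptors` and `pick_descriptor` are hypothetical and not part of gen-eval.

```shell
# Hypothetical helper: list every descriptor under a root in sorted order,
# so the one picked first is deterministic.
list_descriptors() {
  find "${1:-.}" -path "*/evaluation/gen_eval/descriptors/*.yaml" -type f 2>/dev/null | sort
}

# Hypothetical helper: print the first descriptor, warning on stderr when
# others were ignored. Returns 1 when no descriptor is found.
pick_descriptor() {
  all=$(list_descriptors "$1")
  [ -n "$all" ] || return 1
  count=$(printf '%s\n' "$all" | wc -l | tr -d ' ')
  if [ "$count" -gt 1 ]; then
    echo "WARNING: $count descriptors found; using the first. Pass --descriptor to choose." >&2
  fi
  printf '%s\n' "$all" | head -n 1
}
```

With these helpers, `DESCRIPTOR=$(pick_descriptor .)` could stand in for the `find ... | head -1` line, assuming that sorted order is an acceptable tie-breaker for the project.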
Parse $ARGUMENTS for mode and flags. Build the CLI command:
# Defaults
MODE="${MODE:-template-only}"
PARALLEL="${PARALLEL:-5}"
MAX_ITER="${MAX_ITER:-1}"
REPORT_FORMAT="${REPORT_FORMAT:-both}"
OUTPUT_DIR="${OUTPUT_DIR:-.}"
CMD="$PYTHON -m evaluation.gen_eval --descriptor $DESCRIPTOR --mode $MODE --parallel $PARALLEL --max-iterations $MAX_ITER --report-format $REPORT_FORMAT --output-dir $OUTPUT_DIR"
# Append optional flags from arguments
if [ -n "$TIME_BUDGET" ]; then CMD="$CMD --time-budget $TIME_BUDGET"; fi
if [ -n "$SDK_BUDGET" ]; then CMD="$CMD --sdk-budget $SDK_BUDGET"; fi
if [ -n "$CLI_COMMAND" ]; then CMD="$CMD --cli-command $CLI_COMMAND"; fi
if [ -n "$CHANGED_REF" ]; then CMD="$CMD --changed-features-ref $CHANGED_REF"; fi
if [ -n "$CATEGORIES" ]; then CMD="$CMD --categories $CATEGORIES"; fi
if [ "$NO_SERVICES" = "true" ]; then CMD="$CMD --no-services"; fi
if [ "$VERBOSE" = "true" ]; then CMD="$CMD --verbose"; fi
if [ -n "$OPENSPEC_CHANGE" ]; then CMD="$CMD --openspec-change $OPENSPEC_CHANGE"; fi
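The command builder above assumes that variables such as MODE, TIME_BUDGET, and CATEGORIES have already been filled in from $ARGUMENTS, but the parsing step itself is not shown. A minimal sketch of one way to do it follows. The function name `parse_gen_eval_args` and the variable name OPENSPEC_CHANGE are assumptions made for illustration; the other variable names are the ones the builder already uses.

```shell
# Hypothetical parser: maps the documented flags onto shell variables.
# Returns 2 on an unknown flag or a flag that is missing its value.
parse_gen_eval_args() {
  CATEGORIES=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --no-services) NO_SERVICES="true"; shift ;;
      --verbose)     VERBOSE="true"; shift ;;
      --categories)
        shift
        # Collect every following word until the next "--" flag.
        while [ "$#" -gt 0 ] && [ "${1#--}" = "$1" ]; do
          CATEGORIES="${CATEGORIES:+$CATEGORIES }$1"
          shift
        done
        ;;
      --*)
        # Every remaining flag takes exactly one value.
        if [ "$#" -lt 2 ]; then
          echo "Missing value for $1" >&2
          return 2
        fi
        case "$1" in
          --descriptor)           DESCRIPTOR="$2" ;;
          --mode)                 MODE="$2" ;;
          --cli-command)          CLI_COMMAND="$2" ;;
          --time-budget)          TIME_BUDGET="$2" ;;
          --sdk-budget)           SDK_BUDGET="$2" ;;
          --max-iterations)       MAX_ITER="$2" ;;
          --parallel)             PARALLEL="$2" ;;
          --changed-features-ref) CHANGED_REF="$2" ;;
          --report-format)        REPORT_FORMAT="$2" ;;
          --output-dir)           OUTPUT_DIR="$2" ;;
          --openspec-change)      OPENSPEC_CHANGE="$2" ;;  # assumed variable name
          *) echo "Unknown flag: $1" >&2; return 2 ;;
        esac
        shift 2
        ;;
      *) echo "Unexpected argument: $1" >&2; return 2 ;;
    esac
  done
}
```

Calling `parse_gen_eval_args $ARGUMENTS` before the defaults are applied would leave unset variables to pick up their `${VAR:-default}` values, which matches how the builder is written.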
Execute from the project root:
cd "$PROJECT_ROOT"
echo "Running: $CMD"
$CMD
EXIT_CODE=$?
After execution, display a summary:
If EXIT_CODE != 0, highlight failing scenarios and suggest /gen-eval-scenario for authoring targeted scenarios.
if [ -f "$OUTPUT_DIR/gen-eval-report.md" ]; then
echo ""
echo "=== Gen-Eval Report ==="
cat "$OUTPUT_DIR/gen-eval-report.md"
fi
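The exit-status handling described for the summary step can be sketched as a small helper. This is an illustration only: the function name `summarize_gen_eval` and the message wording are hypothetical, and treating exit status 0 as a clean pass is an assumption. The status 64 case comes from the --openspec-change description.

```shell
# Hypothetical helper: turn the gen-eval exit status into a one-line summary.
summarize_gen_eval() {
  exit_code="$1"
  if [ "$exit_code" -eq 0 ]; then
    # Assumption: exit status 0 means no scenario failed.
    echo "gen-eval passed"
  elif [ "$exit_code" -eq 64 ]; then
    # Documented for an invalid --openspec-change value.
    echo "gen-eval rejected its arguments (exit 64, e.g. an invalid --openspec-change id)"
  else
    echo "gen-eval failed (exit $exit_code); review the failing scenarios in the report and consider /gen-eval-scenario for targeted scenarios"
  fi
}
```

It would be called as `summarize_gen_eval "$EXIT_CODE"` right after the report is printed.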
The simplest invocation auto-detects the descriptor and runs in template-only mode:
/gen-eval
With CLI-augmented generation (subscription-covered):
/gen-eval --mode cli-augmented --time-budget 30
Augmenting the cli-augmented prompt with OpenSpec scenarios from an active change:
/gen-eval --mode cli-augmented --openspec-change my-feature-change-id
Against specific categories:
/gen-eval --categories lock-lifecycle auth-boundary
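For the --openspec-change flag, two of the documented rules lend themselves to a short sketch: the change-id pattern ^[a-zA-Z0-9_-]+$ with exit status 64 on mismatch, and the source.openspec_scenario reference of the form <file>:<line-start>-<line-end>. The helpers below are hypothetical and only mirror those rules in shell; gen-eval performs its own validation, and how it does so internally is not shown here.

```shell
# Hypothetical pre-check mirroring the documented change-id rule:
# only letters, digits, underscore, and hyphen; empty values are rejected.
validate_change_id() {
  case "$1" in
    ''|*[!a-zA-Z0-9_-]*) return 64 ;;
    *) return 0 ;;
  esac
}

# Hypothetical helper: print the lines that a source.openspec_scenario
# reference points at, e.g. "path/to/spec.md:12-30".
print_openspec_scenario() {
  ref="$1"
  file="${ref%:*}"      # everything before the last colon
  range="${ref##*:}"    # "<line-start>-<line-end>"
  start="${range%-*}"
  end="${range#*-}"
  sed -n "${start},${end}p" "$file"
}
```

Checking the id up front, for example `validate_change_id "$id" || exit 64`, fails fast before any other work starts, and `print_openspec_scenario` gives a quick way to read the originating Requirement text when a generated scenario fails.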
- /validate-feature: Gen-eval runs as phase 4b (between smoke and e2e). Auto-detected when descriptors exist.
- /explore-feature: Gen-eval report signals (failing interfaces, coverage gaps) feed into feature opportunity ranking.
- /gen-eval-scenario: Create new scenario YAML files interactively.
- make gen-eval: Makefile shorthand for the most common invocation.

- gen-eval-report.md: Markdown report with pass/fail summary
- gen-eval-report.json: Machine-readable results
- gen-eval-metrics.json: Per-scenario metrics for pipeline integration