jankneumann/gen-eval

Name: gen-eval
Author: jankneumann

skills/gen-eval/SKILL.md

npx skillsauth add jankneumann/agentic-coding-tools gen-eval

Gen-Eval

Run the generator-evaluator testing framework against live or local services. Generates test scenarios from interface descriptors, executes them, and evaluates results against expected behavior.

Arguments

$ARGUMENTS - Optional flags:

--descriptor <path> — Path to interface descriptor YAML (auto-detected if omitted)
--mode <mode> (default: template-only) — template-only, cli-augmented, or sdk-only
--cli-command <cmd> (default: claude) — CLI tool for cli-augmented mode
--time-budget <minutes> (default: 60) — Time budget for CLI mode
--sdk-budget <usd> — USD budget cap for SDK mode
--max-iterations <n> (default: 1) — Feedback loop iterations
--parallel <n> (default: 5) — Concurrent scenario execution
--changed-features-ref <git-ref> — Git ref for change detection
--categories <cat1> [cat2 ...] — Filter to specific categories
--report-format <format> (default: both) — markdown, json, or both
--output-dir <path> (default: .) — Report output directory
--no-services — Skip service startup/teardown
--verbose — Enable verbose output
--openspec-change <change-id> — OpenSpec change-id whose ### Requirement: and #### Scenario: blocks augment the cli-augmented prompt as additional constraints. Effective only with --mode cli-augmented. The change-id MUST match ^[a-zA-Z0-9_-]+$ (alphanumeric, underscore, hyphen only — no path separators or shell metacharacters); invalid values cause exit status 64 before any filesystem walk. When the change directory or its specs/ subdirectory is missing, gen-eval logs a warning and falls back to descriptor-only generation. Generated Scenario objects emitted under this flag include a source.openspec_scenario field of the form openspec/changes/<id>/specs/<rel>.md:<line-start>-<line-end> so failures can be traced to the originating Requirement.
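The change-id rule for --openspec-change can be checked before anything touches the filesystem. The following is a minimal bash sketch of that check, not gen-eval's actual implementation; the function name validate_change_id is hypothetical, while the pattern and the exit status 64 come from the flag description above.

```shell
# Hypothetical helper mirroring the documented rule for --openspec-change.
# Returns 0 for a valid change-id and 64 otherwise.
validate_change_id() {
  local id="$1"
  # Documented pattern: alphanumeric, underscore, hyphen only
  local pattern='^[a-zA-Z0-9_-]+$'
  if [[ "$id" =~ $pattern ]]; then
    return 0
  fi
  echo "ERROR: invalid change-id: $id" >&2
  return 64
}

validate_change_id "my-feature-change-id" && echo "ok"
validate_change_id "../etc/passwd" || echo "rejected with status $?"
```

Rejecting the value up front means a string containing a path separator or shell metacharacter never reaches a path such as openspec/changes/<id>/.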

Steps

1. Auto-Detect Descriptor

If --descriptor is not provided, search the current directory tree for a descriptor YAML and take the first match that find returns:

DESCRIPTOR=$(find . -path "*/evaluation/gen_eval/descriptors/*.yaml" -type f 2>/dev/null | head -1)

if [ -z "$DESCRIPTOR" ]; then
  echo "ERROR: No gen-eval descriptor found. Provide --descriptor <path> or create one with /gen-eval-scenario."
  exit 1
fi
echo "Auto-detected descriptor: $DESCRIPTOR"
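The find | head -1 pipeline above returns whichever match find happens to print first, which depends on directory traversal order. If the intent is the descriptor closest to the starting directory, one possible sketch is to rank matches by path depth. The function name nearest_descriptor is hypothetical, and defining "nearest" as "shallowest path" is an assumption.

```shell
# Hypothetical helper: print the shallowest matching descriptor under a root
# directory (ties broken alphabetically), or nothing if none exists.
# Assumes paths contain no tab or newline characters.
nearest_descriptor() {
  local root="${1:-.}"
  local tab
  tab=$(printf '\t')
  # Prefix each path with its number of "/"-separated fields, sort by that
  # depth, keep the first line, then strip the depth prefix again.
  find "$root" -path "*/evaluation/gen_eval/descriptors/*.yaml" -type f 2>/dev/null |
    awk -F/ '{ print NF "\t" $0 }' |
    LC_ALL=C sort -t "$tab" -k1,1n -k2,2 |
    head -n 1 |
    cut -f2-
}
```

Calling it as DESCRIPTOR=$(nearest_descriptor .) would slot in where step 1 runs find, and the empty-result check that follows would work unchanged.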

2. Detect Project Root and Activate Venv

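# Make the descriptor path absolute so it stays valid after the cd in step 4
DESCRIPTOR="$(cd "$(dirname "$DESCRIPTOR")" && pwd)/$(basename "$DESCRIPTOR")"

# Project root: the directory that contains evaluation/, four levels above the descriptor file
PROJECT_ROOT=$(dirname "$(dirname "$(dirname "$(dirname "$DESCRIPTOR")")")")
echo "Project root: $PROJECT_ROOT"

# Prefer the project venv's interpreter; fall back to the system python3
if [ -f "$PROJECT_ROOT/.venv/bin/python" ]; then
  PYTHON="$PROJECT_ROOT/.venv/bin/python"
else
  PYTHON="python3"
fi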

3. Parse Mode and Build Command

Parse $ARGUMENTS for mode and flags. Build the CLI command:

# Defaults
MODE="${MODE:-template-only}"
PARALLEL="${PARALLEL:-5}"
MAX_ITER="${MAX_ITER:-1}"
REPORT_FORMAT="${REPORT_FORMAT:-both}"
OUTPUT_DIR="${OUTPUT_DIR:-.}"

# Build the command as a bash array so values containing spaces stay intact
CMD=("$PYTHON" -m evaluation.gen_eval --descriptor "$DESCRIPTOR" --mode "$MODE" --parallel "$PARALLEL" --max-iterations "$MAX_ITER" --report-format "$REPORT_FORMAT" --output-dir "$OUTPUT_DIR")

# Append optional flags from arguments
if [ -n "$TIME_BUDGET" ]; then CMD+=(--time-budget "$TIME_BUDGET"); fi
if [ -n "$SDK_BUDGET" ]; then CMD+=(--sdk-budget "$SDK_BUDGET"); fi
if [ -n "$CLI_COMMAND" ]; then CMD+=(--cli-command "$CLI_COMMAND"); fi
if [ -n "$CHANGED_REF" ]; then CMD+=(--changed-features-ref "$CHANGED_REF"); fi
# CATEGORIES is a space-separated list; it is left unquoted so each name becomes its own argument
if [ -n "$CATEGORIES" ]; then CMD+=(--categories $CATEGORIES); fi
# OPENSPEC_CHANGE is an assumed variable name for the documented --openspec-change flag
if [ -n "$OPENSPEC_CHANGE" ]; then CMD+=(--openspec-change "$OPENSPEC_CHANGE"); fi
if [ "$NO_SERVICES" = "true" ]; then CMD+=(--no-services); fi
if [ "$VERBOSE" = "true" ]; then CMD+=(--verbose); fi

4. Run Gen-Eval

Execute from the project root:

cd "$PROJECT_ROOT" || exit 1
echo "Running: ${CMD[*]}"
"${CMD[@]}"
EXIT_CODE=$?
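Step 3 assumes that MODE, PARALLEL, TIME_BUDGET and the other variables have already been filled in from $ARGUMENTS. A minimal bash sketch of that parsing follows. The function name parse_gen_eval_args and the variable OPENSPEC_CHANGE are hypothetical; the other variable names are the ones step 3 uses, and the flags are those listed under Arguments.

```shell
# Hypothetical helper: fill the step-3 variables from a list of flags.
# Returns 2 on an unknown flag or a flag that is missing its value.
parse_gen_eval_args() {
  local flag var
  while [ $# -gt 0 ]; do
    flag="$1"; shift
    case "$flag" in
      --descriptor)           var=DESCRIPTOR ;;
      --mode)                 var=MODE ;;
      --cli-command)          var=CLI_COMMAND ;;
      --time-budget)          var=TIME_BUDGET ;;
      --sdk-budget)           var=SDK_BUDGET ;;
      --max-iterations)       var=MAX_ITER ;;
      --parallel)             var=PARALLEL ;;
      --changed-features-ref) var=CHANGED_REF ;;
      --report-format)        var=REPORT_FORMAT ;;
      --output-dir)           var=OUTPUT_DIR ;;
      --openspec-change)      var=OPENSPEC_CHANGE ;;
      --no-services)          NO_SERVICES="true"; continue ;;
      --verbose)              VERBOSE="true"; continue ;;
      --categories)
        # Collect every following word up to the next "--" flag
        CATEGORIES=""
        while [ $# -gt 0 ] && [ "${1#--}" = "$1" ]; do
          CATEGORIES="${CATEGORIES:+$CATEGORIES }$1"
          shift
        done
        continue ;;
      *) echo "ERROR: unknown argument: $flag" >&2; return 2 ;;
    esac
    # Every remaining flag takes exactly one value
    [ $# -gt 0 ] || { echo "ERROR: $flag needs a value" >&2; return 2; }
    printf -v "$var" '%s' "$1"
    shift
  done
}

parse_gen_eval_args --mode cli-augmented --time-budget 30 --categories lock-lifecycle auth-boundary
echo "mode=$MODE budget=$TIME_BUDGET categories=$CATEGORIES"
```

Run before the defaults in step 3, this leaves unset variables empty so the ${VAR:-default} expansions still apply.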

5. Report Results

After execution, display a summary:

If reports were generated, read and summarize the markdown report
Show pass rate, coverage %, and any failures
If EXIT_CODE != 0, highlight failing scenarios and suggest /gen-eval-scenario for authoring targeted scenarios

if [ -f "$OUTPUT_DIR/gen-eval-report.md" ]; then
  echo ""
  echo "=== Gen-Eval Report ==="
  cat "$OUTPUT_DIR/gen-eval-report.md"
fi
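When summarizing, the pass rate is the share of scenarios that passed, compared against the threshold (95% by default, per the Output section). The sketch below shows only that arithmetic. The function name meets_threshold and its count arguments are hypothetical, and treating an empty run as a failure is an assumption; how gen-eval computes the rate internally is not stated here.

```shell
# Hypothetical helper: succeed when passed/total meets a percentage threshold.
# Integer arithmetic avoids floating point: passed*100 >= threshold*total.
meets_threshold() {
  local passed="$1" total="$2" threshold="${3:-95}"
  # Assumption: a run with no scenarios does not count as passing
  [ "$total" -gt 0 ] || return 1
  [ $(( passed * 100 )) -ge $(( threshold * total )) ]
}

# 19 of 20 is exactly 95%, so this prints "pass"
if meets_threshold 19 20; then echo "pass"; else echo "fail"; fi
# 18 of 20 is 90%, so this prints "fail"
if meets_threshold 18 20; then echo "pass"; else echo "fail"; fi
```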

Quick Start

The simplest invocation — auto-detects the descriptor and runs template-only:

/gen-eval

With CLI-augmented generation (subscription-covered):

/gen-eval --mode cli-augmented --time-budget 30

Augmenting the cli-augmented prompt with OpenSpec scenarios from an active change:

/gen-eval --mode cli-augmented --openspec-change my-feature-change-id
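Scenarios generated under --openspec-change carry a source.openspec_scenario value of the form path:line-start-line-end, as described under Arguments. When one of them fails, that value can be split to locate the originating Requirement. This is a sketch only: the function name split_scenario_ref, the SCENARIO_* variables, and the sample path are hypothetical.

```shell
# Hypothetical helper: split "path.md:START-END" into file, start line, end line.
split_scenario_ref() {
  local ref="$1"
  local range="${ref##*:}"        # text after the last colon, e.g. "12-30"
  SCENARIO_FILE="${ref%:*}"       # text before the last colon
  SCENARIO_START="${range%-*}"    # part of the range before the hyphen
  SCENARIO_END="${range#*-}"      # part of the range after the hyphen
}

# Sample value for illustration only
split_scenario_ref "openspec/changes/my-feature-change-id/specs/locks/spec.md:12-30"
echo "$SCENARIO_FILE lines $SCENARIO_START to $SCENARIO_END"

# Print the originating block when the file is present
if [ -f "$SCENARIO_FILE" ]; then
  sed -n "${SCENARIO_START},${SCENARIO_END}p" "$SCENARIO_FILE"
fi
```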

Against specific categories:

/gen-eval --categories lock-lifecycle auth-boundary

Integration Points

/validate-feature: Gen-eval runs as phase 4b (between smoke and e2e). Auto-detected when descriptors exist.
/explore-feature: Gen-eval report signals (failing interfaces, coverage gaps) feed into feature opportunity ranking.
/gen-eval-scenario: Create new scenario YAML files interactively.
make gen-eval: Makefile shorthand for the most common invocation.

Output

gen-eval-report.md — Markdown report with pass/fail summary
gen-eval-report.json — Machine-readable results
gen-eval-metrics.json — Per-scenario metrics for pipeline integration
Exit code 0 if the pass rate meets the threshold (default 95%), 1 otherwise. An invalid --openspec-change value exits with status 64 before any scenarios run.
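For pipeline integration, the exit statuses described in this document can be mapped to short messages. The sketch below is illustrative; the function name describe_exit_code and the message wording are hypothetical, and only the statuses 0, 1, and 64 are taken from the text.

```shell
# Hypothetical helper: translate a gen-eval exit status into a short message.
describe_exit_code() {
  case "$1" in
    0)  echo "pass rate met the threshold" ;;
    1)  echo "pass rate below the threshold" ;;
    64) echo "invalid --openspec-change value" ;;
    *)  echo "unexpected exit status $1" ;;
  esac
}

describe_exit_code 1
```

In a CI job this could be called with the EXIT_CODE captured in step 4 before the job exits with that same status.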

Run generator-evaluator testing against live services

4 stars

testing

Updated May 15, 2026

Install this skill globally with one skillsauth command:

npx skillsauth add jankneumann/agentic-coding-tools gen-eval

Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Clean (95%). Container and dependency vulnerability scanner.
Semgrep: Clean (95%). Static code analysis for vulnerabilities.
mcp-scan (Snyk): Clean (95%). Model Context Protocol security validation.
Snyk (dep): Skipped (50%). Open source security scanning.
Socket.dev: Skipped (50%). Supply chain security analysis.
VirusTotal: Skipped (50%). Multi-engine malware detection.
CrowdStrike: Skipped (50%). Advanced threat intelligence.
OSV-Scanner: Skipped (50%). Open Source Vulnerability database check.
OWASP Dep-Check: Skipped (50%).

Last scanned: May 15, 2026, 7:44 AM (123.0s, 1 file scanned)

SKILL.md

name: gen-eval
description: Run generator-evaluator testing against live services
category: Testing
tags: [testing, gen-eval, evaluation, scenarios, generator, evaluator]

Install manually

# Clone the repo
git clone https://github.com/jankneumann/agentic-coding-tools.git

# Copy into the Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r agentic-coding-tools/skills/gen-eval ~/.claude/skills/
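The manual copy can be wrapped so that a missing source or a failed copy is caught right away. This is a sketch under the assumption that a skill directory is valid when it contains SKILL.md; the function name install_skill is hypothetical.

```shell
# Hypothetical helper: copy a skill directory into a skills folder and
# confirm that SKILL.md arrived at the destination.
install_skill() {
  local src="${1%/}" dest_root="$2"   # drop any trailing slash from the source
  [ -f "$src/SKILL.md" ] || { echo "ERROR: $src has no SKILL.md" >&2; return 1; }
  mkdir -p "$dest_root" || return 1
  cp -r "$src" "$dest_root/" || return 1
  [ -f "$dest_root/$(basename "$src")/SKILL.md" ]
}

# Usage, with the paths from the manual steps:
# install_skill agentic-coding-tools/skills/gen-eval "$HOME/.claude/skills"
```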

Repository

jankneumann/agentic-coding-tools

4 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT