framework/devtools/skills/write-agent-benchmarks/SKILL.md
Create, maintain, and run evidence-based benchmarks for AI agents. Use when setting up testing infrastructure, writing new test scenarios, or evaluating agent performance.
This skill defines a universal, language-agnostic standard for benchmarking Autonomous AI Agents. The goal is to objectively measure an agent's ability to solve real-world tasks, whether they are coding, data analysis, or conversational.
The system supports three primary evaluation modes:
Quality Evaluation (Checklist-based):
Model Selection (Pairwise Comparison):
Version Comparison (Regression Tracking):
Choosing the right interaction strategy is critical for stable benchmarks.
A robust benchmarking system consists of five key modules.
The isolated state container where the task is performed. It is not limited to a file system.
Setup (initial state), Reset (between runs), and Teardown.
The central controller managing the test lifecycle.
For interactive agents that ask clarifying questions.
The logic that determines if a test passed or failed based on Evidence.
Complete capture of the agent's lifecycle in a single human-readable file (e.g., trace.md or trace.json).
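As a rough illustration of how these modules might fit together, here is a minimal sketch with an in-memory sandbox, a runner function, a judge, and a trace. Every name in it (`Sandbox`, `InMemorySandbox`, `runScenario`, `Agent`, `Judge`) is hypothetical, since the text describes roles rather than a concrete API. The user-emulator module is left out for brevity.

```typescript
// Hypothetical names throughout: the text describes roles, not a concrete API.
interface Sandbox {
  state: Map<string, string>; // any state store; not necessarily a file system
  setup(): void; // create the initial state
  reset(): void; // restore the initial state between runs
  teardown(): void; // release everything
}

type Agent = (prompt: string, sandbox: Sandbox, trace: string[]) => void;
type Judge = (sandbox: Sandbox, trace: string[]) => boolean;

class InMemorySandbox implements Sandbox {
  state = new Map<string, string>();
  private readonly initial: Record<string, string>;

  constructor(initial: Record<string, string>) {
    this.initial = initial;
  }

  setup(): void {
    this.reset();
  }

  reset(): void {
    this.state = new Map(Object.entries(this.initial));
  }

  teardown(): void {
    this.state.clear();
  }
}

// Runner: owns the lifecycle and judges on evidence before tearing down.
function runScenario(
  prompt: string,
  sandbox: Sandbox,
  agent: Agent,
  judge: Judge,
): { passed: boolean; trace: string[] } {
  const trace: string[] = [`prompt: ${prompt}`];
  sandbox.setup();
  try {
    agent(prompt, sandbox, trace);
    return { passed: judge(sandbox, trace), trace };
  } finally {
    sandbox.teardown();
  }
}
```

The judge runs inside the `try` block so that it inspects the sandbox state before teardown discards it.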
Follow this process to add a new benchmark scenario.
What specific capability are you testing?
Create the initial state.
Write the prompt that instructs the agent.
How do we know it worked?
script.py runs with exit code 0.
users has 1 new row.
GET /api/v1/flights with correct parameters.
Add the scenario to your Runner's registry.
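The success criteria above (an exit code, a new database row, an API call) could be expressed as small evidence checks. The sketch below assumes a hypothetical `Evidence` snapshot that a runner collects after the agent finishes; the shape and the helper names (`exitsZero`, `gainsRows`, `httpCalled`, `evaluate`) are illustrative and not part of any framework.

```typescript
// Hypothetical evidence snapshot collected by the runner after the agent finishes.
interface Evidence {
  exitCodes: Record<string, number>; // command -> exit code
  rowCounts: Record<string, { before: number; after: number }>; // table -> counts
  httpCalls: { method: string; path: string; query: Record<string, string> }[];
}

interface CheckResult {
  id: string;
  status: "pass" | "fail";
  reason: string;
}

// A check returns null on success, or a human-readable failure reason.
interface Check {
  id: string;
  run: (e: Evidence) => string | null;
}

const exitsZero = (command: string): Check => ({
  id: `exit_zero:${command}`,
  run: (e) =>
    e.exitCodes[command] === 0 ? null : `${command} exited with ${e.exitCodes[command]}`,
});

const gainsRows = (table: string, n: number): Check => ({
  id: `rows_added:${table}`,
  run: (e) => {
    const counts = e.rowCounts[table];
    return counts !== undefined && counts.after - counts.before === n
      ? null
      : `${table} did not gain exactly ${n} row(s)`;
  },
});

const httpCalled = (method: string, path: string, query: Record<string, string>): Check => ({
  id: `http:${method} ${path}`,
  run: (e) =>
    e.httpCalls.some((call) =>
        call.method === method &&
        call.path === path &&
        Object.entries(query).every(([key, value]) => call.query[key] === value)
      )
      ? null
      : `no ${method} ${path} call with the expected parameters`,
});

function evaluate(checks: Check[], e: Evidence): CheckResult[] {
  return checks.map((check): CheckResult => {
    const failure = check.run(e);
    return failure === null
      ? { id: check.id, status: "pass", reason: "evidence matched" }
      : { id: check.id, status: "fail", reason: failure };
  });
}
```

Returning a reason alongside the status keeps a failed run debuggable from the result alone, without reopening the trace.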
If a benchmark fails, check the Trace:
When a scenario fails, especially a verbatim_relay / mock-reached-agent
check, do NOT jump straight to rewriting SKILL.md. The test infrastructure
itself is the most common culprit and the silent-failure mode is real
(see the flowai bench history: PreToolUse key camelCase typo silenced
all Claude-adapter mocks for months, and "passing" scenarios were passing
on pattern-matching, not hook interception).
Run this checklist first:
1. ls <sandbox>/.claude/hooks/ should show mock-<tool>.sh. Absence → adapter didn't call setupMocks, or sandbox was overwritten.
2. cat <sandbox>/.claude/settings.local.json: the top-level event key MUST be PascalCase (PreToolUse, PostToolUse). Claude Code silently ignores camelCase; no warning.
3. If the command starts with an env assignment (e.g. CLAUDECODE="" claude -p …), a naive Bash(<tool>:*) matcher will NOT fire, because the first token is the env assignment. The flowai adapter uses a broad Bash matcher plus in-script filtering; replicate that pattern if you add a new adapter.
4. Give the mock output a sentinel (e.g. [benchmock-<6-hex>]) that is guaranteed absent from the skill's own SKILL.md and examples. Then grep the judge output for the sentinel: present → hook fired and agent quoted it; absent → synthesis or pattern-match, NOT relay. This is the only robust signal that a mock actually reached the agent.
5. Do not add rules to the skill about mock framing (e.g. "preserve the MOCK: prefix"). Mock-prefix scaffolding is not something real CLIs emit; teaching the agent to preserve it corrupts real-world behaviour. Design the mock so its distinctive content, not the framing, is what proves the relay.

Only after steps 1–5 pass is it safe to conclude the skill itself is at fault and edit SKILL.md. Skipping this checklist and iterating on the skill text wastes bench cycles and frequently introduces regressions (e.g. "mandatory capture-to-file" rules that don't help because hooks block before the shell redirect executes).
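Two of the checks above lend themselves to automation: spotting a mis-cased hook event key and confirming that a sentinel reached the judged output. The following is a minimal sketch. It assumes the event keys live under a `hooks` object in the parsed settings file, and the function names (`miscasedHookKeys`, `makeSentinel`, `sentinelRelayed`) are hypothetical helpers, not part of flowai or Claude Code.

```typescript
// Event names the checklist says must be PascalCase.
const KNOWN_EVENTS = ["PreToolUse", "PostToolUse"];

// Assumption: hook events are keys of a `hooks` object in the parsed settings JSON.
// Returns keys that match a known event name except for letter case.
function miscasedHookKeys(settings: { hooks?: Record<string, unknown> }): string[] {
  return Object.keys(settings.hooks ?? {}).filter((key) =>
    !KNOWN_EVENTS.includes(key) &&
    KNOWN_EVENTS.some((known) => known.toLowerCase() === key.toLowerCase())
  );
}

// Builds a sentinel of the form [benchmock-<6-hex>].
function makeSentinel(): string {
  const hex = Math.floor(Math.random() * 0x1000000).toString(16).padStart(6, "0");
  return `[benchmock-${hex}]`;
}

// True when the sentinel appears verbatim in the judged output.
function sentinelRelayed(judgeOutput: string, sentinel: string): boolean {
  return judgeOutput.includes(sentinel);
}
```

A non-empty result from `miscasedHookKeys` points at the silent-failure mode described above, where a camelCase key is ignored without any warning.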
Execution scenarios prove "when skill X runs, it works." They do NOT prove that the model picks skill X for a relevant query, or that it stands down for an unrelated one. Trigger scenarios close that gap: they verify description-matching correctness.
framework/<pack>/skills/*. Commands (commands/) carry disable-model-invocation: true and are out of scope.
framework/<pack>/skills/<skill-id>/benchmarks/
  trigger-pos-1/mod.ts
  trigger-adj-1/mod.ts
  trigger-false-1/mod.ts
<skill-id>-trigger-<pos|adj|false>-1 (the trailing -1 is preserved for backward compatibility with trace tooling; only n=1 is permitted).
scripts/check-trigger-coverage.ts (wired into deno task check) fails if any of the 3 are missing, or if stray trigger-{type}-{2,3,...} directories exist.
With N=1, each query carries the full description-match weight for its class, so pick the phrasing most likely to expose a description regression.
trigger-pos-1: a natural, short user query that matches the skill's description. No /skill-name prefix (that bypasses description-matching), no over-specified jargon, no hints at internal mechanics. Pick the phrasing a typical user would write (the least-jargonized form) so the test stresses description match, not exact wording.
trigger-adj-1: a query for which a different, neighboring skill is the correct match. Pick the most-likely confusion candidate from the same pack or with overlapping vocabulary. Typical confusion patterns: a "fix this test" skill vs. a "review my diff" skill (overlap on "I broke something"); a single-task planner vs. a multi-phase epic planner (overlap on "plan"); a current-session reflection vs. a historical-sessions reflection (overlap on "reflect").
trigger-false-1: a query inside the skill's general domain but with the wrong intent. Recommended patterns: surface vocabulary that matches but the actual ask is something else (e.g., a planning skill receiving "plan" in a non-software-task sense; a fix-tests skill receiving a "speed up the test runner" perf request); reverse-intent traps (e.g., write new tests vs. fix failing ones). Do NOT use meta-questions about the skill itself ("what does X cover?", "how does X work?", "when should I use X?") as false-use: under Claude Code these are legitimately answered by reading the skill's SKILL.md, so the agent will rightly load it and the judge will record activation. Treat meta-questions as positives or omit them.
Every trigger scenario carries exactly one critical checklist item.
trigger-pos-*:
checklist = [{
  id: "skill_invoked",
  description: "Did the agent load and act on `<skill-id>` in response to this query? Look in the trace for a `Skill` tool call or a read of the skill's `SKILL.md` for `<skill-id>`.",
  critical: true,
}];
trigger-adj-* and trigger-false-*:
checklist = [{
  id: "skill_not_invoked",
  description: "Did the agent AVOID loading `<skill-id>`? For this query the skill is not appropriate; the agent should either invoke a different skill or respond directly without reading `<skill-id>/SKILL.md` or calling the `Skill` tool with `<skill-id>`.",
  critical: true,
}];
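Because the two checklists differ only by scenario class, they could be generated from one helper. This is a sketch under that assumption; `triggerChecklist`, `triggerScenarioId`, and the `ChecklistItem` shape are hypothetical conveniences, and the description strings are the ones given above with the skill id substituted.

```typescript
type TriggerKind = "pos" | "adj" | "false";

interface ChecklistItem {
  id: string;
  description: string;
  critical: boolean;
}

// Scenario id convention: <skill-id>-trigger-<pos|adj|false>-1
function triggerScenarioId(skillId: string, kind: TriggerKind): string {
  return `${skillId}-trigger-${kind}-1`;
}

// Exactly one critical item per trigger scenario.
function triggerChecklist(skillId: string, kind: TriggerKind): ChecklistItem[] {
  if (kind === "pos") {
    return [{
      id: "skill_invoked",
      description: "Did the agent load and act on `" + skillId +
        "` in response to this query? Look in the trace for a `Skill` tool call" +
        " or a read of the skill's `SKILL.md` for `" + skillId + "`.",
      critical: true,
    }];
  }
  // Adjacent and false-use scenarios share the negative check.
  return [{
    id: "skill_not_invoked",
    description: "Did the agent AVOID loading `" + skillId +
      "`? For this query the skill is not appropriate; the agent should either" +
      " invoke a different skill or respond directly without reading `" + skillId +
      "/SKILL.md` or calling the `Skill` tool with `" + skillId + "`.",
    critical: true,
  }];
}
```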
mod.ts:
import { AcceptanceTestScenario } from "@acceptance-tests/types.ts";

export const TriggerPos1 = new class extends AcceptanceTestScenario {
  id = "<skill-id>-trigger-pos-1";
  name = "<short label, e.g. 'natural fix-tests query'>";
  skill = "<skill-id>";
  agentsTemplateVars = { PROJECT_NAME: "Sandbox" };
  // Natural phrasing only: no /skill-name prefix, no internal jargon.
  userQuery = "<natural user query>";
  // Exactly one critical item per trigger scenario.
  checklist = [{
    id: "skill_invoked",
    description:
      "Did the agent load and act on `<skill-id>` in response to this query? Look in the trace for a `Skill` tool call or a read of the skill's `SKILL.md` for `<skill-id>`.",
    critical: true,
  }];
}();
Before scaling, write one positive scenario and run it; confirm the judge correctly fails the run when the skill's description is mangled to be unrelated. Then revert the description. This validates the pattern end-to-end. See SRS FR-ACCEPT.TRIGGER, SDS §3.4.2.
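The coverage rule stated earlier (exactly one each of trigger-pos-1, trigger-adj-1, and trigger-false-1, with no higher-numbered siblings) can be pictured as a pure function over directory names. This sketch is not the contents of scripts/check-trigger-coverage.ts; `checkTriggerCoverage` is a hypothetical name, and the real script presumably also walks the file system.

```typescript
const REQUIRED_TRIGGER_DIRS = ["trigger-pos-1", "trigger-adj-1", "trigger-false-1"];

// Input: directory names found under <skill-id>/benchmarks/.
function checkTriggerCoverage(benchmarkDirs: string[]): { missing: string[]; stray: string[] } {
  const missing = REQUIRED_TRIGGER_DIRS.filter((dir) => !benchmarkDirs.includes(dir));
  // Any trigger-<type>-<n> directory outside the required three is stray (e.g. n = 2, 3, ...).
  const stray = benchmarkDirs.filter((dir) =>
    /^trigger-(pos|adj|false)-\d+$/.test(dir) && !REQUIRED_TRIGGER_DIRS.includes(dir)
  );
  return { missing, stray };
}

// A skill passes only when both lists are empty.
function coverageOk(benchmarkDirs: string[]): boolean {
  const { missing, stray } = checkTriggerCoverage(benchmarkDirs);
  return missing.length === 0 && stray.length === 0;
}
```

Directories that do not follow the trigger naming pattern, such as ordinary execution scenarios, are ignored by this check.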
Subagents (files under framework/<pack>/agents/<name>.md) MUST be tested through their wrapping skill scenario, not as standalone AcceptanceTestAgentScenario runs. The framework spawns the main runtime in -p mode with userQuery as the user message; the agent .md file is copied to .claude/agents/ only as a template the main runtime may dispatch to. There is no path that loads the subagent's body as a system prompt for direct execution. A standalone AcceptanceTestAgentScenario therefore tests the main runtime's behaviour given access to the agent template — NOT the agent's body.
Two consequences:
via-subagent-style scenario (parent skill → Agent/Task tool → subagent → mocked CLI → relay back). Checklists should gate on the parent-side dispatch (worker_subagent_invoked) and the relay signal (mock_content_relayed), both of which ARE observable from the flat trace.
Precedent: existing worker-style subagents in the framework have no acceptance-tests/ directory of their own; they are tested only via their orchestrating skill's scenarios.
The trace produced by formatAgentLogs does NOT preserve parent-vs-subagent nesting. When a parent invokes Agent(subagent_type=...), the subagent's internal Bash calls appear at the same top-level as the parent's tool calls. Avoid checklist items like "no Bash("codex …") in the parent" — the judge cannot distinguish parent-side from worker-side Bash from the flat trace alone. Instead gate on the presence of the Agent/Task dispatch and on the relay-content signal; together they imply the worker did the work.
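To make the gating concrete, the sketch below evaluates the two signals on a flat list of tool calls. The `ToolCall` shape is an assumption made for illustration (the actual output of formatAgentLogs is not shown here), and the function names simply mirror the checklist ids mentioned above.

```typescript
// Hypothetical flat trace entry: parent and subagent calls share one level.
interface ToolCall {
  tool: string;
  input: string;
}

// Parent-side dispatch: an Agent or Task tool call appears somewhere in the trace.
function workerSubagentInvoked(trace: ToolCall[]): boolean {
  return trace.some((call) => call.tool === "Agent" || call.tool === "Task");
}

// Relay signal: the distinctive mock content shows up in the final answer.
function mockContentRelayed(finalAnswer: string, distinctiveContent: string): boolean {
  return finalAnswer.toLowerCase().includes(distinctiveContent.toLowerCase());
}

// Both signals together imply the worker did the work.
function viaSubagentPassed(trace: ToolCall[], finalAnswer: string, distinctiveContent: string): boolean {
  return workerSubagentInvoked(trace) && mockContentRelayed(finalAnswer, distinctiveContent);
}
```

Note that nothing here asks which Bash call belongs to whom, because that information is not recoverable from the flat trace.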
The framework's PreToolUse(Bash) hook strips env assignments (FOO=bar, CLAUDECODE="") and subshell wrappers (( … ) &, backticks, $(…)) before checking whether the first bare command word matches a mocked tool. It does NOT inspect piped commands — echo "..." | codex exec - matches echo, not codex, and the mock never fires.
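The matching behaviour described above can be approximated in a few lines. This is a simplified sketch, not the framework's hook: `firstCommandWord` and `mockWouldFire` are hypothetical names, tokens are split on whitespace only, and an env value containing spaces (for example FOO="a b") would confuse it.

```typescript
// Returns the first bare command word, or null for an empty command.
function firstCommandWord(command: string): string | null {
  // Drop a leading subshell wrapper: "(", a backtick, or "$(".
  const unwrapped = command.trim().replace(/^(\$\(|[(`])\s*/, "");
  for (const token of unwrapped.split(/\s+/)) {
    // Skip env assignments such as FOO=bar or CLAUDECODE="".
    if (/^[A-Za-z_][A-Za-z0-9_]*=/.test(token)) continue;
    return token === "" ? null : token;
  }
  return null;
}

// Only the first command word is compared; later pipeline stages are never inspected.
function mockWouldFire(command: string, mockedTool: string): boolean {
  return firstCommandWord(command) === mockedTool;
}
```

Under this sketch, a command that pipes into the mocked tool resolves to the word before the pipe, which is why the mock never fires for the piped form.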
Consequences when authoring scenarios that mock a target CLI:
codex exec "$P", not echo "$P" | codex exec -).
The framework injects mock strings like [benchmock-xxx] CODEX-MOCK: <body>. Agents under a courier rule (verbatim relay of a child runtime's stdout, common in cross-IDE or LLM-as-judge skills) will reasonably strip [benchmock-xxx] and <TOOL>-MOCK: prefixes as harness artefacts; that is correct behaviour per such a contract, not a relay failure.
To verify relay actually happened, embed a deliberately-absurd phrase inside the mock body and check for that phrase in the final answer. The phrase must be:
Examples drawn from passing scenarios: alphabetise your tuples on Wednesdays, octopus-shaped type definitions, tag mutable state with marigold-coloured comments, alphabetise trailing semicolons before lowercase Friday refactors.
Bench-prefix tokens ([benchmock-xxx]) fail both conditions: they ARE in the harness artefact category, and the courier rule permits stripping them. Don't gate on them.
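The difference between gating on the prefix and gating on the body can be shown with a small simulation. The sketch below assumes a courier that strips both prefixes exactly as described; `stripHarnessPrefixes` and `relayVerified` are hypothetical helpers, and the token a1b2c3 is made up for the example.

```typescript
// Simulates a courier-style agent that removes harness artefacts before relaying.
function stripHarnessPrefixes(mockLine: string): string {
  return mockLine
    .replace(/^\[benchmock-[0-9a-z]+\]\s*/, "") // bench-prefix token
    .replace(/^[A-Z]+-MOCK:\s*/, ""); // <TOOL>-MOCK: prefix
}

// Gate on the absurd phrase inside the body, never on the prefix.
function relayVerified(finalAnswer: string, absurdPhrase: string): boolean {
  return finalAnswer.includes(absurdPhrase);
}

const mockLine = "[benchmock-a1b2c3] CODEX-MOCK: Remember to alphabetise your tuples on Wednesdays.";
const relayed = stripHarnessPrefixes(mockLine);

// The phrase survives the courier; the bench-prefix token does not.
const phraseSurvives = relayVerified(relayed, "alphabetise your tuples on Wednesdays");
const prefixSurvives = relayed.includes("[benchmock-a1b2c3]");
```

A checklist that looked for the prefix would fail this correct relay, while one that looks for the phrase passes it.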
To ensure cross-platform compatibility, benchmark results must follow a standard JSON schema.
{
  "scenario_id": "string",
  "outcome": "pass|fail",
  "score": 0-100,
  "metrics": {
    "duration_ms": 1200,
    "cost_usd": 0.01,
    "steps_taken": 5,
    "tokens_used": 1500
  },
  "evidence": {
    "artifacts": ["file_paths"],
    "logs": ["log_entries"]
  },
  "checklist": [
    { "id": "check_1", "status": "pass", "reason": "..." }
  ]
}
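For consumers written in TypeScript, the schema above might map onto a type plus a small sanity check. This is a sketch: the field names come from the schema, while the `BenchmarkResult` and `validateResult` names, the assumption that checklist status uses the same pass/fail values as outcome, and the non-negativity rule for metrics are illustrative choices.

```typescript
type Status = "pass" | "fail";

interface BenchmarkResult {
  scenario_id: string;
  outcome: Status;
  score: number; // 0-100
  metrics: {
    duration_ms: number;
    cost_usd: number;
    steps_taken: number;
    tokens_used: number;
  };
  evidence: {
    artifacts: string[]; // file paths
    logs: string[]; // log entries
  };
  // Assumption: checklist status shares the pass/fail vocabulary of outcome.
  checklist: { id: string; status: Status; reason: string }[];
}

// Returns a list of problems; an empty list means the result looks well-formed.
function validateResult(result: BenchmarkResult): string[] {
  const errors: string[] = [];
  if (result.scenario_id.trim() === "") errors.push("scenario_id is empty");
  if (!Number.isFinite(result.score) || result.score < 0 || result.score > 100) {
    errors.push("score must be within 0-100");
  }
  for (const [name, value] of Object.entries(result.metrics)) {
    if (!Number.isFinite(value) || value < 0) {
      errors.push(`metrics.${name} must be a non-negative number`);
    }
  }
  const seen: string[] = [];
  for (const item of result.checklist) {
    if (seen.includes(item.id)) errors.push(`duplicate checklist id: ${item.id}`);
    seen.push(item.id);
  }
  return errors;
}
```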
temperature: 0).