framework/devtools/skills/flowai-skill-write-agent-benchmarks/SKILL.md
Create, maintain, and run evidence-based benchmarks for AI agents. Use when setting up testing infrastructure, writing new test scenarios, or evaluating agent performance.
npx skillsauth add korchasa/flow flowai-skill-write-agent-benchmarks
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill defines a universal, language-agnostic standard for benchmarking Autonomous AI Agents. The goal is to objectively measure an agent's ability to solve real-world tasks, whether they are coding, data analysis, or conversational.
The system supports three primary evaluation modes:
Quality Evaluation (Checklist-based):
Model Selection (Pairwise Comparison):
Version Comparison (Regression Tracking):
Choosing the right interaction strategy is critical for stable benchmarks.
A robust benchmarking system consists of five key modules.
The isolated state container where the task is performed. It is not limited to a file system.
Setup (initial state), Reset (between runs), and Teardown.
The central controller managing the test lifecycle.
For interactive agents that ask clarifying questions.
The logic that determines if a test passed or failed based on Evidence.
Complete capture of the agent's lifecycle in a single human-readable file (e.g., trace.md or trace.json).
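As a rough sketch of how these five modules might fit together, the TypeScript below wires an in-memory environment, a scripted user simulator, an evaluator, and a trace into a small runner. Every name here (`Environment`, `InMemoryEnvironment`, `runScenario`, and so on) is hypothetical and chosen for illustration; none of it is part of the flowai tooling, and the "agent" is reduced to a fixed list of steps.

```typescript
// Hypothetical module shapes; none of these names come from a real library.
interface Environment {
  setup(): void; // create the initial state
  reset(): void; // restore the initial state between runs
  teardown(): void; // release resources
  state(): Record<string, string>;
}

// Answers clarifying questions on behalf of the user.
type UserSimulator = (question: string) => string;

// A scripted agent step: either ask the user or write to the environment.
type AgentStep =
  | { kind: "ask"; question: string }
  | { kind: "write"; key: string; value: string };

// Decides pass/fail from evidence (here, the final environment state).
type Evaluator = (evidence: Record<string, string>) => boolean;

class InMemoryEnvironment implements Environment {
  private data: Record<string, string> = {};
  private readonly initial: Record<string, string>;
  constructor(initial: Record<string, string>) {
    this.initial = initial;
  }
  setup(): void {
    this.data = { ...this.initial };
  }
  reset(): void {
    this.setup();
  }
  teardown(): void {
    this.data = {};
  }
  state(): Record<string, string> {
    return { ...this.data };
  }
  write(key: string, value: string): void {
    this.data[key] = value;
  }
}

// Runner: drives the lifecycle and records every event in a trace.
function runScenario(
  env: InMemoryEnvironment,
  steps: AgentStep[],
  user: UserSimulator,
  evaluate: Evaluator,
): { passed: boolean; trace: string[] } {
  const trace: string[] = [];
  env.setup();
  try {
    for (const step of steps) {
      if (step.kind === "ask") {
        trace.push(`agent asked: ${step.question}`);
        trace.push(`user replied: ${user(step.question)}`);
      } else {
        env.write(step.key, step.value);
        trace.push(`agent wrote: ${step.key}`);
      }
    }
    const passed = evaluate(env.state());
    trace.push(`verdict: ${passed ? "pass" : "fail"}`);
    return { passed, trace };
  } finally {
    env.teardown(); // always clean up, even if a step throws
  }
}
```

The evaluator only sees the environment state, never the agent's own claims, which keeps the verdict tied to evidence.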
Follow this process to add a new benchmark scenario.
What specific capability are you testing?
Create the initial state.
Write the prompt that instructs the agent.
How do we know it worked?
script.py runs with exit code 0.
users has 1 new row.
GET /api/v1/flights with correct parameters.
Add the scenario to your Runner's registry.
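The success criteria above can be written as small predicates over captured evidence. The sketch below assumes a hypothetical `Evidence` record (exit code, number of new `users` rows, recorded HTTP calls); the field names and check ids are illustrative only, and the API check looks at method and path without validating parameters.

```typescript
// Hypothetical evidence captured by the runner after the agent finishes.
interface Evidence {
  exitCode: number;
  newUserRows: number;
  httpCalls: { method: string; path: string }[];
}

interface CheckResult {
  id: string;
  status: "pass" | "fail";
  reason: string;
}

// A check returns null on success, or a failure reason.
interface Check {
  id: string;
  run: (e: Evidence) => string | null;
}

const checks: Check[] = [
  {
    id: "script_exit_0",
    run: (e) => (e.exitCode === 0 ? null : `exit code was ${e.exitCode}`),
  },
  {
    id: "users_one_new_row",
    run: (e) =>
      e.newUserRows === 1 ? null : `expected 1 new row, saw ${e.newUserRows}`,
  },
  {
    id: "flights_api_called",
    run: (e) =>
      e.httpCalls.some((c) => c.method === "GET" && c.path === "/api/v1/flights")
        ? null
        : "no GET /api/v1/flights call recorded",
  },
];

// Runs every check and reports one result per criterion.
function evaluateEvidence(e: Evidence): CheckResult[] {
  return checks.map(({ id, run }): CheckResult => {
    const failure = run(e);
    return failure === null
      ? { id, status: "pass", reason: "criterion met" }
      : { id, status: "fail", reason: failure };
  });
}
```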
If a benchmark fails, check the Trace:
When a scenario fails, especially a verbatim_relay / mock-reached-agent
check, do NOT jump straight to rewriting SKILL.md. The test infrastructure
itself is the most common culprit and the silent-failure mode is real
(see the flowai bench history: PreToolUse key camelCase typo silenced
all Claude-adapter mocks for months, and "passing" scenarios were passing
on pattern-matching, not hook interception).
Run this checklist first:
1. ls <sandbox>/.claude/hooks/ should show mock-<tool>.sh. Absence → adapter didn't call setupMocks, or sandbox was overwritten.
2. cat <sandbox>/.claude/settings.local.json: the top-level event key MUST be PascalCase (PreToolUse, PostToolUse). Claude Code silently ignores camelCase; no warning.
3. If the command starts with an env assignment (e.g. CLAUDECODE="" claude -p …), a naive Bash(<tool>:*) matcher will NOT fire, because the first token is the env assignment. The flowai adapter uses a broad Bash matcher plus in-script filtering; replicate that pattern if you add a new adapter.
4. Put a unique sentinel in the mock output (e.g. [benchmock-<6-hex>]) that is guaranteed absent from the skill's own SKILL.md and examples. Then grep the judge output for the sentinel: present → hook fired and agent quoted it; absent → synthesis or pattern-match, NOT relay. This is the only robust signal that a mock actually reached the agent.
5. Do not teach the skill to preserve mock framing (e.g. "keep the MOCK: prefix"). Mock-prefix scaffolding is not something real CLIs emit; teaching the agent to preserve it corrupts real-world behaviour. Design the mock so its distinctive content is what proves the relay, not the framing.
Only after steps 1–5 pass is it safe to conclude the skill itself is at fault and edit SKILL.md. Skipping this checklist and iterating on the skill text wastes bench cycles and frequently introduces regressions (e.g. "mandatory capture-to-file" rules that don't help because hooks block before the shell redirect executes).
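Two of the checks in the list above lend themselves to automation. The sketch below flags hook event keys that are not PascalCase and tests whether a sentinel appears in captured output. It assumes the event keys sit under a `hooks` object in the parsed settings.local.json; the function names are hypothetical helpers, not part of the flowai adapter.

```typescript
// Returns event keys that are not PascalCase (e.g. "preToolUse").
// Assumption: hook events are keyed under a top-level `hooks` object.
function findMisCasedHookKeys(
  settings: { hooks?: Record<string, unknown> },
): string[] {
  return Object.keys(settings.hooks ?? {}).filter(
    (key) => !/^[A-Z][A-Za-z]*$/.test(key),
  );
}

// Builds a sentinel of the form [benchmock-<6-hex>].
function makeSentinel(): string {
  const hex = Math.floor(Math.random() * 0x1000000)
    .toString(16)
    .padStart(6, "0");
  return `[benchmock-${hex}]`;
}

// "relay" when the sentinel appears verbatim in the output; otherwise the
// agent synthesized or pattern-matched, which does not prove the hook fired.
function classifyRelay(output: string, sentinel: string): "relay" | "no-relay" {
  return output.includes(sentinel) ? "relay" : "no-relay";
}
```

A random sentinel per run makes it very unlikely that the agent reproduces the token from the skill text or from memory.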
Execution scenarios prove "when skill X runs, it works." They do NOT prove that the model picks skill X for a relevant query, or that it stands down for an unrelated one. Trigger scenarios close that gap: they verify description-matching correctness.
framework/<pack>/skills/flowai-skill-*/. Commands (commands/) carry disable-model-invocation: true and are out of scope.
framework/<pack>/skills/<skill-id>/benchmarks/
  trigger-pos-1/mod.ts
  trigger-adj-1/mod.ts
  trigger-false-1/mod.ts
<skill-id>-trigger-<pos|adj|false>-1 (the trailing -1 is preserved for backward compatibility with trace tooling; only n=1 is permitted).
scripts/check-trigger-coverage.ts (wired into deno task check) fails if any of the 3 are missing, or if stray trigger-{type}-{2,3,...} directories exist.
With N=1, each query carries the full description-match weight for its class, so pick the phrasing most likely to expose a description regression.
Positive (trigger-pos-1): a natural, short user query that matches the skill's description. No /skill-name prefix (that bypasses description-matching), no over-specified jargon, no hints at internal mechanics. Pick the phrasing a typical user would write, the least-jargonized form, so the test stresses description match, not exact wording.
Adjacent (trigger-adj-1): a query for which a different, neighboring skill is the correct match. Pick the most-likely confusion candidate from the same pack or with overlapping vocabulary. Typical confusion patterns: a "fix this test" skill vs. a "review my diff" skill (overlap on "I broke something"); a single-task planner vs. a multi-phase epic planner (overlap on "plan"); a current-session reflection vs. a historical-sessions reflection (overlap on "reflect").
False-use (trigger-false-1): a query inside the skill's general domain but with the wrong intent. Recommended patterns: surface vocabulary that matches but the actual ask is something else (e.g., a planning skill receiving "plan" in a non-software-task sense; a fix-tests skill receiving a "speed up the test runner" perf request); reverse-intent traps (e.g., write new tests vs fix failing ones). Do NOT use meta-questions about the skill itself ("what does X cover?", "how does X work?", "when should I use X?") as false-use: under Claude Code these are legitimately answered by reading the skill's SKILL.md, so the agent will rightly load it and the judge will record activation. Treat meta-questions as positives or omit them.
Every trigger scenario carries exactly one critical checklist item.
Positive (trigger-pos-*):
checklist = [{
  id: "skill_invoked",
  description: "Did the agent load and act on `<skill-id>` in response to this query? Look in the trace for a `Skill` tool call or a read of the skill's `SKILL.md` for `<skill-id>`.",
  critical: true,
}];
Adjacent and false-use (trigger-adj-* and trigger-false-*):
checklist = [{
  id: "skill_not_invoked",
  description: "Did the agent AVOID loading `<skill-id>`? For this query the skill is not appropriate; the agent should either invoke a different skill or respond directly without reading `<skill-id>/SKILL.md` or calling the `Skill` tool with `<skill-id>`.",
  critical: true,
}];
mod.ts:
import { AcceptanceTestScenario } from "@acceptance-tests/types.ts";
export const TriggerPos1 = new class extends AcceptanceTestScenario {
  id = "<skill-id>-trigger-pos-1";
  name = "<short label, e.g. 'natural fix-tests query'>";
  skill = "<skill-id>";
  agentsTemplateVars = { PROJECT_NAME: "Sandbox" };
  userQuery = "<natural user query>";
  checklist = [{
    id: "skill_invoked",
    description:
      "Did the agent load and act on `<skill-id>` in response to this query? Look in the trace for a `Skill` tool call or a read of the skill's `SKILL.md` for `<skill-id>`.",
    critical: true,
  }];
}();
Before scaling, write one positive scenario and run it; confirm the judge correctly fails the run when the skill's description is mangled to be unrelated. Then revert the description. This validates the pattern end-to-end. See SRS FR-ACCEPT.TRIGGER, SDS §3.4.2.
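The coverage rule for trigger scenarios (exactly one trigger-pos-1, trigger-adj-1, and trigger-false-1 per skill, with no higher-numbered siblings) can be sketched as a pure function over directory names. This is an illustration only and not the contents of scripts/check-trigger-coverage.ts; `checkTriggerCoverage` and its return shape are hypothetical.

```typescript
// The three trigger scenario directories every skill must have.
const REQUIRED_TRIGGER_DIRS = ["trigger-pos-1", "trigger-adj-1", "trigger-false-1"];

// Given the directory names under <skill-id>/benchmarks/, report which
// required trigger directories are missing and which numbered ones are stray.
function checkTriggerCoverage(
  dirNames: string[],
): { missing: string[]; stray: string[] } {
  const present = new Set(dirNames);
  const missing = REQUIRED_TRIGGER_DIRS.filter((name) => !present.has(name));
  const stray = dirNames.filter(
    (name) =>
      /^trigger-(pos|adj|false)-\d+$/.test(name) &&
      !REQUIRED_TRIGGER_DIRS.includes(name),
  );
  return { missing, stray };
}
```

Non-trigger scenario directories are ignored, so execution benchmarks can live alongside the three trigger scenarios.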
To ensure cross-platform compatibility, benchmark results must follow a standard JSON schema.
{
  "scenario_id": "string",
  "outcome": "pass|fail",
  "score": 0-100,
  "metrics": {
    "duration_ms": 1200,
    "cost_usd": 0.01,
    "steps_taken": 5,
    "tokens_used": 1500
  },
  "evidence": {
    "artifacts": ["file_paths"],
    "logs": ["log_entries"]
  },
  "checklist": [
    { "id": "check_1", "status": "pass", "reason": "..." }
  ]
}
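One way to keep producers honest about this schema is a typed shape plus a small runtime check. The sketch below mirrors the fields shown above; the `BenchmarkResult` and `validateResult` names are hypothetical, the check covers only a few top-level fields, and treating checklist `status` as "pass" or "fail" is an assumption, since the schema shows only a "pass" example.

```typescript
interface ChecklistEntry {
  id: string;
  status: "pass" | "fail"; // assumption: same values as `outcome`
  reason: string;
}

interface BenchmarkResult {
  scenario_id: string;
  outcome: "pass" | "fail";
  score: number; // 0 to 100 inclusive
  metrics: {
    duration_ms: number;
    cost_usd: number;
    steps_taken: number;
    tokens_used: number;
  };
  evidence: { artifacts: string[]; logs: string[] };
  checklist: ChecklistEntry[];
}

// Returns a list of problems; an empty list means the checked fields look valid.
function validateResult(value: unknown): string[] {
  if (typeof value !== "object" || value === null) {
    return ["result must be an object"];
  }
  const record = value as Record<string, unknown>;
  const problems: string[] = [];
  if (typeof record["scenario_id"] !== "string") {
    problems.push("scenario_id must be a string");
  }
  const outcome = record["outcome"];
  if (outcome !== "pass" && outcome !== "fail") {
    problems.push("outcome must be 'pass' or 'fail'");
  }
  const score = record["score"];
  if (typeof score !== "number" || score < 0 || score > 100) {
    problems.push("score must be a number between 0 and 100");
  }
  if (!Array.isArray(record["checklist"])) {
    problems.push("checklist must be an array");
  }
  return problems;
}
```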
temperature: 0).