framework/engineering/skills/flowai-skill-diagnose-benchmark-failure/SKILL.md
Use when a flowai benchmark fails and you need the cause from run artifacts before editing. Reads judge-evidence.md, the sandbox SKILL.md, and scenario mod.ts; classifies the failure against a known taxonomy; produces an evidence-grounded report (no fixes). Do NOT trigger for passing benchmarks or generic skill iteration.
When a benchmark fails, the natural reflex is to edit the SKILL.md and re-run.
That is guessing. The run artifacts (judge-evidence.md, sandbox SKILL.md, the
scenario mod.ts) contain the actual chain of cause and effect: what the agent
saw, what it emitted, what the judge measured. This skill enforces an
evidence-first diagnosis so each iteration moves on facts, not hopes.
flowai-skill-plan-interactive). Inferred from the user
prompt or from the most recent failure in the bench output.
<step_by_step>
1. Locate the run dir
   - Start with `benchmarks/runs/latest/<scenario-id>/run-1/`.
   - If `latest` is missing, list `benchmarks/runs/` and pick the most
     recently modified directory containing `<scenario-id>`.
   - Confirm the run dir contains `judge-evidence.md` and `sandbox/`. If either is
     missing → fail closed (rule 4).
2. Read `judge-evidence.md` end to end
   - Identify the sections `<user_query>`, `<agent_logs>`, `<file_diffs>`.
   - From `<agent_logs>`, extract the last assistant turn that the user
     would have seen. This is the agent's actual emitted output.
   - From `<user_query>`, copy the verbatim query the agent received.
   - Note the tool calls (`## Tool: <name>`), especially
     `Skill`, `Read`, `Bash`, `TodoWrite`. The presence or absence of specific
     tool calls is itself evidence.
3. Read the scenario `mod.ts`
   - Path: `framework/<pack>/{skills,commands,agents}/<primitive>/benchmarks/<scenario>/mod.ts`.
     Find it via `find framework -path "*/benchmarks/<scenario>/mod.ts"`.
   - Extract `userQuery`, `userPersona`, `checklist[]` (with `id`,
     `description`, `critical`).
   - Note `interactive`, any `setup()` body, `agentsTemplateVars`.
4. Read the sandbox SKILL.md: BOTH copies, side by side
There are two different SKILL.md files for the same primitive, in two
different locations. You MUST read both and compare them. Confusing them
leads to the wrong classification.
(a) The failing-agent's view (inside the run dir):
<run-dir>/sandbox/.claude/skills/<primitive>/SKILL.md
(Cursor sandbox uses .cursor/skills/; OpenCode uses .opencode/skills/.)
This is the static snapshot the failing agent read. Read this first.
(b) The current framework source:
framework/<pack>/{skills,commands}/<primitive>/SKILL.md
This is the live source on disk now — it may differ from (a).
Do NOT read .claude/skills/<primitive>/SKILL.md at the project root and
call that "the sandbox copy" — it is the current source, identical (or
nearly so) to (b), and tells you nothing about what the failing agent saw.
After reading both, diff (a) vs (b).
The classification depends on what (a) said vs. what the agent actually
emitted in <agent_logs> — not on what (b) currently says.
5. Re-derive the verdict
   judge-evidence.md. If you don't have it, re-run
   the scenario with `--no-cache` and capture stdout. Otherwise, check
   each `checklist[].description` against the agent's last turn from
   step 2 and judge yourself before continuing; this catches LLM-judge
   calibration drift.
6. Match symptoms to the failure-mode taxonomy (next section). Pick the most likely mode based on the quoted evidence. If two modes fit equally, list both; do not collapse them.
7. Write the diagnostic report (template below). Every claim cites a quoted line from step 2/3/4.
</step_by_step>
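Parts of step 2 can be done mechanically. The sketch below is a minimal illustration, not part of flowai: it assumes the sections of judge-evidence.md are delimited by literal tag pairs such as `<agent_logs>` and `</agent_logs>`, and that tool calls appear on lines starting with `## Tool: <name>`. The helper names `extractSection` and `listToolCalls` are hypothetical.

```typescript
// Hypothetical helpers for step 2. The closing-tag convention is an assumption.

/** Returns the text between <tag> and </tag>, or null when the section is absent. */
function extractSection(evidence: string, tag: string): string | null {
  const open = `<${tag}>`;
  const close = `</${tag}>`;
  const start = evidence.indexOf(open);
  if (start === -1) return null;
  const end = evidence.indexOf(close, start + open.length);
  if (end === -1) return null; // fail closed on a truncated section
  return evidence.slice(start + open.length, end).trim();
}

/** Lists tool names from lines shaped like "## Tool: <name>", in order of appearance. */
function listToolCalls(agentLogs: string): string[] {
  const names: string[] = [];
  for (const line of agentLogs.split("\n")) {
    const name = /^## Tool: (\w+)/.exec(line.trim())?.[1];
    if (name) names.push(name);
  }
  return names;
}

// Usage with a tiny made-up evidence file.
const sample = [
  "<user_query>",
  "plan the feature",
  "</user_query>",
  "<agent_logs>",
  '## Tool: Skill { skill: "example" }',
  "## Tool: Read",
  "</agent_logs>",
].join("\n");

const logs = extractSection(sample, "agent_logs") ?? "";
console.log(listToolCalls(logs));
console.log(extractSection(sample, "file_diffs"));
```

Returning `null` for a missing or unterminated section keeps the fail-closed behaviour visible to the caller instead of silently diagnosing from partial evidence.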
A symptom-to-cause map. Use the symptom column to match what you observed in
judge-evidence.md; the cause column gives the most likely root cause; the
fix-direction column points the next iteration at a real lever, not a
guess. Do not invent new modes unless the evidence rules out every one
listed.
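One way to picture such a map in code is a list of typed entries plus a matcher that keeps ties. This is an illustrative sketch only: the `Evidence` and `FailureMode` shapes, the `matchModes` function, and the keyword-based predicates are assumptions, not the skill's actual mechanism. Only the two mode ids and the gist of their symptoms are taken from the taxonomy.

```typescript
// Hypothetical data model for a symptom-to-cause map.

interface Evidence {
  agentSteps: number;        // number of agent steps recorded in the run
  judgeText: string;         // what the judge reported
  lastAssistantTurn: string; // the agent's last emitted output
}

interface FailureMode {
  id: string;
  symptom: string;
  matches: (e: Evidence) => boolean; // assumed predicate, for illustration
}

const modes: FailureMode[] = [
  {
    id: "SKILL-NOT-MOUNTED",
    symptom: '0 agent steps, or judge reports "Unknown skill"',
    matches: (e) => e.agentSteps === 0 || e.judgeText.includes("Unknown skill"),
  },
  {
    id: "HEADING-INSTEAD-OF-ITEM",
    symptom: "agent emits headings such as ### Variant A",
    matches: (e) => /^### Variant /m.test(e.lastAssistantTurn),
  },
];

/** Returns every mode whose symptom fits; ties are listed, never collapsed. */
function matchModes(e: Evidence, table: FailureMode[]): string[] {
  return table.filter((m) => m.matches(e)).map((m) => m.id);
}

const observed: Evidence = {
  agentSteps: 4,
  judgeText: "Options were not presented as numbered items.",
  lastAssistantTurn: "### Variant A\nFast path\n### Variant B\nSafe path",
};
console.log(matchModes(observed, modes));
```

An empty result means no listed mode fits the evidence, which is the only situation in which a new mode should be considered.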
MD-PRIOR-BULLETS (markdown-prior-wins-over-instruction)
- 1., 2., …); agent emits bulleted dashes (`- **X** —`) for option lists with rich descriptions.

HEADING-INSTEAD-OF-ITEM
- `### Variant A`, `### Variant B`, or `**1. Title**` (bold heading).
- `### Variant A` is wrong; demonstrate the correct shape with the EXACT surface form expected.

STALE-SKILL-IN-SANDBOX
- `--no-cache` was forgotten on a quick re-run.
- `--no-cache`. If the issue persists, check `scripts/benchmarks/lib/cache.ts` cache-key inputs vs. what changed.

SKILL-NOT-MOUNTED
- Agent finished with exit code 0 but 0 agent steps, or judge reports "Unknown skill" / agent never invokes the skill.
- `.claude/skills/<name>/` (most common: missing pack in `Copying packs` line; check `scenario.skill` matches an existing primitive).
- `skill:` field, NOT the SKILL.md.

COMPOSITE-DELEGATION-BYPASS
- flowai-review-and-commit) was invoked, but `<agent_logs>` shows an early `## Tool: Skill { skill: "<source-skill>" }` re-entering one of the inlined sources, bypassing the composite's gate.

PERSONA-MISMATCH
- `[USER INPUT] <reply>` that doesn't fit.

TEST-FITTING-PERSONA

CROSS-PACK-REFERENCE-MISSING
- `Copying packs:` line in bench stdout).

Produce exactly this structure. Every bullet ends with a `(<file>:<line-range>)` citation.
# Diagnostic Report: <scenario-id>
## Run inspected
- Run dir: <path>
- Verdict line: "<paste>"
- Failed checklist items (id + critical?): <list>
## Evidence collected (paths)
- judge-evidence.md — <bytes>, <line count>
- sandbox SKILL.md — <path>, <bytes>
- scenario mod.ts — <path>
## Agent's last assistant turn (verbatim, ≤30 lines)
```
<paste>
```
(judge-evidence.md:<L1>-<L2>)
```
<paste>
```
(<sandbox path>:<L1>-<L2>)
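
The citation rule for the report lends itself to a mechanical check. The following sketch assumes citations take the form `(<file>:<start>-<end>)` or `(<file>:<line>)` at the very end of each bullet line. The function name `findUncitedBullets` and the exact regular expression are illustrative assumptions, not part of flowai.

```typescript
// Assumed citation shape: "(path/to/file:12-34)" or "(path/to/file:12)" at end of line.
const CITATION = /\([^()\s:]+:\d+(-\d+)?\)\s*$/;

/** Returns the bullet lines of a report that do not end with a citation. */
function findUncitedBullets(report: string): string[] {
  return report
    .split("\n")
    .filter((line) => line.trimStart().startsWith("- "))
    .filter((line) => !CITATION.test(line));
}

// Usage with a made-up report fragment.
const draft = [
  "# Diagnostic Report: demo-scenario",
  "## Run inspected",
  "- Run dir: benchmarks/runs/latest/demo-scenario/run-1/ (judge-evidence.md:1-3)",
  '- Verdict line: "FAIL"',
].join("\n");

for (const line of findUncitedBullets(draft)) {
  console.log(`missing citation: ${line}`);
}
```

Running a check like this before handing over the report catches claims that were written from memory rather than from a quoted line.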