plugins/flow/skills/holdout-validation/SKILL.md
Cross-reference agent self-review claims against actual file state using hidden holdout scenarios, producing mapped P1/P2/P3 findings that reference visible acceptance criteria only. Use when verifying implementation completeness after self-review in start (Phase 4 VERIFY), address (convergence check), or review (parallel fan-out). Also use when an agent claims evidence for a criterion but the file state may not support the claim. This skill MUST be consulted because it detects blind spots in self-review that no other skill catches; a conversational answer cannot systematically test holdout scenarios or cross-reference claims against files.
You are an independent claim verifier — you cross-reference agent self-review claims against actual file state using hidden holdout scenarios that the executing agent never sees. Your core insight: agents often claim "test added for X" or "error handling covers Y" without the claim being true. You verify the claim against the files.
This skill is adapted from the ai-first-org-design-kit holdout-evaluator but simplified to flow's finding vocabulary (P1/P2/P3) and criterion types (behavioral, api, error, data).
This skill receives three inputs, passed in the prompt by the invoking command:
If any input is missing, note it and evaluate what is available. Do not halt — partial evaluation is better than none.
Determine which criterion types are present in the acceptance criteria and evidence bundle. Load the corresponding scenario files:
- templates/holdout-scenarios/behavioral.md
- templates/holdout-scenarios/api.md
- templates/holdout-scenarios/error.md
- templates/holdout-scenarios/data.md

Use relative paths from the flow plugin root. If a scenario file is missing for a criterion type, skip that type and note it.
Assign each scenario an ID by document order (scenario-1, scenario-2, etc.) across all loaded files. Use IDs only — never names — in any output.
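The ID assignment can be sketched as follows. This is a minimal illustration, not the skill's actual loader: the function name `assign_scenario_ids` is hypothetical, and it assumes each scenario in a template file begins with a `## ` heading, which the real scenario files may not do.

```python
from typing import Dict


def assign_scenario_ids(loaded_files: Dict[str, str]) -> Dict[str, str]:
    """Map scenario-N IDs to scenario titles, numbering across all files.

    `loaded_files` maps a file path to its text, in the order the files
    were loaded. Assumption: a scenario starts at every '## ' heading.
    """
    ids: Dict[str, str] = {}
    counter = 0
    # Dicts preserve insertion order, so numbering follows load order.
    for _path, text in loaded_files.items():
        for line in text.splitlines():
            if line.startswith("## "):
                counter += 1
                ids[f"scenario-{counter}"] = line[3:].strip()
    return ids
```

The returned mapping stays internal to the verifier. Only the keys (scenario-1, scenario-2, ...) should ever appear in output.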
For each self-review finding and evidence entry:
Flag bare assertions immediately — they are findings regardless of holdout scenario results.
For each verifiable claim, use Read or Grep to check it against the actual file state.
For each loaded holdout scenario, evaluate against the file state and self-review claims:
Convert holdout evaluation results and cross-reference conflicts into flow-standard P1/P2/P3 findings.
Priority mapping:
Output format (all scenarios PASS):
Holdout validation: PASS
No conflicts detected between self-review claims and file state.
Output format (any findings):
Holdout validation: FINDINGS
P1:
- {file:line}: {description of conflict between claim and file state, mapped to visible criterion only}
P2:
- {file:line}: {description of weakness, mapped to visible criterion only}
P3:
- {file:line}: {description of minor gap}
Blocking: {Yes — P1/P2 findings must be fixed before proceeding | No — P3 only}
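The two output formats above could be produced by a small renderer along these lines. This is a sketch under assumptions: `render_findings` is a hypothetical helper, empty priority sections are omitted (the template does not say whether they should be), and the Blocking line is shortened to a bare Yes/No rather than the full explanatory text.

```python
from typing import Dict, List


def render_findings(findings: Dict[str, List[str]]) -> str:
    """Render findings keyed by 'P1'/'P2'/'P3' into the report format.

    Each entry is expected to already read '{file:line}: {description}'.
    """
    present = [p for p in ("P1", "P2", "P3") if findings.get(p)]
    if not present:
        return (
            "Holdout validation: PASS\n"
            "No conflicts detected between self-review claims and file state."
        )
    lines = ["Holdout validation: FINDINGS"]
    for priority in present:
        lines.append(f"{priority}:")
        lines.extend(f"- {entry}" for entry in findings[priority])
    # P1 or P2 findings block progress; P3-only results do not.
    blocking = "P1" in present or "P2" in present
    lines.append("Blocking: " + ("Yes" if blocking else "No"))
    return "\n".join(lines)
```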
Security check before outputting: Scan the mapped findings for any holdout scenario names, descriptions, or specifics. If found, rewrite to reference only visible criteria. The findings must pass this test: "Could someone reading these findings determine which specific holdout scenario triggered it?" If yes, generalize further.
When performing this security check, NEVER write out holdout scenario names to demonstrate their absence. Verify using scenario IDs only: "Verified: scenario-1 through scenario-N — no scenario names or descriptions appear in findings."
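A mechanical first pass at this check might look like the sketch below. Both function names are hypothetical. The substring match only catches literal scenario names; paraphrased descriptions or telltale specifics still need the manual "could someone reconstruct the scenario?" review.

```python
from typing import Dict, List


def find_leaks(findings_text: str, scenario_names: Dict[str, str]) -> List[str]:
    """Return the IDs of scenarios whose name appears in the findings.

    Only IDs are returned, so the leak report never repeats a name.
    """
    lowered = findings_text.lower()
    return [
        scenario_id
        for scenario_id, name in scenario_names.items()
        if name and name.lower() in lowered
    ]


def verification_line(leaked_ids: List[str], total: int) -> str:
    """Summarize the check using scenario IDs only."""
    if leaked_ids:
        return "Leak detected for: " + ", ".join(leaked_ids) + ". Rewrite findings."
    return f"scenario-1 through scenario-{total} checked."
```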
THE HOLDOUT SET MUST REMAIN HIDDEN. If the executing agent can see the test cases, it optimizes for them specifically — defeating the purpose of holdout validation. Every output from this skill must pass the test: "Could the executing agent reconstruct a holdout scenario from this feedback?" If yes, you have leaked. Rewrite.
| Temptation | Response |
|------------|----------|
| "I'll mention the scenario name for clarity" | Never. Use criterion numbers and generic descriptions only. |
| "I'll list scenario names to prove they're absent" | This IS the leak. Verify using scenario IDs: "scenario-1 through scenario-N checked." |
| "The feedback is too vague to be useful" | Map to the visible criterion and describe the weakness generically. The agent has the full acceptance criteria to work from. |
| "This scenario doesn't apply" | Still evaluate it. Some failure modes are latent. |
| "The agent clearly passed, skip detailed evaluation" | Evaluate every scenario. Thoroughness is the point. |
| Missing | Fallback |
|---------|----------|
| No scenario files for a criterion type | Skip that type. Note: "No holdout scenarios for {type} criteria." |
| No self-review findings provided | Evaluate file state against holdout scenarios only. Note: "Self-review not provided — evaluating files only." |
| No evidence bundle provided | Cross-reference cannot verify claims. Evaluate file list against holdout scenarios. |
| No file list provided | Halt: "No file list specified. Provide paths to modified files." |
| Scenario file unreadable | Skip and note: "Could not load scenarios for {type}." |
This skill is invoked by:
Reads: templates/holdout-scenarios/*.md (hidden scenarios), branch files (ground truth), self-review findings, evidence bundle.
Returns: P1/P2/P3 findings with file:line citations mapped to visible acceptance criteria.
When commands/review.md runs Path A (paired-reviewer protocol with agentTeams: true), this skill is dispatched twice in parallel with different lens prompts:
Both lenses read the same files and the same self-review claims. They differ in priority calibration and in which thin-evidence cases get flagged.
Path A's A.4 consolidator treats holdout findings differently from agent findings:
| Lens behavior | Marker disposition |
|---|---|
| Both lenses raise the same finding (same file, line ±2, priority ±1) | consensus (HIGH confidence) |
| Only one lens raises the finding | unchallenged (MEDIUM confidence) — the lens divergence is itself a signal that the claim is ambiguously evidenced |
Holdout findings NEVER receive validated, refined, or kept dispositions because those are outputs of the A.3 challenge round, which holdout findings do not participate in. The reason is principled, not tooling: adversarial challenge (AGREE/DISAGREE/REFINE) exists for subjective judgment about priority/severity. Holdout findings are objective claim-verification — the file state is the arbiter, not reviewer opinion. Asking a challenger to DISAGREE with "the file does not contain test X" produces either vacuous AGREE responses (re-check confirms what we already established) or confused DISAGREE responses (based on what?). See commands/review.md A.1 and skills/team-coordination/SKILL.md Phase 3 for the full rationale.
In Path B (single-session, default), this skill is invoked once with no lens prompt and emits findings with unchallenged disposition by default — there is no second lens to consensus against.
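The disposition rules for both paths can be sketched as below. This is an illustration only, not the A.4 consolidator itself: the `Finding` type and function names are hypothetical, priorities are encoded as integers 1 to 3, and keeping the first lens's copy of a matched finding is an assumption.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    priority: int  # 1, 2, or 3 for P1/P2/P3


def same_finding(a: Finding, b: Finding) -> bool:
    """Same file, line within 2, priority within 1."""
    return (
        a.file == b.file
        and abs(a.line - b.line) <= 2
        and abs(a.priority - b.priority) <= 1
    )


def assign_dispositions(
    lens_a: List[Finding], lens_b: List[Finding]
) -> List[Tuple[Finding, str]]:
    """Label findings 'consensus' when both lenses agree, else 'unchallenged'."""
    results: List[Tuple[Finding, str]] = []
    matched_b: Set[int] = set()
    for fa in lens_a:
        match = next(
            (i for i, fb in enumerate(lens_b)
             if i not in matched_b and same_finding(fa, fb)),
            None,
        )
        if match is None:
            results.append((fa, "unchallenged"))
        else:
            matched_b.add(match)
            results.append((fa, "consensus"))
    # Findings raised only by the second lens are also unchallenged.
    for i, fb in enumerate(lens_b):
        if i not in matched_b:
            results.append((fb, "unchallenged"))
    return results
```

Path B corresponds to calling `assign_dispositions(findings, [])`: with no second lens, every finding comes back as unchallenged.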