plugins/flow/skills/goal-evaluator/SKILL.md
Evaluate a FlowGoal against its evidence ledger and update lifecycle status to one of {pass, incomplete, fail, needs_human_review, blocked} by running deterministic verification commands first, then (when stopHookEnforcement=evaluator-loop or explicit /flow:goal evaluate invocation) dispatching the goal-evaluator-judge agent for fuzzy rubric criteria. Use when /flow:goal evaluate is invoked, when the Stop hook fires in evaluator-loop mode, or when /flow:start Phase 4 needs to convert AC evidence into a verdict. This skill MUST be consulted because lifecycle transitions without deterministic evidence enable silent premature completion — the goal contract is only as good as the evaluator that proves or disproves it.
You convert a goal's evidence ledger into a verdict and update the goal's lifecycle. This skill wraps criterion-verification-map (which produces per-AC commands at plan time) and adds the loop-time evaluation: run the commands, capture evidence, judge satisfaction, transition state.
Deterministic checks beat LLM judgment when they apply. The LLM judge runs only when the contract has fuzzy rubric criteria that no command can prove. Always run deterministic checks first; never substitute judge output for a runnable command's exit code.
The invoking command/hook MUST pass:
- Goal id: <id> such that .flow/goals/<id>.goal.yaml exists with lifecycle.status == active (or waiting_for_user, waiting_for_ci, blocked; the evaluator can resurrect these on resume).
- Run id: locates the evidence directory (.flow/runs/<run-id>/evidence/). If absent, the evaluator infers it from the goal's scope.run_id.
- Trigger: manual | stop-hook | command. Affects whether the judge subprocess runs (the Stop hook in evaluator-loop mode auto-runs the judge; manual invocation runs the judge per the goal's evaluator.type).

The skill produces:

- An updated .flow/goals/<id>.goal.yaml with a new lifecycle.last_evaluation and possibly a new lifecycle.status.
- New *.evidence.yaml sidecars under .flow/runs/<run-id>/evidence/, one for each verification command run.
- A goal-evaluation artifact appended to the linked decision journal.
- Per-AC updates: status transitions from pending → evidence_collected → pass | fail; evidence_ref is set to the new sidecar path.

Read .flow/goals/<id>.goal.yaml. Verify the schema. Read existing evidence sidecars under .flow/runs/<run-id>/evidence/ for any AC with evidence_ref already set.
For each AC (acceptance criterion) where verification_command is set and either must_pass is true or all-pass evaluation is required:
# Capture combined stdout/stderr and the exit code.
# VERIFICATION_COMMAND stands in for the AC's verification_command value.
OUTPUT=$(mktemp)
bash -c "$VERIFICATION_COMMAND" > "$OUTPUT" 2>&1
EXIT_CODE=$?
Then assemble a FlowEvidence YAML and write via bin/flow-record-evidence.sh:
apiVersion: flow.synapti.ai/v1
kind: FlowEvidence
metadata:
  id: evidence-<AC.id>-eval-<turn>
  goal: <goal-id>
  run_id: <run-id>
  created_at: <now>
evidence:
  type: command_result
  command: <AC.verification_command>
  exit_code: <captured>
  output_ref: <relative path to .txt copy>
proves:
  - <AC.id>
limitations:
  - <list from criterion-verification-map's "Does NOT promise" field if present>
Update the AC entry: status: evidence_collected, evidence_ref: <sidecar path>, last_evaluated_at: <now>, last_result: <exit-code or summary>.
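The capture-then-record step above can be sketched as a small shell helper. This is only an illustration: `render_evidence` is a hypothetical name, the nesting of `proves` is assumed, and how the result is handed to bin/flow-record-evidence.sh is not specified here, so the sketch only prints the document to stdout.

```shell
# Hypothetical helper: print a FlowEvidence document for one AC to stdout.
# Field nesting is an assumption; check it against the real schema.
# Single-line commands only: the command is emitted as a YAML single-quoted
# scalar, with embedded single quotes doubled.
render_evidence() {
  ac_id=$1 goal_id=$2 run_id=$3 turn=$4
  cmd=$5 exit_code=$6 output_ref=$7
  quoted=$(printf '%s' "$cmd" | sed "s/'/''/g")
  cat <<EOF
apiVersion: flow.synapti.ai/v1
kind: FlowEvidence
metadata:
  id: evidence-${ac_id}-eval-${turn}
  goal: ${goal_id}
  run_id: ${run_id}
  created_at: $(date -u +%Y-%m-%dT%H:%M:%SZ)
evidence:
  type: command_result
  command: '${quoted}'
  exit_code: ${exit_code}
  output_ref: ${output_ref}
proves:
  - ${ac_id}
EOF
}

# Usage sketch (all values illustrative):
#   render_evidence AC-1 my-goal run-42 3 "npm test" "$EXIT_CODE" evidence/AC-1.txt
```

Quoting the command keeps characters such as `#` or `: ` from changing how the YAML parses.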
After all deterministic checks:
- All ACs with must_pass: true have exit_code == 0 → status candidate = pass.
- Any must_pass: true AC with exit_code != 0 → status candidate = fail.
- Any AC without a verification_command (= fuzzy criterion) → status candidate = incomplete (LLM judge required).

If the goal has constraints.allowed_paths, run git diff --name-only (current branch vs. base). If any modified file falls outside allowed_paths, emit a path_boundary_check FlowEvidence with proves: [] and the violating filenames, then transition status to blocked with reason path_boundary_violation.
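A minimal sketch of the two deterministic decisions above, under stated assumptions: the stdin line format, the function names, the precedence fail > incomplete > pass, and plain prefix matching for allowed_paths are all illustrative choices, not part of the documented contract.

```shell
# Input: one line per AC on stdin, "<must_pass> <exit_code>", where exit_code
# is "-" for an AC with no verification_command (a fuzzy criterion).
# Assumed precedence: fail beats incomplete, incomplete beats pass.
candidate_status() {
  candidate=pass
  while read -r must_pass exit_code; do
    if [ "$exit_code" = "-" ]; then
      if [ "$candidate" = "pass" ]; then candidate=incomplete; fi
    elif [ "$must_pass" = "true" ] && [ "$exit_code" -ne 0 ]; then
      candidate=fail
    fi
  done
  echo "$candidate"
}

# Print every changed path that falls under none of the allowed prefixes.
# Changed paths arrive on stdin; allowed prefixes are the arguments.
# Give directory prefixes a trailing slash so "src/" does not match "srcfoo".
paths_outside_allowed() {
  while read -r path; do
    inside=no
    for prefix in "$@"; do
      case "$path" in
        "$prefix"*) inside=yes ;;
      esac
    done
    if [ "$inside" = "no" ]; then echo "$path"; fi
  done
  return 0
}

# Usage sketch (BASE is a hypothetical variable naming the base branch):
#   git diff --name-only "$BASE"...HEAD | paths_outside_allowed src/ tests/
```

Any non-empty output from `paths_outside_allowed` corresponds to the path_boundary_violation case.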
Run the judge subprocess ONLY when:
- evaluator.type == hybrid AND the deterministic candidate is incomplete (= fuzzy criteria remain), OR
- evaluator.type == flow_verdict_judge AND the user explicitly invoked /flow:goal evaluate (manual review).

Spawn Agent(goal-evaluator-judge) with:

- The evidence bundle (format: plugins/flow/references/evidence-bundle-format.md)
- The denied_context list (passed verbatim)

The judge returns verdict + confidence + delta + next_step_hint as a structured table.
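The gating rule for the judge can be written as a small predicate. This is a sketch: `should_run_judge` and its yes/no third argument are hypothetical, standing in for however the caller signals an explicit /flow:goal evaluate invocation.

```shell
# Exit status 0 means "run the judge". Arguments:
#   $1 evaluator.type
#   $2 deterministic candidate (pass | fail | incomplete)
#   $3 explicit invocation of /flow:goal evaluate (yes | no), a hypothetical flag
should_run_judge() {
  evaluator_type=$1 candidate=$2 explicit=$3
  if [ "$evaluator_type" = "hybrid" ] && [ "$candidate" = "incomplete" ]; then
    return 0
  fi
  if [ "$evaluator_type" = "flow_verdict_judge" ] && [ "$explicit" = "yes" ]; then
    return 0
  fi
  return 1
}
```

Every other combination skips the judge, so the deterministic result stands on its own.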
| Deterministic candidate | Judge verdict | Final lifecycle.status |
|---|---|---|
| pass (all must_pass green, no fuzzy) | (judge skipped) | achieved |
| pass + fuzzy criteria | achieved | achieved |
| pass + fuzzy criteria | not_achieved | active (continue) |
| fail | (judge may run for context) | active (continue, surface failing AC) |
| incomplete | not_achieved | active |
| incomplete | blocked (with blocker_type) | blocked |
| incomplete | needs_human_review | waiting_for_user |
| path_boundary_violation | (judge skipped) | blocked |
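The table above maps directly onto a case statement. In this sketch the function name and the "skipped" token for a judge that did not run are assumptions; combinations the table does not list are reported as errors, not guessed.

```shell
# $1 deterministic candidate, $2 judge verdict ("skipped" if the judge did not run).
# Prints the final lifecycle.status, or fails for combinations the table omits.
resolve_status() {
  candidate=$1 judge=$2
  case "$candidate:$judge" in
    path_boundary_violation:*)      echo blocked ;;
    fail:*)                         echo active ;;
    pass:skipped)                   echo achieved ;;
    pass:achieved)                  echo achieved ;;
    pass:not_achieved)              echo active ;;
    incomplete:not_achieved)        echo active ;;
    incomplete:blocked)             echo blocked ;;
    incomplete:needs_human_review)  echo waiting_for_user ;;
    *) echo "unmapped combination: $candidate / $judge" >&2; return 1 ;;
  esac
}
```

Here `pass` with `skipped` corresponds to the first row (no fuzzy criteria), while `pass` with a judge verdict corresponds to the "pass + fuzzy criteria" rows.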
Non-terminal transitions (active, blocked, waiting_for_user, waiting_for_ci):
Update lifecycle.status, lifecycle.turns_evaluated += 1, lifecycle.last_evaluation = {result, reason, at}. Write back via bin/flow-goal-record.sh immediately.
Terminal transitions (achieved, failed, cancelled) — F10 contract:
The skill does NOT write the terminal status itself. Instead, it returns proposed_transition: {to: <achieved|failed|cancelled>, reason: ..., turns_evaluated: ...} in its structured response and leaves the goal's persisted lifecycle.status at its current non-terminal value. The caller is responsible for invoking AskUserQuestion and, on user confirmation, calling bin/flow-goal-record.sh --update-lifecycle to write the terminal status.
The Stop-hook evaluator-loop path is an exception: when the hook calls this skill (or the deterministic path produces a terminal verdict), Tier 2 confirmation cannot run inside the hook (no AskUserQuestion in hook context). The hook persists the verdict via bin/flow-record-verdict.sh and emits a decision: "approve" with a next_step_hint pointing to /flow:goal evaluate <id> — the user explicitly confirms via the command path on the next turn.
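The split between statuses the skill writes itself and statuses it only proposes can be sketched as a lookup. `transition_mode` and its "persist"/"propose" outputs are hypothetical names used for illustration.

```shell
# Print "persist" for non-terminal statuses the skill writes immediately, and
# "propose" for terminal statuses returned to the caller as proposed_transition.
transition_mode() {
  case "$1" in
    active|blocked|waiting_for_user|waiting_for_ci) echo persist ;;
    achieved|failed|cancelled)                       echo propose ;;
    *) echo "unknown status: $1" >&2; return 1 ;;
  esac
}
```

A caller would write the lifecycle update only in the persist case and otherwise leave the stored status untouched until the user confirms.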
bin/journal-record.sh --issue {N} --type goal-evaluation \
--metadata goal_id=<id> \
--metadata result=<lifecycle.status> \
--metadata evidence_bundle=<run-dir relative path> \
--metadata failures=<comma-list of failing AC ids or 'none'>
This skill does NOT write .flow/runs/<run-id>/last-verdict.json. The skill computes the verdict (verdict, confidence, delta, reason, next_step_hint, criterion_results) and returns it to the calling command or hook. The caller is the single owner of verdict persistence.
Callers responsible for the write (one per invocation context):
- /flow:goal evaluate <id> (commands/goal.md): invokes bin/flow-record-verdict.sh after the skill returns. source: "command".
- hooks/scripts/flow-goal-evaluator.sh (Stop-hook evaluator-loop mode): invokes bin/flow-record-verdict.sh via its internal _record_verdict() helper after the judge subprocess returns. source: "evaluator-loop".

Contract:

- The skill MUST NOT call bin/flow-record-verdict.sh. Centralizing persistence in the caller prevents the double-write where the skill wrote first and the command's heredoc immediately overwrote it, silently losing the skill's source: "skill".
- Callers MUST invoke bin/flow-record-verdict.sh and MUST handle helper failure as non-fatal (surface to stderr via ||; do NOT abort the evaluation; the in-memory verdict is still correct, only next-turn delta semantics are lost).

Why this split: Three callers (skill, command, hook) writing through the same helper produced last-writer-wins races. Two callers (command, hook) with no skill-side write is race-free.
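The non-fatal handling described in the contract might look like the wrapper below. It is a sketch only: `persist_verdict_nonfatal` is a hypothetical name, and the helper's real arguments are not given in this document, so the wrapper simply runs whatever command it is handed.

```shell
# Run the persistence command passed as "$@". On failure, warn on stderr but
# still return 0 so the evaluation continues with the in-memory verdict.
persist_verdict_nonfatal() {
  "$@" || echo "warning: verdict persistence failed; next-turn delta may be unavailable" >&2
  return 0
}

# Usage sketch: persist_verdict_nonfatal <helper command and its arguments>
```

Returning 0 unconditionally is the design point: a failed write degrades the next turn's delta but never aborts the current evaluation.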
If trigger == stop-hook AND the new pass-set hash matches the previous turn's hash for flow.goals.failAfterStuckTurns consecutive turns (default 3), transition status to failed with reason stuck_no_progress. This prevents the evaluator loop from churning indefinitely on a goal that can't make forward progress.
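Stuck detection as described above can be sketched with three small functions. The names are hypothetical, and `cksum` is only a stand-in, since the hash function actually used is not specified here.

```shell
# Hash the set of passing AC ids; sorting first makes the hash order-independent.
# cksum is an assumed stand-in for the real hash function.
pass_set_hash() {
  printf '%s\n' "$@" | sort | cksum | cut -d' ' -f1
}

# Print the updated count of consecutive turns with an unchanged pass-set.
next_stuck_count() {
  prev_hash=$1 cur_hash=$2 prev_count=$3
  if [ "$prev_hash" = "$cur_hash" ]; then
    echo $((prev_count + 1))
  else
    echo 0
  fi
}

# Exit status 0 when the count reaches the threshold (default 3, mirroring
# flow.goals.failAfterStuckTurns), i.e. transition to failed / stuck_no_progress.
is_stuck() {
  count=$1 threshold=${2:-3}
  [ "$count" -ge "$threshold" ]
}
```

Resetting the counter whenever the hash changes means only an unbroken run of identical pass-sets triggers the transition.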
Anti-patterns:

- Updating lifecycle.status without writing a goal-evaluation artifact: breaks the audit trail.
- Marking an AC as pass without an evidence_ref: bypasses the evidence ledger.
- Skipping the path-boundary check when allowed_paths is set: goals exist to fence scope.

References:

- plugins/flow/skills/criterion-verification-map/SKILL.md: AC → verification command shape.
- plugins/flow/agents/verdict-judge.md: independence protocol the LLM judge inherits.
- plugins/flow/agents/goal-evaluator-judge.md: the specialized judge this skill dispatches.
- plugins/flow/bin/flow-record-evidence.sh: atomic evidence sidecar writes.
- plugins/flow/bin/flow-goal-record.sh: atomic goal lifecycle updates.
- plugins/flow/bin/flow-record-verdict.sh: last-verdict.json producer; Step 8 invokes this.
- plugins/flow/references/evidence-bundle-format.md: canonical evidence layout.