.claude/skills/judge-verification/SKILL.md
Independent LLM judge evaluates task completion separately from the executing agent, catching false success claims by reviewing task goal, actions taken, final state, and evidence. Produces PASS/FAIL with confidence score and reasoning.
npx skillsauth add oimiragieo/agent-studio judge-verification
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
An independent LLM evaluation layer that verifies whether a task was genuinely completed. The judge reviews the original task goal, the sequence of actions taken, the final state of relevant artifacts, and the claimed completion evidence — then produces a PASS/FAIL verdict with a confidence score and actionable reasoning.
This skill is distinct from verification-before-completion: that skill runs checklist gates
within the same agent context. Judge-verification uses a fresh, independent perspective
with no access to the executing agent's prior reasoning, catching hallucinated success claims.
Skill({ skill: 'judge-verification' });
Invoke when:
- verification-before-completion skill passes but human review is not available
JUDGE IS INDEPENDENT: NO SHARED CONTEXT WITH EXECUTING AGENT
The judge must receive only: (1) the original task goal, (2) the list of actions, (3) the final file states. Never pass the executing agent's reasoning or internal notes to the judge.
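One way to enforce this rule is to build the judge's input through an allowlist, so that anything not explicitly permitted is dropped. The following is a minimal sketch, not the skill's actual implementation: `buildJudgeInput` and the field names on `record` (`taskGoal`, `actions`, `finalFileStates`, `agentReasoning`, `internalNotes`) are hypothetical.

```javascript
// Hypothetical helper: field names are assumptions, not the skill's schema.
function buildJudgeInput(record) {
  // Copy only the three permitted inputs; every other field is dropped on purpose.
  return {
    taskGoal: record.taskGoal,
    actions: [...(record.actions ?? [])],
    finalFileStates: { ...(record.finalFileStates ?? {}) },
  };
}

const record = {
  taskGoal: 'Add input validation to auth.ts',
  actions: ['edited auth.ts', 'ran pnpm test'],
  finalFileStates: { 'auth.ts': 'export function validate(input) { /* ... */ }' },
  agentReasoning: 'I believe the task is done because ...', // must not reach the judge
  internalNotes: ['skipped the empty-string case'], // must not reach the judge
};

const judgeInput = buildJudgeInput(record);
console.log(Object.keys(judgeInput));
```

An allowlist is preferable to deleting known-bad fields, because a new field added to the record later stays out of the judge's view by default.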
The judge reviews four dimensions and produces a combined verdict:
Question: Does the final state match what the task requested?
Evidence to check:
- TaskGet({ taskId })
Score: 0-25 points
Commands:
# Get the original task goal
# (replace {{TASK_ID}} with the actual task ID)
node -e "const fs=require('fs');const tasks=JSON.parse(fs.readFileSync('.claude/context/runtime/tasks.json','utf8')||'[]');const t=tasks.find(x=>x.id==='{{TASK_ID}}');console.log(JSON.stringify(t?.subject||'not found'));"
Expected output: Task subject string showing the original goal. Verify: Subject matches what the agent claimed to accomplish.
Question: Were the claimed actions sufficient to accomplish the goal?
Evidence to check:
- filesModified, outputArtifacts)
- pnpm test)
Score: 0-25 points
Commands:
# Check files modified were actually touched
git diff --name-only HEAD~1 HEAD 2>/dev/null || git status --short
Expected output: List of changed files that should match the agent's filesModified metadata.
Verify: At least one file changed; file list is plausible given the task.
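The comparison between the git output and the agent's filesModified metadata could be sketched as below. `compareFileLists` is a hypothetical helper and the file names are example data. Note that `git status --short` prefixes each path with a status code, so that fallback's output would need the prefix stripped before comparing.

```javascript
// Hypothetical helper: compares the agent's claimed file list with what git reports.
function compareFileLists(claimed, actual) {
  const actualSet = new Set(actual);
  const claimedSet = new Set(claimed);
  return {
    // Claimed as modified but absent from the diff: a red flag for the judge.
    claimedButUnchanged: claimed.filter((f) => !actualSet.has(f)),
    // Changed on disk but never reported by the agent.
    changedButUnclaimed: actual.filter((f) => !claimedSet.has(f)),
  };
}

// Example data; in practice `actual` would come from the git command above.
const claimed = ['src/auth.ts', 'src/auth.test.ts'];
const actual = ['src/auth.ts', 'README.md'];
console.log(compareFileLists(claimed, actual));
```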
Question: Is there concrete, verifiable evidence the task succeeded?
Evidence to check:
Score: 0-25 points
Commands:
# Check if tests pass (if task involved code changes)
cd /c/dev/projects/agent-studio && pnpm test 2>&1 | tail -5
Expected output: Test summary showing pass/fail counts. Verify: Zero failures for tasks that touched tested code.
# Verify target file content matches task intent
# (judge reads the file and checks against task description)
head -50 {{TARGET_FILE_PATH}}
Expected output: File content consistent with the claimed change. Verify: Content is not placeholder/stub; change is real.
Question: Is the system in a coherent state — no regressions, no broken references?
Evidence to check:
Score: 0-25 points
Commands:
# Quick coherence check
cd /c/dev/projects/agent-studio && pnpm lint:fix 2>&1 | tail -10
Expected output: Zero errors, possibly auto-fix count. Verify: Exit code 0 or only style fixes (no logic errors).
totalScore = dim1 + dim2 + dim3 + dim4 (max 100)
PASS: totalScore >= 70 AND dim3 >= 15 (evidence gate — cannot pass with no evidence)
FAIL: totalScore < 60 OR dim3 < 15
CONDITIONAL: totalScore 60-69 with dim3 >= 15 — requires human review
The judge produces a structured verdict:
{
"verdict": "PASS | FAIL | CONDITIONAL",
"confidence": 0.87,
"totalScore": 82,
"dimensions": {
"goalAlignment": 20,
"actionCompleteness": 22,
"evidenceOfCompletion": 20,
"finalStateCoherence": 20
},
"reasoning": "Task goal was to add input validation. Files modified include auth.ts and auth.test.ts. Tests pass. Validation logic present in auth.ts lines 45-67. No regressions detected.",
"failureReasons": [],
"recommendations": ["Consider adding edge case tests for empty string input"]
}
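A verdict object of this shape can be sanity-checked before it is recorded. The sketch below is hypothetical: the real contract lives in schemas/output.schema.json, whose contents are not shown here, so the rules in `checkVerdict` are inferred only from the example above and the 0-25 range of each dimension.

```javascript
// Hypothetical validator; rules are inferred from the example verdict, not from the schema file.
const DIMENSIONS = ['goalAlignment', 'actionCompleteness', 'evidenceOfCompletion', 'finalStateCoherence'];

function checkVerdict(v) {
  const problems = [];
  if (!['PASS', 'FAIL', 'CONDITIONAL'].includes(v.verdict)) {
    problems.push(`unknown verdict: ${v.verdict}`);
  }
  if (typeof v.confidence !== 'number' || v.confidence < 0 || v.confidence > 1) {
    problems.push('confidence must be a number between 0 and 1');
  }
  let sum = 0;
  for (const name of DIMENSIONS) {
    const score = v.dimensions?.[name];
    if (typeof score !== 'number' || score < 0 || score > 25) {
      problems.push(`${name} must be a number from 0 to 25`);
    } else {
      sum += score;
    }
  }
  if (sum !== v.totalScore) {
    problems.push(`totalScore ${v.totalScore} does not equal the dimension sum ${sum}`);
  }
  return problems; // an empty array means no problems were found
}

const sample = {
  verdict: 'PASS',
  confidence: 0.9,
  totalScore: 82,
  dimensions: { goalAlignment: 20, actionCompleteness: 22, evidenceOfCompletion: 20, finalStateCoherence: 20 },
};
console.log(checkVerdict(sample));
console.log(checkVerdict({ ...sample, totalScore: 90 }));
```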
Command:
node -e "
const fs = require('fs');
const taskId = '{{TASK_ID}}';
// Read task context from metadata
const logPath = '.claude/context/runtime/session-gap-log.jsonl';
const lines = fs.existsSync(logPath) ? fs.readFileSync(logPath, 'utf8').split('\n').filter(Boolean) : [];
const relevant = lines.filter(l => l.includes(taskId)).map(l => JSON.parse(l));
console.log(JSON.stringify(relevant.slice(-5), null, 2));
"
Expected output: Recent task log entries with metadata (filesModified, summary). Verify: At least one entry for the task ID.
For each file in filesModified, verify it exists and has non-zero size:
Command:
# For each file claimed as modified (-s is true only if the file exists and is non-empty,
# so MISSING also covers empty files):
[ -s "{{FILE_PATH}}" ] && echo "EXISTS" || echo "MISSING: {{FILE_PATH}}"
Expected output: "EXISTS" for each file. Verify: No MISSING entries — missing files = automatic FAIL for dim2.
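The same check can be run from Node over a whole list of files, which also makes it easy to tell an empty file apart from a missing one. This is a sketch under assumptions: `checkFiles` and its status labels `EMPTY` and `NOT_A_FILE` are hypothetical additions, and the example paths are placeholders.

```javascript
const fs = require('fs');

// Hypothetical helper: reports one status per claimed file.
function checkFiles(paths) {
  return paths.map((p) => {
    try {
      const st = fs.statSync(p);
      if (!st.isFile()) return { path: p, status: 'NOT_A_FILE' };
      return { path: p, status: st.size > 0 ? 'EXISTS' : 'EMPTY' };
    } catch {
      // statSync throws when the path does not exist or cannot be read.
      return { path: p, status: 'MISSING' };
    }
  });
}

// Example: the Node binary is known to exist; the second path is deliberately bogus.
const report = checkFiles([process.execPath, 'no/such/file.txt']);
console.log(report);
```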
Score dimensions 1-4 using the criteria above. Record each score with one-sentence justification.
Expected output: Four scores totaling 0-100. Verify: Total is consistent with the evidence collected.
Apply verdict formula. Check evidence gate (dim3 >= 15 required for PASS).
Command:
const total = dim1 + dim2 + dim3 + dim4;
const verdict =
total >= 70 && dim3 >= 15 ? 'PASS' : total >= 60 && dim3 >= 15 ? 'CONDITIONAL' : 'FAIL';
const confidence = Math.min(1.0, total / 100 + (dim3 >= 20 ? 0.1 : 0));
Expected output: { verdict, confidence, totalScore }.
Verify: Verdict is consistent with the evidence — do not rationalize a PASS without evidence.
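For reference, the snippet above can be wrapped in a self-contained function and exercised with a few score combinations. `computeVerdict` is a hypothetical name; the thresholds and the confidence expression are taken directly from the snippet.

```javascript
// Wraps the verdict formula from the step above; the function name is hypothetical.
function computeVerdict(dim1, dim2, dim3, dim4) {
  const totalScore = dim1 + dim2 + dim3 + dim4;
  const evidenceGate = dim3 >= 15; // no PASS or CONDITIONAL without evidence
  const verdict =
    totalScore >= 70 && evidenceGate ? 'PASS'
    : totalScore >= 60 && evidenceGate ? 'CONDITIONAL'
    : 'FAIL';
  const confidence = Math.min(1.0, totalScore / 100 + (dim3 >= 20 ? 0.1 : 0));
  return { verdict, confidence, totalScore };
}

console.log(computeVerdict(20, 22, 20, 20).verdict); // PASS (total 82)
console.log(computeVerdict(25, 25, 10, 25).verdict); // FAIL (total 85, but the evidence gate fails)
console.log(computeVerdict(15, 15, 17, 15).verdict); // CONDITIONAL (total 62)
```

The second case shows why the evidence gate matters: a high total cannot compensate for a dim3 score below 15.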
Command:
TaskUpdate({
taskId: '{{TASK_ID}}',
status: 'completed', // or keep as-is if just judging
metadata: {
judgeVerdict: {
verdict: '{{VERDICT}}',
confidence: {{CONFIDENCE}},
totalScore: {{SCORE}},
dimensions: { goalAlignment: {{D1}}, actionCompleteness: {{D2}}, evidenceOfCompletion: {{D3}}, finalStateCoherence: {{D4}} },
reasoning: '{{REASONING}}',
failureReasons: [{{FAILURES}}],
recommendations: [{{RECS}}],
judgedAt: new Date().toISOString(),
},
},
});
Expected output: TaskUpdate succeeds with judge verdict in metadata.
Verify: TaskGet({ taskId }) returns metadata.judgeVerdict.verdict.
When verdict is FAIL:
- issues.md
When verdict is CONDITIONAL:
- blocked with blockerType: 'review'
This skill is complementary to verification-before-completion, not a replacement:
| Skill | Perspective | When | Catches |
| -------------------------------- | ----------- | -------------------- | ------------------------------------ |
| verification-before-completion | Same agent | Before claiming done | Missing steps in agent's own context |
| judge-verification | Independent | After claiming done | False success, hallucinated evidence |
Use both: verification-before-completion first, then judge-verification for sign-off.
Input validated against schemas/input.schema.json before execution.
Output contract defined in schemas/output.schema.json.
Pre-execute hook at hooks/pre-execute.cjs validates that taskId and taskGoal are provided.
Post-execute hook at hooks/post-execute.cjs emits a judge-verification event to tool-events.jsonl.
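The contents of hooks/pre-execute.cjs are not shown here, but the check it is described as performing could look roughly like the sketch below. `validateJudgeInput` and its return shape are hypothetical.

```javascript
// Hypothetical sketch of the required-field check; not the actual hook code.
function validateJudgeInput(input) {
  const missing = ['taskId', 'taskGoal'].filter(
    (key) => typeof input?.[key] !== 'string' || input[key].trim() === ''
  );
  return { ok: missing.length === 0, missing };
}

console.log(validateJudgeInput({ taskId: 'task-42', taskGoal: 'Add input validation' }));
console.log(validateJudgeInput({ taskId: 'task-42' }));
```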
- TaskUpdate(completed) yet; wait for the completion claim
Before starting: Read .claude/context/memory/learnings.md for past judge verdicts and common failure patterns.
After completing: If verdict is FAIL or CONDITIONAL, append to .claude/context/memory/issues.md:
## Judge Verification FAIL — Task {{TASK_ID}} — [date]
- Verdict: FAIL (score {{N}}/100)
- Failed dimensions: {{DIMS}}
- Root cause: {{REASONING}}
- Recommendation: {{RECS}}
- verification-before-completion: Pre-completion checklist (same-agent perspective)
- behavioral-loop-detection: Detect loops before completion claim
- error-recovery-escalation: Handle errors before reaching judge
- agent-evaluation: Full LLM-as-judge 5-dimension rubric (broader scope)