skills/eval-council/SKILL.md
# Skill: Eval Council
> Multi-perspective evaluation agent that validates Build Brief quality, skill outputs, and critical decisions before they proceed downstream. Inspired by the Council/RedTeam/FirstPrinciples thinking tools from Daniel Miessler's PAI system. Evaluation is opt-OUT for active surfaces, not a universal mandate for every overlay. You must justify skipping an active persona or section.
## Why This Exists

Agents are confident. Confidently wrong is still wrong.
The Build Brief Agent produces a technical design. Seven downstream skills consume it. Autonomous coding agents execute against it. If the brief has a flawed assumption, a missed dependency, a vague acceptance criterion, or a security blind spot — that flaw propagates through the entire chain and becomes a bug, an outage, or a rework cycle.
The Eval Council catches failures before they propagate. It is the last structured review before humans approve and machines execute. Core personas always run; overlay personas activate from the task applicability manifest when the change surface warrants them.
Opt-OUT, not opt-IN for active surfaces. Every Build Brief and every skill output with an active surface gets evaluated by default. The burden of proof is on exclusion. "It looks fine" and "we're in a hurry" are not valid reasons to skip evaluation. Valid reasons: "This is a trivial config change with no behavior change" or "This output is identical to a previously approved output." Inactive overlays still require a concrete not_applicable reason.
The Eval Council runs at these points in the ADLC lifecycle:
| Checkpoint | What's Evaluated | Blocking? |
|-----------|-----------------|-----------|
| Post-Brief | Complete Build Brief before engineer review | Yes: brief cannot be presented for approval until council passes |
| Post-Repo-Analysis | Codebase Research repo map before it feeds downstream | Yes: downstream skills consume this; errors compound |
| Post-Skill-Output | Each skill's output before it's published/committed | Configurable: blocking for work-item/document emitters, non-blocking for scaffolding |
| Pre-Deploy | Aggregated state: all tickets done, tests pass, runbook exists | Yes: deploy gate |
| Post-Incident | Retrospective: did the brief predict this failure mode? | No: learning loop, feeds back into future briefs |
The Eval Council evaluates active surfaces through core personas and overlay personas. Core personas ask different questions and catch different failure modes. Overlay personas activate only when the applicability manifest says the surface exists. They do not collaborate — they evaluate independently, then their verdicts are synthesized.
Core personas always run. Overlay personas activate from change_surface flags in the applicability manifest. This table is the authoritative mapping consumed by the deterministic council_personas evaluator. Any change here must be mirrored in tests/backtest/evaluators/council_personas.sh.
| Persona | Role | Trigger Expression |
|---|---|---|
| skeptic | Core | Always active |
| executioner | Core | Always active |
| first_principles | Core | Always active |
| architect | Overlay | service_boundary_change OR external_integration OR api_change OR data_format_change |
| operator | Overlay | runtime_path_change OR user_facing_operation |
| security_auditor | Overlay focus (expands Skeptic) | new_attack_surface OR auth_change OR external_integration |
Suppressed overlays must carry a concrete not_applicable reason tied to manifest evidence, per the opt-OUT policy.
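The trigger table above can be read as a small lookup. The following is a minimal sketch, not the actual `council_personas` evaluator: it assumes `change_surface` arrives as a mapping of flag name to boolean, and the names `resolve_personas` and `OVERLAY_TRIGGERS` are hypothetical.

```python
from typing import Dict, List, Set

# Core personas always run, per the table above.
CORE_PERSONAS: List[str] = ["skeptic", "executioner", "first_principles"]

# Overlay persona -> change_surface flags that activate it (OR semantics),
# copied from the trigger expressions in the table.
OVERLAY_TRIGGERS: Dict[str, Set[str]] = {
    "architect": {
        "service_boundary_change",
        "external_integration",
        "api_change",
        "data_format_change",
    },
    "operator": {"runtime_path_change", "user_facing_operation"},
    "security_auditor": {"new_attack_surface", "auth_change", "external_integration"},
}


def resolve_personas(change_surface: Dict[str, bool]) -> Dict[str, List[str]]:
    """Return active and suppressed personas for a manifest's flags.

    Assumption: change_surface maps flag names to booleans.
    """
    set_flags = {flag for flag, value in change_surface.items() if value}
    active = list(CORE_PERSONAS)
    suppressed: List[str] = []
    for persona, triggers in OVERLAY_TRIGGERS.items():
        if triggers & set_flags:
            active.append(persona)
        else:
            # Each suppressed overlay still needs a not_applicable reason.
            suppressed.append(persona)
    return {"active": active, "suppressed": suppressed}
```

Every persona returned under `suppressed` would still need its own `not_applicable` reason tied to manifest evidence; this sketch only identifies which overlays need one.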
### Architect

Perspective: System design, patterns, boundaries, blast radius.

Activation: Overlay persona. Only active when architecture, interface, or wiring changes are in scope.
Asks:
Catches: Pattern violations, unnecessary complexity, coupling risks, missed service boundaries, over-engineering.
Evaluates: Build Brief (Sections 2, 3), Codebase Research (architecture, services), Architecture Scaffolding output.
### Skeptic (Red Team)

Perspective: What can go wrong? What assumptions are unverified? Where are we lying to ourselves?

Activation: Core persona. Always active; security focus expands when the security overlay is active.
Asks:
Security Auditor Focus: Validates STRIDE is complete AND security concerns table has specific (not generic) mitigations for every identified attack vector. "Use encryption" is not a mitigation — "Encrypt PII fields at rest using AES-256 via the existing EncryptionService at src/lib/encryption.ts" is a mitigation. Every STRIDE category must be addressed with evidence from the codebase, not theoretical hand-waving.
Catches: Optimism bias, unverified assumptions, incomplete failure analysis, security blind spots, "it'll be fine" thinking, incomplete STRIDE coverage, generic security mitigations.
Evaluates: Build Brief (Sections 4, 5, 11, STRIDE threat model, security concerns table), Incident Runbook, QA Test Data (are edge cases actually edgy?).
### Operator

Perspective: Production reality. On-call at 2am. Monitoring. Debugging.

Activation: Overlay persona. Active when runtime paths, deployment behavior, or user-facing operations change.
Asks:
Catches: Missing observability, vague runbooks, unrealistic SLOs, phantom escalation paths, "we'll add monitoring later."
Evaluates: Build Brief (Section 6), Incident Runbook, CI/CD Pipeline (deploy gates, rollback), Codebase Research (observability section).
### Executioner

Perspective: Can a coding agent actually build this? Is every task self-contained and unambiguous? (Aligned with Spec Driven Development's core principle: if the agent has to guess, the task isn't ready.)

Activation: Core persona. Always active.
Asks:
Self-containment test (applied to every task):
SELF-CONTAINMENT CHECK for [Task ID]:
│ Deliverable described without external references? [PASS/FAIL]
│ File paths to modify/create are explicit? [PASS/FAIL]
│ Pattern named with reference implementation path? [PASS/FAIL]
│ Acceptance criteria in Given/When/Then? [PASS/FAIL]
│ Dependencies explicit by task ID? [PASS/FAIL]
│ Could a cold-start coding agent execute this? [PASS/FAIL]
Any FAIL = the task is not agent-ready. This is a major finding.
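As an illustration of the rule that any FAIL makes a task not agent-ready, the check can be reduced to a function over the six results. This is a sketch under assumptions: the function name, the result mapping, and the choice to treat a missing result as FAIL are hypothetical, not part of the skill.

```python
from typing import Dict, Optional

# The six checks from the self-containment template above.
SELF_CONTAINMENT_CHECKS = [
    "Deliverable described without external references?",
    "File paths to modify/create are explicit?",
    "Pattern named with reference implementation path?",
    "Acceptance criteria in Given/When/Then?",
    "Dependencies explicit by task ID?",
    "Could a cold-start coding agent execute this?",
]


def self_containment_finding(task_id: str, results: Dict[str, bool]) -> Optional[dict]:
    """Return a major finding if any check failed, else None.

    Assumption: a check that is absent from results counts as FAIL.
    """
    failed = [check for check in SELF_CONTAINMENT_CHECKS if not results.get(check, False)]
    if not failed:
        return None
    return {
        "severity": "major",
        "location": task_id,
        "finding": "Task is not agent-ready; failed checks: " + "; ".join(failed),
    }
```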
Catches: Vague tasks, untestable acceptance criteria, missing pattern references, implicit dependencies, tasks that are actually 3 tasks pretending to be 1, missed parallelism opportunities.
Evaluates: Build Brief task breakdown, emitted work-item tickets or issues, QA Test Data fixtures, Architecture Scaffolding (are contracts and implementation guides complete enough to code against without placeholders?).
### First Principles

Perspective: Are we solving the right problem? Are we building what we should build?

Activation: Core persona. Always active.
Asks:
Catches: Scope creep dressed as requirements, over-engineering, misclassified decisions, "Phase 1" that's actually Phase 1+2+3, cargo-culted patterns.
Evaluates: Build Brief (Sections 1, 7, 10), all decision logs.
Before any persona evaluates, validate the payload against the contract schemas and run only the cheapest structural checks. Do not spend council tokens on pure presence checks.
- `specificity-judge`, plus `verifier-semantic-judge` when the deterministic intersection check is non-empty.

Presence checks such as `task_classification`, `change_surface`, `applicability_manifest`, and `verification_spec` are satisfied by JSON schema validation at Gate 0 entry. Do not ask an LLM whether a required field exists.

- `docs/schemas/build-brief.schema.json`.
- `docs/schemas/applicability-manifest.schema.json`.
- `schema_validation = fail`.

For every task, run `specificity-judge` with:

- `artifact_type`
- `decision_contract`
- `reference_impl`
- `files_to_modify`
- `files_to_create`
- `tech_debt_boundaries`
- `compatibility_contract`
- `evidence_responsibilities`
- `definition_of_done`
- `verification_spec.primary_verifier.target`
- `verification_spec.primary_verifier.target_files`
- `verification_spec.primary_verifier.expected_failure_mode`

Thresholds:

- `score >= 0.8` -> `pass`
- `0.6 <= score < 0.8` -> `warn`
- `score < 0.6` -> `revise` with reason `low_specificity`

There is no counting fallback. If `fast_judge` is unavailable for the active runtime, emit `stuck` with reason `specificity_judge_unavailable`.
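The thresholds above map a judge score onto a verdict. A minimal sketch of that mapping follows; the function name is hypothetical, and representing an unavailable judge as `None` is an assumption made for illustration.

```python
from typing import Optional, Tuple


def specificity_verdict(score: Optional[float]) -> Tuple[str, Optional[str]]:
    """Map a specificity-judge score to (verdict, reason).

    Assumption: score is None when the judge is unavailable for the runtime.
    """
    if score is None:
        # No counting fallback: an unavailable judge yields stuck.
        return ("stuck", "specificity_judge_unavailable")
    if score >= 0.8:
        return ("pass", None)
    if score >= 0.6:
        return ("warn", None)
    return ("revise", "low_specificity")
```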
For every task, compare verification_spec.target_files against the union of files_to_modify and files_to_create.
- If `target_files` is set and the intersection is non-empty, Gate 0 passes the mechanical screen and then must run `verifier-semantic-judge` to confirm the verifier exercises the semantic change rather than merely touching the file.
- If `target_files` is set and the intersection is empty, Gate 0 fails with verdict `REVISION_REQUIRED` and reason `verifier_no_coverage`.
- If `target_files` is unset on a Build Brief task, Gate 0 fails with verdict `REVISION_REQUIRED` and reason `verifier_scope_unset`.

This check is mechanical. Do not waive it with prose if the verifier names files that the task does not touch.
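Because the verifier scope check is a set intersection, it can be sketched in a few lines. The function name, the return shape, and the `PASS` label for the mechanical screen are hypothetical; the two reasons and the `REVISION_REQUIRED` verdict come from the rules above.

```python
from typing import Iterable, Optional


def verifier_scope_check(
    target_files: Optional[Iterable[str]],
    files_to_modify: Iterable[str],
    files_to_create: Iterable[str],
) -> dict:
    """Compare verifier target files against the files the task touches."""
    targets = set(target_files or [])
    if not targets:
        return {"verdict": "REVISION_REQUIRED", "reason": "verifier_scope_unset",
                "run_semantic_judge": False}
    touched = set(files_to_modify) | set(files_to_create)
    if targets & touched:
        # Mechanical screen passed; verifier-semantic-judge must still run.
        return {"verdict": "PASS", "reason": None, "run_semantic_judge": True}
    return {"verdict": "REVISION_REQUIRED", "reason": "verifier_no_coverage",
            "run_semantic_judge": False}
```

Note that this sketch treats an empty `target_files` list the same as an unset one, which is an assumption the source does not state.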
For every task artifact:
- `scope_lock_epic` must be context-only and must not include executable file-change instructions.
- `decision_gate` must carry an unresolved Type 1 decision, owner, deadline, blocked implementation IDs, and the exact question to resolve.
- `implementation_task` must not have an unresolved Type 1 decision or `blocks_implementation = true`.
- `validation_task` must own verifier execution, evidence capture, compatibility checks, and final Definition of Done proof.

GATE 0 PRE-CHECK:
│ Schema validation: PASS / FAIL
│ Specificity judge: PASS / WARN / REVISE / STUCK
│ Verifier scope check: PASS / WARN / FAIL
│ Verifier semantic check: PASS / SKIP / FAIL
│ Gate 0 verdict: PASS / FAIL / STUCK
│
│ If FAIL — return immediately with list of failed checks.
│ If STUCK — return immediately with the missing-judge reason.
│ Do NOT proceed to council evaluation.
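One way to combine the individual check results into the Gate 0 verdict is sketched below. The source does not define precedence, so several choices here are assumptions: `REVISE` is treated as blocking like `FAIL`, `FAIL` outranks `STUCK`, and `WARN` and `SKIP` do not block. The function name and return shape are hypothetical.

```python
from typing import Dict

# Assumption: a REVISE from the specificity judge blocks like a FAIL.
BLOCKING_STATUSES = {"FAIL", "REVISE"}


def gate0_verdict(checks: Dict[str, str]) -> dict:
    """Aggregate per-check statuses into a Gate 0 verdict."""
    failed = [name for name, status in checks.items() if status in BLOCKING_STATUSES]
    stuck = [name for name, status in checks.items() if status == "STUCK"]
    if failed:
        # Return immediately with the list of failed checks.
        return {"verdict": "FAIL", "failed_checks": failed, "proceed_to_council": False}
    if stuck:
        # Return immediately; the caller attaches the missing-judge reason.
        return {"verdict": "STUCK", "stuck_checks": stuck, "proceed_to_council": False}
    return {"verdict": "PASS", "proceed_to_council": True}
```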
For each evaluation checkpoint, determine which personas are relevant from the applicability manifest. Core personas are active by default. Overlay personas are active only when the manifest says the surface exists. To exclude an active persona, you must provide a justification:
EVAL COUNCIL ASSESSMENT (manifest-aware):
│ Architect: INCLUDE — interface or service boundary changed
│ Skeptic (Red Team): INCLUDE — always active core failure analysis
│ Operator: EXCLUDE — no runtime path or user-facing operation changed
│ Executioner: INCLUDE — autonomous execution required
│ First Principles: INCLUDE — always active core scope check
"Too simple" is not a valid exclusion. Simple tasks can have hidden assumptions. "Already reviewed by a human" is valid only if the human review was documented. Inactive overlays still require explicit not_applicable reasons.
Each active persona evaluates the output independently. They do not see each other's evaluations. This prevents anchoring bias.
Each persona produces:
{
"persona": "architect | skeptic | operator | executioner | first_principles",
"verdict": "PASS | CONCERN | FAIL",
"findings": [
{
"severity": "critical | major | minor | observation",
"location": "string (section, task ID, file path, or line reference)",
"finding": "string (what's wrong)",
"evidence": "string (why this is wrong — cite specific content)",
"recommendation": "string (how to fix it)",
"blocks_approval": true
}
],
"confidence": 0.0-1.0,
"summary": "string (1-2 sentences)"
}
Individual evaluations are synthesized into a council verdict:
| Condition | Council Verdict | Action |
|-----------|----------------|--------|
| All personas PASS | APPROVED | Proceed to next stage |
| Any persona has CONCERN, none FAIL | APPROVED WITH CONCERNS | Proceed, but concerns logged and tracked |
| Any persona FAIL with critical finding | BLOCKED | Cannot proceed until critical findings resolved |
| Any persona FAIL with major finding | REVISION REQUIRED | Author must address findings, re-submit for eval |
| Multiple personas FAIL | BLOCKED | Full re-evaluation required after revision |
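The synthesis table can be expressed as a short function over the persona evaluations. This is a sketch, not the council's implementation: the rows are checked in an order chosen here (multiple FAILs first), and a single FAIL without a critical finding is treated as REVISION REQUIRED, which the table only states for major findings. Both choices are assumptions.

```python
from typing import List


def synthesize(evaluations: List[dict]) -> str:
    """Combine independent persona evaluations into a council verdict.

    Each evaluation follows the persona output shape above:
    {"verdict": "PASS" | "CONCERN" | "FAIL", "findings": [{"severity": ...}, ...]}
    """
    failing = [e for e in evaluations if e["verdict"] == "FAIL"]
    if len(failing) > 1:
        return "BLOCKED"
    if failing:
        severities = {f["severity"] for f in failing[0].get("findings", [])}
        if "critical" in severities:
            return "BLOCKED"
        # Assumption: any other single FAIL is handled as REVISION REQUIRED.
        return "REVISION REQUIRED"
    if any(e["verdict"] == "CONCERN" for e in evaluations):
        return "APPROVED WITH CONCERNS"
    return "APPROVED"
```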
The council produces a structured report. It records which overlays were active, which were suppressed, and why:
## Eval Council Report: [Feature Name]
**Checkpoint:** [Post-Brief | Post-Repo-Analysis | Post-Skill | Pre-Deploy]
**Evaluated:** [what was evaluated]
**Date:** [ISO date]
**Verdict:** [APPROVED | APPROVED WITH CONCERNS | REVISION REQUIRED | BLOCKED]
### Persona Verdicts
| Persona | Verdict | Critical | Major | Minor | Confidence |
|---------|---------|----------|-------|-------|------------|
| Architect | PASS | 0 | 0 | 1 | 0.85 |
| Skeptic | CONCERN | 0 | 2 | 0 | 0.70 |
| Operator | PASS | 0 | 0 | 0 | 0.90 |
| Executioner | FAIL | 1 | 1 | 3 | 0.80 |
| First Principles | EXCLUDE | — | — | — | — |
| | | | | | |
| **Council** | **REVISION REQUIRED** | **1** | **3** | **4** | |
### Critical Findings (must resolve)
**[C-001] Executioner:** Task BE-003 "Add widget endpoint" has no acceptance criteria for error responses.
- **Location:** Section 8, Backend, Task 3
- **Evidence:** Acceptance criteria says "returns 201 on success" but does not specify 400, 401, 404, or 409 behavior.
- **Recommendation:** Add: "Returns 400 with validation errors on invalid input. Returns 401 if unauthenticated. Returns 409 if widget name conflicts."
- **Blocks:** Yes — coding agent will not handle errors without this.
### Major Findings (should resolve)
**[M-001] Skeptic:** Rollback plan assumes feature flag exists, but no task creates the feature flag.
- **Location:** Section 4, Rollback Mechanism
- **Evidence:** Rollback mechanism says "disable via LaunchDarkly flag" but no task in Section 8 creates this flag.
- **Recommendation:** Add task: "Create LaunchDarkly flag `widget-v2-enabled` with kill switch targeting."
**[M-002] Skeptic:** Failure mode FM-003 (database timeout) lists prevention as "add connection pooling" but connection pooling already exists per repo map.
- **Location:** Section 11, FM-003
- **Evidence:** Codebase Research `data_layer` shows Prisma connection pool configured in `src/server/config.ts`.
- **Recommendation:** Update FM-003 prevention to address the actual risk (query optimization, read replica, or timeout tuning).
**[M-003] Executioner:** Backend task 5 estimated at 2h but requires schema migration + backfill + rollback script — this is 3 tasks.
- **Location:** Section 8, Backend, Task 5
- **Recommendation:** Decompose into: (a) migration script 1h, (b) backfill script 1h, (c) rollback script 1h.
### Minor Findings (consider resolving)
[Listed with same format, lower urgency]
### Observations (informational)
[Non-actionable notes for context]
The council evaluates the complete Build Brief against these criteria:
Spec/Plan/Task Separation (Spec Driven Development)
- `not_applicable` with a concrete reason

Completeness

- `not_applicable`)

Consistency
Task Self-Containment & Parallelism
Executability
Per-skill criteria (in addition to each skill's own quality gates):
| Skill | Key Eval Criteria |
|-------|------------------|
| Work-item emitter | Every task from brief has a ticket or issue. No work item exceeds 2h. Dependencies are linked. |
| Document emitter | Page hierarchy matches structure. Diagrams or source blocks render acceptably. Type 1 warnings visible. |
| QA | Tests are deterministic (no randomness). Every AC has a test. Edge cases are real edge cases. |
| CI/CD | Workflows match repo conventions. Secrets exist. Rollback mechanism matches brief. |
| Scaffolding | Ports match existing conventions. Domain has no infra imports. Files compile. |
| Runbook | Every step is executable (commands, not descriptions). Escalation path has names. |
The Eval Council uses the Codebase Research repo map as ground truth for verification. This is what makes evaluation concrete rather than theoretical.
Example cross-references:
| Brief Claims | Repo Map Validates |
|-------------|-------------------|
| "We'll add a new endpoint to WidgetRouter" | api_surface.endpoint_groups confirms WidgetRouter exists at stated path |
| "Follow ports-and-adapters pattern" | architecture.pattern confirms this is the convention |
| "Tests will use Vitest" | testing.framework confirms Vitest is the test framework |
| "Deploy via Argo canary" | ci_cd.deployment_strategy confirms Argo Rollouts with canary |
| "Rollback via feature flag" | tech_stack.feature_flags confirms LaunchDarkly is in use |
| "Error rate < 0.1% target" | observability.existing_slos shows whether any SLOs exist to baseline against |
| "Schema migration for widgets table" | data_layer.models confirms Widget model exists; data_layer.migration_patterns shows how migrations are done |
If the brief claims something that contradicts the repo map, the Skeptic persona flags it as a critical finding.
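A cross-reference of this kind amounts to looking up a dotted key in the repo map and comparing it with the brief's claim. The sketch below assumes the repo map is nested JSON keyed as in the table (for example `testing.framework`); the helper names, the exact-match comparison, and the choice to treat a missing key as a contradiction are all hypothetical.

```python
from typing import Any, Optional


def lookup(repo_map: dict, dotted_path: str) -> Any:
    """Walk a dotted path such as 'testing.framework'; None if absent."""
    node: Any = repo_map
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def check_claim(repo_map: dict, dotted_path: str, claimed: Any) -> Optional[dict]:
    """Return a critical Skeptic finding if the claim contradicts the repo map."""
    actual = lookup(repo_map, dotted_path)
    if actual == claimed:
        return None
    # Assumption: a missing key is reported the same way as a mismatch.
    return {
        "persona": "skeptic",
        "severity": "critical",
        "location": dotted_path,
        "evidence": f"brief claims {claimed!r}, repo map has {actual!r}",
    }
```

For instance, a brief that says "Tests will use Vitest" would be checked with `check_claim(repo_map, "testing.framework", "Vitest")`.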
`evaluate_output`

{
"name": "evaluate_output",
"description": "Run the Eval Council against a Build Brief, skill output, or deployment state",
"inputSchema": {
"type": "object",
"properties": {
"checkpoint": {
"type": "string",
"enum": ["post_brief", "post_repo_analysis", "post_skill_output", "pre_deploy", "post_incident"],
"description": "Which evaluation checkpoint"
},
"content": {
"type": "string",
"description": "The content to evaluate (markdown, JSON, or YAML)"
},
"repo_map": {
"type": "string",
"description": "Repo map JSON from Codebase Research skill (used as ground truth)"
},
"exclude_personas": {
"type": "array",
"items": {"type": "string"},
"description": "Personas to exclude WITH justification objects"
},
"previous_findings": {
"type": "string",
"description": "Previous eval report (for re-evaluation after revision)"
}
},
"required": ["checkpoint", "content"]
}
}
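Before calling `evaluate_output`, a caller might check its arguments against the input schema above. The following hand-rolled sketch covers only the `required` list, the `checkpoint` enum, and the string-typed fields; the function name is hypothetical, and a real client would more likely use a JSON Schema validator.

```python
from typing import List

# Values copied from the evaluate_output inputSchema above.
CHECKPOINTS = {"post_brief", "post_repo_analysis", "post_skill_output",
               "pre_deploy", "post_incident"}
REQUIRED_FIELDS = ("checkpoint", "content")
STRING_FIELDS = ("checkpoint", "content", "repo_map", "previous_findings")


def validate_evaluate_output_args(args: dict) -> List[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors = [f"missing required field: {name}"
              for name in REQUIRED_FIELDS if name not in args]
    for name in STRING_FIELDS:
        if name in args and not isinstance(args[name], str):
            errors.append(f"{name} must be a string")
    checkpoint = args.get("checkpoint")
    if isinstance(checkpoint, str) and checkpoint not in CHECKPOINTS:
        errors.append(f"unknown checkpoint: {checkpoint}")
    return errors
```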
`evaluate_decision`

{
"name": "evaluate_decision",
"description": "Run council evaluation on a specific Type 1 decision",
"inputSchema": {
"type": "object",
"properties": {
"decision_description": {
"type": "string",
"description": "The decision to evaluate"
},
"options": {
"type": "array",
"items": {"type": "string"},
"description": "Available options"
},
"context": {
"type": "string",
"description": "Relevant context from the Build Brief"
},
"repo_map_section": {
"type": "string",
"description": "Relevant repo map section"
}
},
"required": ["decision_description", "options", "context"]
}
}
`check_findings_resolved`

{
"name": "check_findings_resolved",
"description": "Verify that critical and major findings from a previous eval have been addressed",
"inputSchema": {
"type": "object",
"properties": {
"previous_report": {
"type": "string",
"description": "Previous eval council report"
},
"updated_content": {
"type": "string",
"description": "Updated content after revision"
}
},
"required": ["previous_report", "updated_content"]
}
}
# Evaluate a build brief
adlc-eval brief --input ./build-brief.md --repo-map ./repo-map.json --output ./eval-report.md
# Evaluate a specific skill output
adlc-eval skill --skill jira --input ./jira-tickets.json --repo-map ./repo-map.json
# Evaluate with persona exclusion (must provide justification)
adlc-eval brief --input ./build-brief.md --exclude "first_principles:scope locked in Phase 0"
# Re-evaluate after revision
adlc-eval brief --input ./build-brief-v2.md --previous ./eval-report-v1.md
# Evaluate a Type 1 decision
adlc-eval decision --description "Change auth provider from Clerk to Auth0" \
--options "Keep Clerk" "Migrate to Auth0" "Abstract behind port" \
--context ./build-brief.md
# Pre-deploy check
adlc-eval deploy --epic ENG-123 --repo-map ./repo-map.json
# Post-incident learning
adlc-eval incident --incident-report ./incident.md --build-brief ./build-brief.md
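The `--exclude` example above passes a persona and its justification as one `persona:reason` string. How `adlc-eval` actually parses that value is not shown in the source; the sketch below is one plausible reading, with a hypothetical function name, that rejects an exclusion with no justification.

```python
from typing import Tuple

# Persona identifiers from the persona output schema.
KNOWN_PERSONAS = {"architect", "skeptic", "operator", "executioner", "first_principles"}


def parse_exclusion(value: str) -> Tuple[str, str]:
    """Split 'persona:reason' on the first colon and require both parts."""
    persona, separator, reason = value.partition(":")
    persona, reason = persona.strip(), reason.strip()
    if persona not in KNOWN_PERSONAS:
        raise ValueError(f"unknown persona: {persona!r}")
    if not separator or not reason:
        # Opt-OUT policy: the burden of proof is on exclusion.
        raise ValueError(f"exclusion of {persona!r} requires a justification")
    return persona, reason
```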
The Council is a loop, not a gate. When the Council fails a brief, the Build Brief Agent revises and re-submits automatically. The human never sees a rejected brief. The human sees only the final, Council-approved package.
Build Brief Agent completes
│
├─→ Eval Council (post_brief checkpoint)
│ │
│ ├── APPROVED → proceed to preparation phase
│ │
│ ├── REVISION REQUIRED → ┐
│ │ ├─→ Build Brief Agent revises based on findings
│ │ ├─→ Re-submit to Eval Council
│ │ └─→ Max 3 iterations, then escalate to human
│ │
│ └── BLOCKED (after 3 iterations) → Escalate to engineer with all findings
│
├─→ Skills execute (document emitters, work-item emitters, Scaffolding, QA Tests)
│ │
│ └─→ Eval Council (post_skill_output, per skill)
│ ├── APPROVED → output committed
│ └── REVISION REQUIRED → skill re-runs with findings (auto-retry)
│
├─→ Codegen executes (coding agents make tests pass)
│
├─→ Eval Council (pre_deploy checkpoint)
│ │
│ ├── APPROVED → package ready for human review
│ └── BLOCKED → flag issues, coding agent fixes, re-verify
│
├─→ ════════════════════════════════════════════════
│ SINGLE HUMAN GATE: Engineer reviews complete package
│ - Research deliverable
│ - Build Brief (Council-approved)
│ - Council report (what was validated)
│ - PR with generated code
│ - All tests green
│ - Runbook
│ ════════════════════════════════════════════════
│ │
│ ├── APPROVED → deploy
│ └── CHANGES REQUESTED → loop back to Build Brief with feedback
│
└─→ [If incident occurs post-deploy]
│
└─→ Eval Council (post_incident checkpoint)
│
└── Learning output → feeds back into Build Brief Agent + Council criteria
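The post-brief portion of the flow above is a bounded revise-and-resubmit loop. A minimal sketch follows, assuming "max 3 iterations" means at most three council evaluations; `evaluate` and `revise` are hypothetical callables standing in for the Eval Council and the Build Brief Agent.

```python
from typing import Any, Callable

APPROVING_VERDICTS = {"APPROVED", "APPROVED WITH CONCERNS"}


def run_brief_loop(
    brief: Any,
    evaluate: Callable[[Any], dict],
    revise: Callable[[Any, list], Any],
    max_iterations: int = 3,
) -> dict:
    """Evaluate, revise on rejection, and escalate after max_iterations."""
    report: dict = {}
    for iteration in range(1, max_iterations + 1):
        report = evaluate(brief)
        if report["verdict"] in APPROVING_VERDICTS:
            return {"status": "approved", "brief": brief,
                    "iterations": iteration, "report": report}
        if iteration < max_iterations:
            # Revise only when another evaluation will follow.
            brief = revise(brief, report["findings"])
    # Still rejected after the last evaluation: escalate with all findings.
    return {"status": "escalated", "brief": brief,
            "iterations": max_iterations, "report": report}
```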
When the Council returns REVISION REQUIRED:
Typical iteration count:
The Eval Council posts its verdicts through the Slack Orchestration Skill:
APPROVED:
✅ *Eval Council: [Feature Name] — APPROVED*
5/5 personas passed. 2 minor observations logged.
Ready for engineer review.
REVISION REQUIRED:
⚠️ *Eval Council: [Feature Name] — REVISION REQUIRED*
3 critical findings, 2 major findings.
*Critical:*
• [C-001] Task BE-003 has no error handling acceptance criteria
• [C-002] Rollback plan references non-existent feature flag
• [C-003] Schema migration has no rollback script task
@[owner] — findings attached. Please revise and re-submit.
BLOCKED:
🚫 *Eval Council: [Feature Name] — BLOCKED*
Multiple personas flagged critical issues.
*Architect:* Service boundary violation — domain imports infrastructure
*Skeptic:* Rollback plan is not executable
*Executioner:* 6 of 14 tasks have vague acceptance criteria
This brief cannot proceed to approval. @[owner] — see full report.
The Eval Council must also be evaluated. These meta-criteria prevent the council from becoming rubber-stamp theater: