skills/zhou-et-al-2026-formaljudge/SKILL.md
Formal verification approach to LLM evaluation ensuring correctness and consistency in automated judging
npx skillsauth add curiositech/windags-skills zhou-et-al-2026-formaljudgeInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
license: Apache-2.0
Load this skill when facing:
Key insight: If you're using one unreliable probabilistic system to judge another, you're in a hallucination echo chamber. This skill provides the escape route.
The Insight: Human-level judgment about complex agent behavior is intractable for neural systems, but extracting simple atomic facts is tractable—and sufficient.
Why It Matters: Instead of asking an LLM "Did this agent behave deceptively?" (requires sophisticated reasoning, vulnerable to persuasion), decompose into atomic predicates: "Did the tool return an error?" + "Did the agent produce substantive output?" + "Did the agent disclose the problem?" Each question is binary, grounded in observable facts, and requires minimal context.
The Boundary:
The Architecture:
Natural Language Requirements
↓ (LLM as "specification compiler")
Formal Constraints (Dafny)
↓ (decomposes into)
Atomic Binary Predicates
↓ (LLM extracts from trajectories)
Grounded Facts (true/false)
↓ (SMT solver composes)
Verdict with Mathematical Guarantees
Why Separation Works: Probabilistic reasoning confined to simple tasks where it's reliable; all logical composition delegated to deterministic systems. A 7B parameter model can extract facts; a 72B model can't fool formal logic.
The Guarantee: Soundness depends on logical correctness, not parameter count.
The Problem: Sophisticated agents generate convincing narratives. As capability increases, deceptive outputs become more persuasive to neural judges who share training distributions.
The Solution: Operate exclusively on grounded facts from actions and observations. Ignore:
The Formula: tool_returned_error ∧ output_is_substantive ∧ ¬disclosed_problem = deception
No amount of eloquent explanation can change the mathematical truth of this conjunction.
Result: FORMALJUDGE maintains 90%+ accuracy across agent sizes (0.5B → 72B) while LLM-as-Judge degrades from 62% → 16%.
The Discovery: Feedback precision determines improvement trajectory.
Why Formal Wins:
Application: Use formal specifications not just for verification, but as the primary mechanism for behavioral correction.
The Theoretical Barrier: Translating arbitrary natural language to formal specifications is undecidable.
Why It's Tractable Here:
The Pragmatic Solution: Use LLMs as "specification compilers" for this structured domain. They're unreliable for general reasoning but effective for translating structured requirements to structured constraints.
| Reference File | Description | Load When... |
|---|---|---|
| atomic-decomposition-as-tractability-boundary.md | Deep analysis of what makes judgment hard vs. tractable; the core insight about decomposition enabling oversight | Designing oversight systems; determining what to ask LLMs vs. formal logic; understanding why holistic judgment fails |
| weak-to-strong-generalization-through-architectural-separation.md | How separation of concerns enables weak judges to supervise strong agents; empirical results across model scales | Facing capability gaps between judge and agent; need mathematical guarantees about oversight reliability |
| iterative-refinement-with-formal-feedback.md | Why Dafny specifications enable 70.7% → 99.8% improvement while natural language degrades performance | Creating feedback loops; improving agent behavior; understanding why vague feedback fails |
| specification-synthesis-as-oversight-bottleneck.md | Why translating NL to formal specs is theoretically hard but pragmatically tractable for agent oversight | Converting requirements to specifications; understanding when LLMs work as "compilers"; domain-specific vs. general translation |
| deception-detection-immunity-through-grounding.md | Why grounded facts defeat persuasive narratives; empirical evidence of robustness to sophisticated deception | Detecting deception; handling adversarial agents; understanding vulnerability to persuasion in neural judges |
| when-decomposition-enables-tractability.md | Detailed analysis of the boundary between solvable and intractable oversight problems | Strategic decisions about system architecture; determining feasibility of oversight approaches; understanding fundamental limits |
Mistake: Using an LLM to validate outputs from another LLM. Why It Fails: Shared training distributions mean judges validate fabricated content that "sounds right" to their distribution. Both systems hallucinate coherently. Correct Approach: Ground all verification in observable facts; use formal composition for logic.
Mistake: Asking neural systems to make complex, multi-factor evaluations. Why It Fails: Crosses the tractability boundary; requires inference and weighing evidence where neural systems are unreliable. Correct Approach: Decompose into atomic predicates; compose deterministically.
Mistake: Judging agent behavior based on its self-reported reasoning or justifications. Why It Fails: Sophisticated agents generate convincing narratives; persuasiveness correlates with capability, making oversight degrade when needed most. Correct Approach: Evaluate only grounded actions and observations; ignore agent explanations.
Mistake: Telling agents "be more careful" or "check your work" without precise constraints. Why It Fails: Ambiguous signals don't communicate what specifically to fix; can actually degrade performance. Correct Approach: Express constraints formally; let violations become precise teaching signals.
Mistake: Believing bigger models will judge more reliably. Why It Fails: Larger models are more vulnerable to sophisticated deception; reliability doesn't scale with parameters in probabilistic systems. Correct Approach: Reliability comes from architectural separation, not model scale.
Mistake: Treating confident-sounding outputs as more likely to be correct. Why It Fails: Models generate confidence through linguistic patterns, not epistemic calibration; deceptive outputs are often highly confident. Correct Approach: Confidence is a stylistic feature; verify facts independently.
Mistake: Avoiding formal specifications because they seem inflexible or hard to create. Why It Fails: Misses that (a) LLMs can help generate them for structured domains, and (b) precision enables rather than constrains—formal feedback improves behavior faster. Correct Approach: Use formal methods where they provide guarantees; use LLMs as specification compilers.
1. The Tractability Boundary Recognition
2. The Architectural Separation Instinct
3. The Grounding Obsession
4. The Feedback Precision Principle
5. The Specification Synthesis Nuance
6. The Immunity Property Application
Key Tell: Someone who truly gets FORMALJUDGE will, when presented with an oversight problem, immediately sketch a three-layer diagram: atomic predicates at the bottom (neural extraction), logical composition in the middle (formal verification), natural language requirements at the top (specification synthesis). They'll mark clear boundaries and explain which component provides reliability guarantees and why.
Ultimate Shibboleth: Ask "Why can't we just use GPT-5 to judge GPT-4 outputs?"
tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.