skills/mine/agent-output-audit/SKILL.md
Audits AI-implemented work for honest completion. Runs independent-evaluator checks against task artifacts, transcripts, tests, CI evidence, requirement-to-test mapping, status front matter, and quality gates; flags skipped tests, weakened assertions, mock-only confidence, snapshot drift, happy-path-only coverage, flaky retries, and status/evidence mismatches. Use when validating completed Compozy tasks, AI-authored PRs, or codex-loop iterations. Do not use for real-user QA, persona/journey testing, exploratory charters, or product usability sessions; use qa-execution for those.
Install this skill globally with one command: `npx skillsauth add pedronauck/skills agent-output-audit`. Works with Claude Code, Cursor, and Windsurf.
Independent verification of AI-implemented work. The skill that asks: "Did the implementing agent actually do what task_NN.md says it did?" — not "Would a real user succeed at this product?" (that's qa-execution).
Match your task to the row. Read the listed files in full before producing output. They are not appendices — they are load-bearing. Inline content in this SKILL.md is a pointer, not a substitute.
| Task | MUST read |
| -------------------------------------------------------------------- | ---------------------------------------------------------------------------------------- |
| Discovering install/lint/test/build/start commands (Step 1) | references/project-signals.md |
| Deciding E2E support and classifying coverage (Step 1) | references/e2e-coverage.md |
| Building the audit scope checklist (Step 2) | references/checklist.md |
| Holding independent-evaluator stance on AI tasks (Step 3) | references/independent-evaluator-protocol.md |
| Scanning test diffs for AI hygiene red flags (Step 4) | references/ai-implementation-audit.md |
| Diagnosing a test that passed on retry without a code change | references/flaky-triage.md |

- `references/project-signals.md`: Heuristics for picking install/lint/test/build/start commands across ecosystems when the repo lacks an umbrella gate.
- `references/e2e-coverage.md`: Taxonomy for `existing-e2e` / `needs-e2e` / `manual-only` / `blocked` and how to detect harness support.
- `references/checklist.md`: Audit checklist by category: contract discovery, baseline, task audit, AI hygiene, flaky detection, quality gates.
- `references/ai-implementation-audit.md`: Red Flag scanners (RF-1..RF-6), Requirement→Test mapping, verdict matrix for completed tasks.
- `references/independent-evaluator-protocol.md`: What counts (and doesn't count) as evidence; transcript classification (`genuine-failure` / `grader-bug` / `ambiguous-task` / `bypass-exploit`).
- `references/flaky-triage.md`: Taxonomy, diagnosis protocol, and quarantine workflow for retry-passes-without-code-change failures.

Fallback audit output path: `/tmp/agent-output-audit-<slug>`.

Step 1: Discover the Repository Verification Contract
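
The output-path rule in this step (explicit argument first, `/tmp/agent-output-audit-<slug>` as the last fallback, everything stored under `audit/`) can be sketched roughly as follows. This is an illustration under assumptions, not the skill's own code: `resolve_audit_dir` and its `tmp_root` parameter are hypothetical names, and the intermediate "repository conventions" lookup is left out because the source does not specify it.

```python
from pathlib import Path
from typing import Optional


def resolve_audit_dir(
    slug: str,
    audit_output_path: Optional[str] = None,
    tmp_root: str = "/tmp",  # hypothetical knob so the fallback can be redirected
) -> Path:
    """Return <audit-output-path>/audit, creating it when it does not exist."""
    if audit_output_path:
        base = Path(audit_output_path)
    else:
        # Last-resort fallback named in the skill text.
        base = Path(tmp_root) / f"agent-output-audit-{slug}"
    audit_dir = base / "audit"
    audit_dir.mkdir(parents=True, exist_ok=True)
    return audit_dir
```

Keeping the fallback in one function makes it harder for bugs and reports to end up in different directories within a single run.
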
- Run `python3 scripts/discover-project-contract.py --root .` to surface candidate install, verify, build, test, lint, and start commands, plus E2E signals.
- Read `references/project-signals.md` in full before picking commands when discovery surfaces more than one plausible gate or the repo mixes ecosystems.
- Read `references/e2e-coverage.md` in full before classifying any flow.
- Prefer `make verify`, `just verify`, or CI entrypoints over language-default commands.
- If the caller supplies an `audit-output-path` argument, use it. Otherwise use repository conventions, falling back to `/tmp/agent-output-audit-<slug>`. Create the `audit/` subdirectory; store all bugs and reports under `<audit-output-path>/audit/`.
- If `.compozy/tasks/<slug>/` exists, record the slug and switch into Compozy-aware audit:
  - Read `state.yaml` (read-only: never write to it; `scripts/update-state.py` owns mutation per the cy-codex-loop contract).
  - Read `_techspec.md` (deliverable source of truth) and `_tasks.md` (task roster) when present.
  - Read each `task_NN.md` and capture its frontmatter `status:` value (allowed: `pending`, `in_progress`, `completed`). When `task_NN.md` frontmatter disagrees with `state.yaml`, treat frontmatter as the source of truth.
  - Locate `.compozy/tasks/<slug>/memory/qa-execution.md`; Step 4 writes audit notes there before any status flips.

Step 2: Run the Baseline Verification Gate
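
The retry rule in this step (a pass that only appears on retry is never promoted to PASS) can be illustrated with a small classifier. This is a sketch under assumptions: `RetryRecord` and `classify` are hypothetical names, and the attempts are assumed to have run with no code change in between.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetryRecord:
    name: str
    attempts: List[bool]  # ordered outcomes, True = passed; no code change between attempts


def classify(record: RetryRecord) -> str:
    """Map retry history to PASS, flaky-suspect, or FAIL."""
    if not record.attempts:
        raise ValueError(f"{record.name}: no attempts recorded")
    if record.attempts[0]:
        return "PASS"
    if any(record.attempts[1:]):
        # Passed only on retry: record it in the suite health snapshot, do not call it PASS.
        return "flaky-suspect"
    return "FAIL"
```

The suspected category is intentionally not assigned here; the skill reserves that for after `references/flaky-triage.md` has been read.
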
- If a test fails and then passes on retry without a code change, mark it `flaky-suspect`, record it in `audit-report.md` under SUITE HEALTH SNAPSHOT (test name, attempts, retry outcome, suspected category), and do NOT promote it to PASS via retry. STOP. Read `references/flaky-triage.md` in full before assigning a suspected category or proposing a quarantine.

Step 3: Audit Task Implementations (Compozy mode and any AI-implemented tasks)
Skip this step only when no task, phase, PRD, tech spec, or implementation-plan artifacts exist.
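
The relationship between `declared_status` and `qa_verdict` in this step's matrix can be made concrete with a small sketch. It assumes a reduced row with only three of the matrix columns; `MatrixRow` and `apply_reopen_rule` are hypothetical names, not part of the skill.

```python
from dataclasses import dataclass


@dataclass
class MatrixRow:
    task_path: str
    declared_status: str  # literal frontmatter value: pending | in_progress | completed
    qa_verdict: str       # auditor's finding before the reopen check: PASS | PARTIAL | FAIL | BLOCKED


def apply_reopen_rule(row: MatrixRow) -> str:
    """A task declared completed whose audit verdict is PARTIAL or FAIL becomes REOPEN."""
    if row.declared_status == "completed" and row.qa_verdict in {"PARTIAL", "FAIL"}:
        return "REOPEN"
    return row.qa_verdict
```

Storing the two fields separately is what allows the audit to disagree with the implementing agent's own status claim.
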
- Read `references/independent-evaluator-protocol.md` in full before forming any task verdict. Tripwire summary: never accept the implementing agent's transcript, success message, or memory note as evidence. In Compozy mode, read the implementing agent's `.compozy/tasks/<slug>/memory/<phase>.md` artifacts and classify anomalies (`genuine-failure` / `grader-bug` / `ambiguous-task` / `bypass-exploit`) in the Errors / Corrections section of `memory/qa-execution.md` before judging the task.
- Read the frontmatter of each `task_NN.md` and its body. Summarize each task into a Task Implementation Matrix (column names mirror cy-codex-loop frontmatter):
  - `task_path` (e.g., `.compozy/tasks/<slug>/task_07.md`)
  - `declared_status`: literal frontmatter `status:` value
  - `title`, `type`, `complexity`, `dependencies`: mirrored from frontmatter
  - `techspec_deliverable`: linked section in `_techspec.md` when present
  - `implementation_evidence`: files, modules, routes, commands, migrations, seeds, tests
  - `verification_evidence`: commands executed, exit codes, output summaries
  - `qa_verdict`: `PASS` | `PARTIAL` | `FAIL` | `REOPEN` | `BLOCKED` (distinct from `declared_status`)
  - `ai_audit_findings`: red flag IDs that fired in Step 4 with verdict
  - `action`: `none` | `fixed` | `reopened-frontmatter` | `BUG-NNN.md` filed
  - `linked_bugs`: BUG IDs
- Do not accept `declared_status`, a checked checkbox, a memory note, or a prior agent summary as proof. Verify every completed or claimed-complete task against actual files, public behavior, automated tests, and acceptance criteria.
- Assign each task a `qa_verdict`:
  - `PASS`: every material requirement and success criterion has implementation and fresh verification evidence.
  - `PARTIAL`: implementation exists but one or more non-critical requirements, tests, or pieces of evidence are missing.
  - `FAIL`: claimed behavior does not work or a critical requirement is absent.
  - `REOPEN`: the source `task_NN.md` has `status: completed` in frontmatter but the QA verdict is `PARTIAL` or `FAIL`.
  - `BLOCKED`: audit cannot continue because a concrete prerequisite is missing.

Step 4: AI Test-Hygiene Scan (RF-1..RF-6)
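
As orientation for this step, the RF-1 check (skip or focus markers added in a test diff) could look roughly like the sketch below. It is not the scanner from `references/ai-implementation-audit.md`: `scan_rf1` and the regular expression are assumptions, and the input is assumed to be unified-diff text such as the output of `git diff <baseline_sha>..HEAD`.

```python
import re
from typing import List, Tuple

# Markers named in the skill text: .skip / .only / xit / t.Skip
SKIP_MARKERS = re.compile(r"\.skip\b|\.only\b|\bxit\b|\bt\.Skip\b")


def scan_rf1(diff_text: str) -> List[Tuple[int, str]]:
    """Return (diff line number, added line) pairs where a skip marker was inserted."""
    hits = []
    for lineno, line in enumerate(diff_text.splitlines(), start=1):
        # Only added lines count; "+++" is the new-file header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            if SKIP_MARKERS.search(line):
                hits.append((lineno, line[1:].strip()))
    return hits
```

Restricting the search to added lines keeps pre-existing skips out of the result; whether those deserve a separate finding is left to the reference file.
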
- Read `references/ai-implementation-audit.md` in full before scanning the test diff of any task with `declared_status: completed`. That file owns the Red Flag scanners (RF-1..RF-6), the Requirement→Test mapping rules, and the verdict matrix.
- Inspect the test history and diff (`git log --follow <test_file>`, `git diff <baseline_sha>..HEAD`).
- Mark the task `FAIL` automatically when scanners detect:
  - `.skip` / `.only` / `xit` / `t.Skip` inserted in the diff (RF-1).
  - `External Dependencies` as Integration/E2E (RF-3).
- Record fired red flags in `ai_audit_findings` and in the per-task block of `audit-report.md`.
- Build the Requirement→Test mapping per `references/ai-implementation-audit.md`. For every Success Criterion in `task_NN.md` (frontmatter or body) and every linked bullet in `_techspec.md`, find the corresponding test by name, reference, or assertion content. Mark each criterion `covers` / `weak` / `missing`. A checked item or `status: completed` without a `covers` row is an audit failure.

Step 5: Reopen, File Bugs, Write Memory
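
The memory-precedes-status invariant in this step can be enforced mechanically. The sketch below is one possible shape, not cy-codex-loop's implementation: `reopen_task` is a hypothetical helper, and it assumes frontmatter delimited by `---` lines with a single `status:` key.

```python
import re
from pathlib import Path

STATUS_LINE = re.compile(r"^status:[ \t]*\S+[ \t]*$", re.MULTILINE)


def reopen_task(task_file: Path, memory_file: Path, new_status: str = "pending") -> None:
    """Flip frontmatter status, but only after the QA memory file has been written."""
    if new_status not in {"pending", "in_progress"}:
        raise ValueError(f"unsupported reopen status: {new_status}")
    if not memory_file.is_file() or not memory_file.read_text().strip():
        raise RuntimeError("write memory/qa-execution.md before flipping status")
    text = task_file.read_text()
    if not text.startswith("---\n"):
        raise ValueError("task file has no frontmatter block")
    end = text.find("\n---", 4)  # closing delimiter of the frontmatter
    if end == -1:
        raise ValueError("frontmatter block is not closed")
    front, rest = text[:end], text[end:]
    new_front, count = STATUS_LINE.subn(f"status: {new_status}", front, count=1)
    if count != 1:
        raise ValueError("no status: line found in frontmatter")
    task_file.write_text(new_front + rest)
```

The helper never touches `state.yaml`, and the substitution is limited to the frontmatter so that body text mentioning `status:` is left alone.
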
For each task marked `REOPEN` in the matrix:

- Write `.compozy/tasks/<slug>/memory/qa-execution.md` using the canonical sections required by cy-codex-loop: Objective Snapshot, Important Decisions, Learnings, Files / Surfaces, Errors / Corrections, Ready for Next Run. This file must be written before any `task_NN.md` frontmatter is flipped (memory-precedes-status invariant).
- Flip the `task_NN.md` frontmatter `status:` back to `pending` (or `in_progress` if salvageable). Never write to `state.yaml`: cy-codex-loop's `update-state.py` owns mutation, and frontmatter wins because the next iteration reconciles from it.
- File `BUG-<num>.md` under `<audit-output-path>/audit/issues/` using `assets/issue-template.md`. Include:
  - `Reopens task:`
  - `Summary:`
  - `Root cause:`
  - `Automation Follow-up:` notes
  - `Related:`

Step 6: Quality Gates Verdict
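
The gate rule in this step (each gate is PASS / FAIL / N/A, and a single FAIL blocks an unconditional PASS) reduces to a small check. The sketch below is illustrative only; the function names and the gate keys in the usage note are hypothetical.

```python
from typing import Dict, List

ALLOWED_STATUSES = {"PASS", "FAIL", "N/A"}


def failing_gates(gates: Dict[str, str]) -> List[str]:
    """Return the names of gates whose status is FAIL, rejecting unknown statuses."""
    unknown = {name: status for name, status in gates.items() if status not in ALLOWED_STATUSES}
    if unknown:
        raise ValueError(f"unexpected gate status: {unknown}")
    return [name for name, status in gates.items() if status == "FAIL"]


def allows_unconditional_pass(gates: Dict[str, str]) -> bool:
    """True only when no gate failed; N/A gates do not block."""
    return not failing_gates(gates)
```

For example, `allows_unconditional_pass({"ai_test_hygiene": "PASS", "p0_flaky_suspect": "FAIL"})` evaluates to `False`, and `failing_gates` names the gate to cite in the report.
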
Record the quality gates in `audit-report.md`. Each gate is PASS / FAIL / N/A:

- No `FAIL` from AI test-hygiene audit on P0/P1 tasks.
- No `Critical` / `High` issues open.
- No unresolved `flaky-suspect` on P0 flows.

A `FAIL` on any gate blocks an unconditional PASS verdict for the run.

Step 7: Write the Audit Report
- Start from `assets/audit-report-template.md` and write the report to `<audit-output-path>/audit/audit-report.md`.
- Include filed bugs with their `Reopens task:` annotations.
- An `audit-report.md` PASS feeds cy-codex-loop's `verify.last_status=PASS` precondition for Phase E. Do not call `update-state.py`; cy-codex-loop owns that mutation.

Edge cases and failure handling:

- If the repo lacks an umbrella gate, read `references/project-signals.md`, choose the broadest safe install, lint, test, and build commands for the detected ecosystem, and state that assumption explicitly.
- When a concrete prerequisite is missing, mark the affected item `blocked` and report the exact prerequisite that is missing.
- If `task_NN.md` files are marked `status: completed` but contain unchecked subtasks, missing deliverables, or unverified criteria, do not call the audit a pass. Write `memory/qa-execution.md` first, then edit frontmatter `status:` back to `pending` or `in_progress`, and file `BUG-<num>.md` per Step 5. Never write to `state.yaml`.
- If a test passes on retry without a code change, classify it as `flaky-suspect` per `references/flaky-triage.md`, record the event in the Suite Health Snapshot, and treat any unresolved `flaky-suspect` on a P0 flow as a blocker for the final verdict.
- If a red flag fires on a task with `declared_status: completed`, do not call the audit a pass. Apply the verdict matrix in `references/ai-implementation-audit.md`, file `BUG-<num>.md` with Type Functional, and flip frontmatter `status:` per Step 5.

`agent-output-audit` validates that the implementing AI agent did what it claimed. `qa-execution` validates that a real human user can succeed at the product. They are complementary, not redundant:

- Use `agent-output-audit` to certify that `task_NN.md` `status: completed` reflects real work.
- Use `qa-execution` to certify that the product, taken as a whole, is acceptable to end users.

A Compozy slug typically wants both: audit the task implementations, then exercise the resulting product through user-flow QA. They share no output directory, no bug taxonomy, and no procedures, so keep them separate.