aops-core/skills/craft/SKILL.md
Instruction quality gate — reviews agent instructions (task bodies, workflow steps, skill procedures, self-test protocols) for shallow-execution vulnerabilities before deployment. Two modes: author (pre-hoc review) and audit (trace a failure back to the instruction gap). The bar is excellence, not compliance.
Agents optimise for the shallowest valid interpretation of their instructions. An instruction that says "run tests and report results" produces an agent that reads the last line of pytest output and declares green — even when the hook logs contain schema validation errors, the JSONL transcript records silent failures, and the full output pipeline is broken.
The gap isn't in the agent. It's in the instruction. Shallow instructions produce shallow execution. No amount of downstream enforcement fixes this.
This skill is the quality gate that prevents shallow instructions from reaching agents.
Author mode — you have instructions (a task body, a workflow step, a self-test protocol, a polecat dispatch brief). Before deploying them, invoke /craft to review for shallow-execution vulnerabilities.
Audit mode — an agent underperformed or missed something. You have the transcript. Invoke /craft audit to trace the failure back to the instruction gap and propose a rewrite.
Review the instructions against these defect classes. Any one is sufficient to reject.
The instruction defines success as "did it run?" instead of "did it produce correct, complete, verified output?"
| Defect | Fix |
| ------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| "Run tests and report results" | "Run tests. Read the full output — not just the summary line. Read every line and note anything unexpected: errors, warnings, deprecations, schema mismatches, permission denials. Report the count AND any anomalies found anywhere in the output." |
| "Check the output" | "Read every line of <specific file>. Verify that <specific expected content> is present and <specific error patterns> are absent." |
| "Confirm it works" | "Invoke <specific action>. Verify the output matches <specific expected state>. If it doesn't, report exactly what you observed." |
Test: If an agent could satisfy the instruction by reading a single summary line and reporting "all good," the instruction has compliance framing.
The instruction names the primary output channel but not the secondary ones where failures hide.
Every system has multiple output channels. A polecat dispatch produces stdout (summary), JSONL transcript (raw session), hook logs (hook events including errors), session JSON (gate states), and enforcer reports. A CI run produces stdout, stderr, exit code, artifact uploads, and log files. Instructions that only check one channel miss failures in the others.
Test: List every artifact the system produces. If the instruction doesn't name at least the top three, it has a missing artifact chain.
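The test above can be approximated mechanically as a first pass. The sketch below is illustrative only: the function names are hypothetical, the artifact list is taken from the polecat example in the paragraph above, and it counts any three named artifacts rather than ranking them, so deciding which artifacts are the "top three" remains the reviewer's call. Substring matching is a prompt for a closer read, not a replacement for reading the instruction.

```python
def named_artifacts(instruction: str, artifacts: list[str]) -> list[str]:
    """Return the artifacts that the instruction mentions by name (case-insensitive)."""
    text = instruction.lower()
    return [a for a in artifacts if a.lower() in text]


def has_missing_artifact_chain(instruction: str, artifacts: list[str], minimum: int = 3) -> bool:
    """True when the instruction names fewer than `minimum` of the known artifacts."""
    required = min(minimum, len(artifacts))
    return len(named_artifacts(instruction, artifacts)) < required


# Artifact names assumed from the polecat dispatch example; adjust per system.
polecat_artifacts = ["stdout", "JSONL transcript", "hook log", "session JSON", "enforcer report"]

shallow = "Run the dispatch and check stdout for the summary."
deep = (
    "Read stdout in full, then read the JSONL transcript and every hook log entry. "
    "Compare gate states in the session JSON against the expected values."
)

print(has_missing_artifact_chain(shallow, polecat_artifacts))  # True: only stdout is named
print(has_missing_artifact_chain(deep, polecat_artifacts))     # False: four artifacts are named
```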
The instruction doesn't ask "what would fail silently?" Silent failures are the most dangerous class — the system appears healthy, the agent declares success, and the actual failure propagates undetected.
Common silent failure patterns:
- warn instead of block (warning emitted but agent continues regardless)

Test: Can you name a failure mode that would produce zero visible errors in the primary output channel? If yes, and the instruction doesn't check for it, this defect is present.
The instruction accepts a summary or claim as evidence instead of requiring independent verification.
| Defect | Fix |
| ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| "The agent reported all tests passed" | "Read the test output file directly. Count passed/failed/error yourself." |
| "No errors were found" | "Read <specific files> in full. For each file, state what you observed. If you found nothing unexpected, say so explicitly and name what you checked." |
| "The task completed successfully" | "Verify the expected output artifact exists at <path>, contains <expected content>, and was modified after <timestamp>." |
Test: If the instruction's verification step could be satisfied by quoting the agent's own summary, it has summary-as-evidence.
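As one possible illustration of the last fix in the table, the sketch below checks an artifact independently instead of trusting a completion claim: it confirms the file exists, contains the expected content, and was modified after a given timestamp. The function name, file names, and sample content are all hypothetical, and the demo compares against the file's own modification time so the result does not depend on clock resolution.

```python
import tempfile
from pathlib import Path


def verify_artifact(path: Path, expected: str, not_before: float) -> list[str]:
    """Return a list of observed problems; an empty list means every check passed."""
    if not path.exists():
        return [f"{path} does not exist"]
    problems = []
    if path.stat().st_mtime < not_before:
        problems.append(f"{path} was last modified before the cutoff timestamp")
    if expected not in path.read_text(encoding="utf-8"):
        problems.append(f"{path} does not contain {expected!r}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / "report.txt"  # hypothetical artifact name
    report.write_text("12 passed, 0 failed\n", encoding="utf-8")
    mtime = report.stat().st_mtime

    fresh = verify_artifact(report, "12 passed", not_before=mtime - 60)
    stale = verify_artifact(report, "12 passed", not_before=mtime + 60)
    missing = verify_artifact(Path(tmp) / "absent.txt", "12 passed", not_before=0.0)

print(fresh)         # [] : exists, has the content, modified after the cutoff
print(len(stale))    # 1  : modified before the cutoff
print(len(missing))  # 1  : file does not exist
```

Returning the observed problems, instead of a bare boolean, keeps the report tied to what was actually checked.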
The instruction doesn't tell the agent what to do when it reaches the edge of its search space and finds nothing.
The most dangerous instruction gap: the agent finds no problems in the obvious place, concludes there are no problems, and stops. It never looks in the non-obvious places because the instruction didn't tell it to.
| Defect | Fix |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| "Check the logs for errors" | "Check <primary log> for errors. If you find none, also check <secondary log> and <tertiary source> before declaring clean." |
| "Verify the hook fired" | "Verify the hook output appears in <expected channel>. Also verify it does NOT appear in <wrong channel> (routing inversion). Also check <error log> for schema validation rejections." |
Test: If the instruction has a verification step with only one place to look, and the agent finds nothing there, what does it do? If the answer is "declare success," this defect is present.
The instruction doesn't require the agent to actually read all the output. It assumes a summary or a grep is enough.
QA is expensive. It is always expensive. The instruction must not flinch from that cost. When an agent runs a system and needs to verify correctness, the instruction must require reading the full output of every artifact — not skimming, not grepping for keywords, not reading the last 10 lines. Reading. Every. Line.
Keyword grep is shitty NLP (P#49). An agent that greps for "error" and finds nothing declares success, but the actual failure said "Hook JSON output validation failed" or "Invalid input", or used other vocabulary the grep didn't anticipate. The fix is not a better grep. The fix is to read the output and understand it.
| Defect | Fix |
| ----------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| "Grep the logs for errors" | "Read the full hook log. For each entry, verify the output was accepted by the consumer. Report any entries where the output was rejected, malformed, or produced an unexpected result." |
| "Check the last 10 lines of output" | "Read the complete output. Note anything unexpected: not just errors, but warnings, deprecations, schema mismatches, permission denials, and any line that doesn't match the expected happy-path output." |
| "Scan for failures" | "Read every artifact the system produced. For each one, state what you observed. If an artifact is missing that should exist, note that. If an artifact contains content that doesn't belong, note that." |
Test: Does the instruction use "grep", "scan", "check for", or "look for" as its verification verb? If so, it's asking the agent to pattern-match instead of comprehend. Replace with "read" and "verify."
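Because this test is about the wording of the instruction itself, it lends itself to a small lint pass during review. The sketch below is a minimal, hypothetical helper that flags the four verbs named above; the verb list is a starting point and not exhaustive. A hit means the instruction deserves a rewrite. A clean result does not mean the instruction is deep.

```python
import re

# Verbs named in the test above; extend as the team's vocabulary requires.
SHALLOW_VERBS = ["grep", "scan", "check for", "look for"]


def shallow_verbs(instruction: str) -> list[str]:
    """Return the pattern-matching verbs that appear as whole words or phrases."""
    found = []
    for verb in SHALLOW_VERBS:
        # \b keeps "scan" from matching inside words such as "scanner".
        if re.search(rf"\b{re.escape(verb)}\b", instruction, flags=re.IGNORECASE):
            found.append(verb)
    return found


print(shallow_verbs("Grep the logs for errors, then scan for failures."))
# ['grep', 'scan']
print(shallow_verbs("Read the full hook log and verify each entry was accepted."))
# []
```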
The instruction only checks for the presence of expected output, never for the absence of unexpected output.
Positive verification ("does the expected thing exist?") catches omissions. Negative verification ("does anything unexpected exist?") catches corruption, leakage, and unintended side effects.
| Defect | Fix |
| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| "Verify the config file exists" | "Verify the config file exists AND contains no unexpected keys, no TODO/FIXME, no template variables like {placeholder}." |
| "Confirm the hook output reached the agent" | "Confirm the hook output reached the agent AND did not also leak to the user surface." |
Test: If the instruction only uses "verify X exists" or "confirm X happened" without any "also verify Y did NOT happen," this defect is present.
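To make the config-file fix above concrete, here is a minimal sketch of negative verification. It assumes a simple `key = value` format with `#` comments, which the original does not specify, and the function name, sample config, and allowed keys are hypothetical. It reports what should not be present: unexpected keys, leftover TODO/FIXME markers, and unfilled template variables.

```python
import re


def unexpected_content(config_text: str, allowed_keys: set[str]) -> list[str]:
    """List everything in a simple `key = value` config that should not be there."""
    findings = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue
        if re.search(r"\b(TODO|FIXME)\b", stripped):
            findings.append(f"line {number}: leftover TODO/FIXME")
        if re.search(r"\{[A-Za-z_]+\}", stripped):
            findings.append(f"line {number}: unfilled template variable")
        if stripped.startswith("#"):
            continue  # comments carry no key to validate
        key = stripped.split("=", 1)[0].strip()
        if key not in allowed_keys:
            findings.append(f"line {number}: unexpected key {key!r}")
    return findings


sample = """\
name = craft
owner = {placeholder}
# TODO: remove before release
debug = true
"""

for finding in unexpected_content(sample, allowed_keys={"name", "owner"}):
    print(finding)
# line 2: unfilled template variable
# line 3: leftover TODO/FIXME
# line 4: unexpected key 'debug'
```

An existence check alone would pass this file; every finding here comes from asking what is present that should not be.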
Given: a transcript where an agent failed, underperformed, or missed something the user had to catch.
This skill exists because academicOps is building a world-leading AI framework. The instructions we write for our agents define the upper bound of their performance. An agent cannot exceed the ambition of its instructions.
The standard is not "would a competent agent succeed with these instructions?" The standard is "do these instructions make it impossible for an agent to declare success without actually verifying success?"
Compliance is the floor. Excellence is the bar. The difference is in the instructions.
- /dogfood tests instructions by running them against a contextless agent. /craft reviews instructions by reading them. Use /craft before /dogfood Phase 2 (Commission Execution) as a pre-flight quality gate. If /craft says REVISE, fix the instructions before spending compute on a dogfood run.
- /design-rubric designs fitness criteria for user-facing deliverables. /craft designs quality criteria for agent-facing instructions. Same shape (design-time quality gate), different domain.
- /verify checks artifacts for correctness. /craft checks instructions for depth. An instruction is a type of artifact, but its quality criteria are about what it will PRODUCE, not what it IS.
- /survey retro reviews transcripts for problems. When retro finds a shallow-execution failure, classify it as root cause category "Instruction Gap" and reference /craft audit for the fix.

| Anti-pattern | Why it fails |
| -------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| "The instructions are clear enough" | Clear is not deep. "Check the output" is perfectly clear — and perfectly shallow. |
| Adding more steps without more depth | Ten shallow steps are not better than three deep ones. Depth is verification specificity, not step count. |
| Specifying tools instead of goals | "Run grep -r error" is brittle. "Search for error patterns in all output channels" is resilient. Name what to find, not how to find it. |
| Reviewing instructions without the system's failure vocabulary | You can't assess adversarial coverage without knowing how the system fails silently. Read the system's error handling before reviewing its instructions. |
| Declaring SHIP because no defects are obvious | The seven defects are common patterns, not an exhaustive list. If the instructions feel shallow but don't match a named defect, trust the feeling and articulate why. |