codex/skills/harness/SKILL.md
Review an agentic system’s configuration and implementation quality. Use when the user wants an opinionated assessment of a system prompt, tool surface, orchestration, guardrails, context handling, or eval setup, and wants concrete recommendations or a redesign plan.
npx skillsauth add tkersey/dotfiles harnessInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Review the agent as an engineering system, not just a prompt.
Harness audits the parts that usually determine whether an agent is actually reliable:
The output should be a judgment with evidence, not a neutral summary.
Use this skill when the user wants to:
Do not use this skill for:
Follow these steps in order.
Start by finding the actual control surface. Look for:
If repository search is needed, run:
./scripts/find_agent_surface.sh
or:
./scripts/find_agent_surface.sh <path>
If the script is unavailable or incomplete, perform an equivalent search manually.
Do not score from one file if the repo clearly has a wider agent surface.
Create a compact evidence map under these headings:
If key evidence is absent, say so before scoring.
Read references/scoring-rubric.md and use it directly.
This rubric is intentionally strict. Do not inflate scores because a demo “basically works.”
Apply hard caps from the rubric when evidence is missing.
Read references/research-basis.md before writing the final verdict.
Your job is not to list generic tradeoffs. Your job is to say what is solid, what is weak, and what should change first.
Examples of acceptable opinions:
Use assets/report-template.md as the report structure.
The report must include:
Judge the prompt on control quality, not eloquence.
Check for:
Penalize:
Review tools as contracts between code and a non-deterministic model.
Check for:
Reward tools that are easy for a new engineer — and for the model — to use correctly.
Penalize tools that dump huge payloads, expose vague parameters, or duplicate one another.
Prefer the simplest architecture that can succeed.
Default stance:
Check for:
Treat guardrails as part of the design, not an afterthought.
Check for:
A mature system should be inspectable.
Check for:
Do not call a system “strong” without runtime evidence.
If the user provides only a prompt, only a schema, or only a partial config:
Never hallucinate maturity.
Recommend the smallest reliability wins first:
Be direct. Be evidence-backed. Do not soften weak design into “just a tradeoff” if it is clearly brittle.
tools
Convert markdown plans into beads with dependencies using br CLI. Use when creating task graphs, polishing beads before implementation, or bridging planning to agent swarm execution.
development
Orchestrate Codex skill optimization during active sessions through $cas goal control, $shadow single-session evidence, $tune diagnosis/refinement briefs, and the skill-optimizer custom subagent. Trigger for $opt, skill optimization loops, session-driven skill tuning, meta-skill audits, or explicit validated skill edits. Do not use for general code optimization, product optimization, or performance tuning.
development
Run a targeted fresh-eyes blunder pass over code, specs, plans, adjudications, closure gates, skill edits, or negative-evidence ledgers. Trigger when asked to reread with fresh eyes, find obvious bugs, catch mistakes/oversights/omissions, check for embarrassing misses, or perform a second independent blunder pass before closure. Do not use as a substitute for implementation, adjudication, or verification; use it as the final falsification/check pass for those workflows.
development
Explicitly shadow, tail, watch, follow, monitor, supervise, or companion exactly one Codex session id/path through `$seq`, then apply a named target skill as an interpretation/reporting/proposal/action lens until the watched session stops.