skills/harness-engineering/SKILL.md
Design, audit, and improve the agent harness — the environment, constraints, context management, evaluation gates, and feedback loops that surround and guide AI agents. Inspired by OpenAI and Anthropic harness engineering best practices. Triggers: "harness", "harness audit", "improve harness", "agent environment", "context management", "evaluation gates", "feedback loop", "harness engineering".
Status protocol — end every session with one of: DONE (evidence provided) · DONE_WITH_CONCERNS (list each) · BLOCKED (state what blocks you) · NEEDS_CONTEXT (state what you need).
Auto-advance — pipeline: THINK → PLAN → REVIEW → BUILD → VERIFY → RELEASE. Only human gate is spec approval at THINK. On DONE at other stages, print [STAGE] DONE -> advancing to [NEXT-STAGE] and invoke the next skill. On any non-DONE status at any stage, STOP.
Output directory — all artifacts go in docs/superomni/<kind>/<kind>-[branch]-[session]-[date].md. See CLAUDE.md for the full directory map.
TACIT-DENSE — before high-tacit decisions, classify D1 (domain expertise) · D2 (user-facing UX) · D3 (team culture) · D4 (novel pattern). On hit, output TACIT-DENSE [D#]: [question] — My default: [recommendation]. See reference for actions.
Anti-sycophancy — take a position on every significant question. Name flaws directly. No filler ("that's interesting", "you might consider", "that could work").
Telemetry (local only) — at session end, log bin/analytics-log. Nothing leaves the machine.
See preamble-ref.md for detailed protocols.
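The auto-advance rule in the preamble above can be sketched as a small shell function. This is only an illustration, not the framework's implementation: the helper names `next_stage` and `advance` are hypothetical, and the messages printed for the THINK gate, for non-DONE statuses, and for the end of the pipeline are assumptions. Only the `advancing to` line comes from the protocol text.

```shell
#!/usr/bin/env bash
# Sketch of the auto-advance rule. Helper names are hypothetical.

# Print the stage that follows $1, or nothing if $1 is the last stage or unknown.
next_stage() {
  case "$1" in
    THINK)   echo PLAN ;;
    PLAN)    echo REVIEW ;;
    REVIEW)  echo BUILD ;;
    BUILD)   echo VERIFY ;;
    VERIFY)  echo RELEASE ;;
    *)       echo "" ;;
  esac
}

# advance STAGE STATUS
# Returns 0 when the pipeline may continue automatically, 1 when it must stop.
advance() {
  local stage="$1" status="$2" next
  if [ "$status" != "DONE" ]; then
    # Any non-DONE status stops the pipeline at every stage.
    echo "[$stage] $status -> STOP"
    return 1
  fi
  if [ "$stage" = "THINK" ]; then
    # Assumption: the human gate is modeled as a stop until the spec is approved.
    echo "[THINK] DONE -> waiting for spec approval"
    return 1
  fi
  next=$(next_stage "$stage")
  if [ -z "$next" ]; then
    echo "[$stage] DONE -> pipeline complete"
    return 0
  fi
  echo "[$stage] DONE -> advancing to [$next]"
}

advance BUILD DONE
advance REVIEW BLOCKED || true
```

Returning a non-zero status on every stop lets a calling script halt with a plain `advance "$stage" "$status" || exit`.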
Goal: Design and maintain the agent harness — the scaffolding of environment, context, tools, constraints, evaluation gates, and feedback loops that determine how well agents perform.
"Engineers design the system. Agents execute." — OpenAI Harness Engineering
THE HARNESS IS THE PRODUCT. CODE IS ITS OUTPUT.
A well-designed harness produces reliable, high-quality agent output without requiring manual intervention on every task. When agents fail repeatedly, the correct response is to improve the harness — not to keep retrying the same prompt.
| Principle | What it means in superomni |
|-----------|---------------------------|
| Context is everything | Agents can only work with what they can see, so keep docs, specs, and constraints in-repo and up to date |
| Fewer, more expressive tools | Prefer composable skills over sprawling tool menus |
| Evaluate relentlessly | Judgment gates must exist at every major transition point |
| Signal-driven iteration | Agent failures are design signals: update the harness, not just the prompt |
| Boring > clever | Prefer simple, composable patterns over novel abstractions |
| Garbage collection | Periodically audit for drift, stale docs, and architectural decay |
Take stock of the current harness state:
# Skill count + structure
ls skills/ | wc -l
ls skills/
# Agent count
ls agents/
# Command count
ls commands/
# Preamble size (context overhead)
wc -l lib/preamble.md
# Skill template sizes (larger = more context pressure)
wc -l skills/*/SKILL.md.tmpl | sort -n | tail -10
# Validation status
npm test 2>/dev/null || bash lib/validate-skills.sh 2>/dev/null
# Recent harness changes
git log --oneline -10 -- lib/ skills/ agents/ commands/
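# Any stale/out-of-date docs (not modified since the reference path /tmp)
# Note: find has no -older predicate; "! -newer" is the portable equivalent
find docs/ -name "*.md" ! -newer /tmp 2>/dev/null | head -10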
Document findings:
Context window pressure is one of the most common causes of agent degradation. Audit the harness context load:
Review lib/preamble.md:
Target preamble size: < 150 lines. Flag if > 200 lines.
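The size thresholds above can be turned into a quick check. The following is a minimal sketch: the function name `classify_preamble` is hypothetical, and the `WARN` label for the 150 to 200 line band is an assumption, since the text only names a target and a flag threshold.

```shell
# classify_preamble LINE_COUNT -> OK | WARN | BLOATED
# OK: under the 150-line target. BLOATED: over the 200-line flag threshold.
# WARN (assumed label): everything in between.
classify_preamble() {
  local lines="$1"
  if [ "$lines" -lt 150 ]; then
    echo OK
  elif [ "$lines" -le 200 ]; then
    echo WARN
  else
    echo BLOATED
  fi
}

preamble="lib/preamble.md"
if [ -f "$preamble" ]; then
  count=$(wc -l < "$preamble" | tr -d ' ')
  echo "$preamble: $count lines ($(classify_preamble "$count"))"
fi
```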
For each skill > 200 lines, ask:
Does the framework expose only necessary context at each stage?
| Stage | Context needed | Currently loaded |
|-------|---------------|-----------------|
| Planning | spec, constraints | |
| Implementation | plan, code context | |
| Review | diff, standards | |
| Debug | error, minimal repro | |
Good harnesses load context on demand, not all at once.
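To find the skills that the 200-line question above applies to, something like the following could be used. It is a sketch under the assumption that skill templates live at `skills/*/SKILL.md.tmpl`, as in the inventory commands; the function name `oversized` is hypothetical.

```shell
# oversized THRESHOLD FILE... : print "LINES PATH" for each file longer than THRESHOLD lines.
oversized() {
  local threshold="$1" f n
  shift
  for f in "$@"; do
    # Skip unmatched globs and anything that is not a regular file.
    [ -f "$f" ] || continue
    n=$(wc -l < "$f" | tr -d ' ')
    if [ "$n" -gt "$threshold" ]; then
      echo "$n $f"
    fi
  done
}

# Largest templates first.
oversized 200 skills/*/SKILL.md.tmpl | sort -rn
```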
Per Anthropic's principle: fewer, more expressive tools outperform large menus of narrow ones.
Review the agent's tool access:
# Check allowed-tools across all skills
grep "allowed-tools" skills/*/SKILL.md.tmpl
For each skill, evaluate:
Recommended tool sets by role:
| Role | Minimal tool set |
|------|----------------|
| Planning / Brainstorming | Read, Write, Glob |
| Implementation | Bash, Read, Write, Edit, Grep, Glob |
| Review / Audit | Read, Grep, Glob |
| Debug | Bash, Read, Grep, Glob |
Flag any skill whose tool set exceeds its role's minimum.
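The flagging rule above can be sketched as a comparison against the recommended sets. The role keys (`planning`, `implementation`, `review`, `debug`) and the function names are hypothetical shorthand for the table rows. How a skill is mapped to a role is not specified here, so the caller supplies it.

```shell
# minimal_tools ROLE : print the recommended minimal tool set (space-separated) for a role.
minimal_tools() {
  case "$1" in
    planning)       echo "Read Write Glob" ;;
    implementation) echo "Bash Read Write Edit Grep Glob" ;;
    review)         echo "Read Grep Glob" ;;
    debug)          echo "Bash Read Grep Glob" ;;
    *)              return 1 ;;
  esac
}

# extra_tools ROLE TOOL... : print each tool that is not in the role's minimal set.
extra_tools() {
  local role="$1" allowed tool
  shift
  allowed=$(minimal_tools "$role") || { echo "unknown role: $role" >&2; return 2; }
  for tool in "$@"; do
    case " $allowed " in
      *" $tool "*) ;;          # within the minimum, nothing to report
      *) echo "$tool" ;;       # exceeds the minimum
    esac
  done
}

extra_tools review Read Grep Bash Write   # prints Bash and Write, one per line
```

Any output means the skill should be flagged. Empty output means its tool set stays within the role's minimum.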
"Evaluation is the load-bearing part of agent harness design." — OpenAI/Anthropic harness engineering principles
Map every major workflow transition and verify an evaluation gate exists:
| Transition | Evaluation gate | Present? |
|-----------|----------------|---------|
| Spec → Plan | plan-review skill or planner-reviewer agent (planning mode) | |
| Plan → Execution | dependency analysis wave plan | |
| Execution Wave → Next Wave | wave verification step | |
| Implementation → Review | code-review skill or planner-reviewer agent | |
| Review → Ship | production-readiness skill | |
| Ship → Done | verification skill | |
| Sprint → Next Sprint | self-improvement skill | |
Any gap = harness deficiency. Add missing gates.
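A first pass over the "Present?" column can be partly automated for the gates that are skills. The sketch below assumes each gate skill lives in a directory named after it under `skills/`, which matches this skill's own path but is not confirmed for the others. The function name `missing_gates` is hypothetical. It only checks that a skill exists, not that the workflow actually invokes it.

```shell
# missing_gates SKILLS_DIR GATE... : print each gate skill with no directory under SKILLS_DIR.
missing_gates() {
  local dir="$1" gate
  shift
  for gate in "$@"; do
    [ -d "$dir/$gate" ] || echo "$gate"
  done
}

# Gate skill names taken from the transition table; directory names are assumed to match.
gates="plan-review code-review production-readiness verification self-improvement"

# Word splitting on $gates is intentional here.
missing=$(missing_gates skills $gates)
if [ -n "$missing" ]; then
  echo "Missing evaluation gates:"
  echo "$missing"
else
  echo "All listed gate skills are present"
fi
```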
A healthy harness converts agent failures into harness improvements:
Agent fails → Signal captured → Harness updated → Agent retries → Improvement
     ↑                                                                 |
     └─────────────────────────────────────────────────────────────────┘
Check the current feedback paths:
When an agent fails a task repeatedly (3+ attempts), is there a defined process to:
Does the self-improvement skill output get consumed?
ls docs/superomni/improvements/ 2>/dev/null | head -5
Is there a regular cadence for cleaning up:
Recommended: Schedule a harness GC pass after every 5 sprints.
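The two numeric triggers in this phase, three or more failed attempts and a GC pass every five sprints, can be expressed as simple predicates. This is a sketch only: the function names are hypothetical, and it assumes the caller already tracks attempt counts and sprint numbers, which this document does not specify.

```shell
# should_escalate ATTEMPTS : succeed when a task has failed often enough (3+) to count as a harness signal.
should_escalate() { [ "$1" -ge 3 ]; }

# gc_due SPRINT : succeed when a GC pass is due, i.e. on every 5th sprint.
gc_due() { [ "$1" -gt 0 ] && [ $(( $1 % 5 )) -eq 0 ]; }

for sprint in 4 5 10; do
  if gc_due "$sprint"; then
    echo "sprint $sprint: GC pass due"
  fi
done
```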
Score the harness on each dimension (1-5):
| Dimension | Score | Key Finding |
|-----------|-------|------------|
| Context efficiency | /5 | |
| Tool space minimalism | /5 | |
| Evaluation gate coverage | /5 | |
| Feedback loop completeness | /5 | |
| Documentation freshness | /5 | |
Total: __ / 25
Scoring guide:
For each finding from Phases 2-5 with a score < 4:
HARNESS IMPROVEMENT [N]: [TITLE]
Dimension: [context | tools | evaluation | feedback | docs]
Finding: [specific issue identified]
Impact: [how this degrades agent performance]
Fix: [concrete change to harness — specific file, section, or process]
Priority: [P0 — blocks agent / P1 — degrades quality / P2 — nice to have]
Generate a prioritized backlog. P0 items must be fixed before the next sprint.
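The scoring arithmetic and the "score < 4" selection rule can be sketched as follows. The function name `score_report`, the `name=score` argument format, and the example scores are all hypothetical. Assigning priorities still needs judgment and is not automated here.

```shell
# score_report "name=score" ... : print each dimension scoring below 4, then the total.
score_report() {
  local total=0 count=0 pair name score
  for pair in "$@"; do
    name=${pair%%=*}     # text before the first "="
    score=${pair##*=}    # text after the last "="
    total=$(( total + score ))
    count=$(( count + 1 ))
    if [ "$score" -lt 4 ]; then
      echo "NEEDS IMPROVEMENT: $name ($score/5)"
    fi
  done
  echo "Total: $total / $(( count * 5 ))"
}

# Made-up scores, for illustration only.
score_report context=3 tools=4 evaluation=5 feedback=2 docs=4
```

With these example inputs, the `context` and `feedback` dimensions are listed and the total is 18 / 25. Each listed dimension would then get its own improvement entry in the template above.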
HARNESS_DIR="docs/superomni/harness-audits"
mkdir -p "$HARNESS_DIR"
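BRANCH=$(git branch --show-current 2>/dev/null | tr '/' '-')
# The pipeline's exit status is tr's, so a "|| echo main" fallback would never run; default explicitly instead
BRANCH=${BRANCH:-main}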
TIMESTAMP=$(date +%Y-%m-%d-%H%M%S)
REPORT_FILE="$HARNESS_DIR/harness-audit-${BRANCH}-${TIMESTAMP}.md"
echo "Saving harness audit to $REPORT_FILE"
Save the full audit report including all scores, findings, and improvement backlog.
HARNESS AUDIT REPORT
════════════════════════════════════════
Branch: [branch]
Date: [date]
Skills / Agents: [N] skills, [N] agents, [N] commands
Preamble size: [N] lines ([OK / BLOATED])
Validation: [PASS / FAIL]
Health score: [N]/25 ([rating])
Top finding: [single most important issue]
P0 improvements: [N]
P1 improvements: [N]
P2 improvements: [N]
Report saved: [docs/superomni/harness-audits/...]
Status: DONE | DONE_WITH_CONCERNS | BLOCKED
════════════════════════════════════════