.claude/skills/workflow/agent-harness-construction/SKILL.md
Framework for designing quality agents with proper action space and contracts
npx skillsauth add andrem-sec/psc-comet agent-harness-construction
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Framework for designing quality agents. Defines contracts, budgets, and measurement before implementation.
Without systematic agent design, agents:
Agent harness construction ensures agents are well-specified before deployment.
Defines: What tools can this agent use? At what granularity?
Granularity: Micro (single file/command, high-risk), Medium (edit/read loops, standard dev), Macro (Task/orchestration, complex workflows).
Tool Allowlist Pattern:
tools:
- Read
- Grep
- Glob
disallowedTools:
- Write
- Edit
- Bash
Rule: Start restrictive, expand only when justified. Removing permissions later breaks existing workflows.
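The allowlist above can be enforced with a default-deny check. The following is a minimal sketch, not the harness's actual enforcement logic: `ToolPolicy` and `permits` are hypothetical names introduced for illustration, and the deny-wins rule is an assumption about how the two lists interact.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolPolicy:
    """Hypothetical policy object mirroring the tools / disallowedTools lists."""
    allowed: frozenset = field(default_factory=frozenset)
    disallowed: frozenset = field(default_factory=frozenset)

    def permits(self, tool: str) -> bool:
        # Assumption: an explicit deny wins over an allow,
        # and any tool that is not listed is denied by default.
        if tool in self.disallowed:
            return False
        return tool in self.allowed


# Read-only policy matching the allowlist pattern above.
READ_ONLY = ToolPolicy(
    allowed=frozenset({"Read", "Grep", "Glob"}),
    disallowed=frozenset({"Write", "Edit", "Bash"}),
)

print(READ_ONLY.permits("Grep"))  # True
print(READ_ONLY.permits("Bash"))  # False
print(READ_ONLY.permits("Task"))  # False (never listed, so denied)
```

Denying unlisted tools by default keeps the "start restrictive" rule in force even when a new tool is introduced later.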
Defines: What fields must every tool response include?
Required Fields:
status: SUCCESS | PARTIAL | FAILURE
summary: One-line description of what happened
next_actions: Array of suggested follow-ups
artifacts: Paths to files created/modified
Why: Enables reliable parsing, orchestrator coordination, and chaining without re-planning.
Anti-pattern: raw output with no structure.
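One way to make the four required fields checkable is to validate every tool response before it reaches the orchestrator. This is a sketch under assumptions: `Observation`, `Status`, and `parse_observation` are hypothetical names, and the response is assumed to arrive as a plain dictionary.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(str, Enum):
    SUCCESS = "SUCCESS"
    PARTIAL = "PARTIAL"
    FAILURE = "FAILURE"


@dataclass
class Observation:
    """Hypothetical container for the four required response fields."""
    status: Status
    summary: str
    next_actions: list = field(default_factory=list)
    artifacts: list = field(default_factory=list)


REQUIRED = ("status", "summary", "next_actions", "artifacts")


def parse_observation(raw: dict) -> Observation:
    # Reject responses that omit any required field.
    missing = [key for key in REQUIRED if key not in raw]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    summary = str(raw["summary"]).strip()
    if "\n" in summary:
        raise ValueError("summary must be a single line")
    return Observation(
        status=Status(raw["status"]),  # raises ValueError on an unknown status
        summary=summary,
        next_actions=list(raw["next_actions"]),
        artifacts=list(raw["artifacts"]),
    )


obs = parse_observation({
    "status": "PARTIAL",
    "summary": "Reviewed 3 of 5 files",
    "next_actions": ["Review remaining files"],
    "artifacts": [],
})
print(obs.status.value, "-", obs.summary)  # PARTIAL - Reviewed 3 of 5 files
```

Failing loudly on a malformed response is a design choice here; a harness could instead downgrade the response to FAILURE and continue.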
Defines: How does the agent handle errors?
Strategies: Retry with backoff (transient errors, max 3), escalate to human (ambiguous/security, max 2 auto-attempts), graceful degradation (optional features unavailable), circuit breaker (3 identical failures = stop).
Recovery Contract Template:
recovery:
  transient_errors:
    max_retries: 3
    backoff: [1, 5, 15]
  ambiguous_errors:
    escalate_after: 2
  circuit_breaker:
    identical_failures: 3
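The retry and circuit-breaker parts of this template could be applied roughly as follows. This is a simplified sketch: `run_with_recovery` and `CircuitOpen` are hypothetical names, every exception is treated as transient, the backoff values are assumed to be seconds, and escalation to a human is left out.

```python
import time
from typing import Callable, Sequence


class CircuitOpen(Exception):
    """Raised when the same failure repeats too many times in a row."""


def run_with_recovery(
    operation: Callable[[], object],
    backoff: Sequence[float] = (1, 5, 15),
    identical_failures: int = 3,
    sleep: Callable[[float], None] = time.sleep,
):
    last_error = None
    streak = 0
    # One initial attempt plus one retry per backoff entry (max_retries: 3).
    for attempt in range(len(backoff) + 1):
        try:
            return operation()
        except Exception as exc:
            # Two failures count as identical when type and message match.
            signature = (type(exc).__name__, str(exc))
            streak = streak + 1 if signature == last_error else 1
            last_error = signature
            if streak >= identical_failures:
                raise CircuitOpen(f"{streak} identical failures: {exc}") from exc
            if attempt == len(backoff):
                raise  # retries exhausted, surface the last error
            sleep(backoff[attempt])


# Usage with a fake sleep so the example runs instantly.
delays = []
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("upstream timed out")
    return "ok"


print(run_with_recovery(flaky, sleep=delays.append), delays)  # ok [1, 5]
```

With the default values, three identical failures trip the breaker before the third retry is spent, so the full retry budget is only used when the errors differ.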
Defines: Maximum tokens this agent can consume.
Budget Calculation:
Agent prompt: ~2,000 tokens (instructions, examples)
Tools: ~500 tokens per tool schema × N tools
Working context: File reads, conversation history
Output: Agent responses
Example:
Budget Limits:
| Agent Type | Token Budget | Use Case |
|------------|--------------|----------|
| Micro (researcher) | 10K-20K | Quick searches, single-file analysis |
| Standard (planner, code-reviewer) | 20K-50K | Multi-file review, planning |
| Complex (orchestrator, architect) | 50K-100K | System-wide analysis, coordination |
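The budget calculation and the tier limits can be combined into a quick estimate. The sketch below uses the per-component figures stated above; the function names are hypothetical, and the working-context and output numbers in the usage line are made-up inputs, not measurements.

```python
PROMPT_TOKENS = 2_000         # ~2,000 tokens for instructions and examples
TOKENS_PER_TOOL_SCHEMA = 500  # ~500 tokens per tool schema

# Upper bounds of each tier from the Budget Limits table, in tokens.
TIER_LIMITS = [
    ("Micro", 20_000),
    ("Standard", 50_000),
    ("Complex", 100_000),
]


def estimate_budget(num_tools: int, working_context: int, output: int) -> int:
    """Sum the four budget components into a total token estimate."""
    return (
        PROMPT_TOKENS
        + TOKENS_PER_TOOL_SCHEMA * num_tools
        + working_context
        + output
    )


def smallest_tier(total_tokens: int):
    """Return the smallest tier whose upper bound covers the estimate."""
    for name, limit in TIER_LIMITS:
        if total_tokens <= limit:
            return name
    return None  # exceeds even the Complex budget


# Hypothetical agent: 3 tools, 8,000 tokens of file reads, 2,000 of output.
total = estimate_budget(num_tools=3, working_context=8_000, output=2_000)
print(total, smallest_tier(total))  # 13500 Micro
```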
If budget exceeded:
Best For: Exploratory tasks, unclear solution paths, research
Pattern:
Pros: Flexible, handles ambiguity, self-correcting
Cons: Higher token cost (reasoning overhead), slower
Best For: Well-defined tasks, repetitive operations, production workloads
Pattern:
Pros: Fast, predictable, low token cost
Cons: Rigid, fails on ambiguous inputs
Best For: Most agent tasks (recommended default)
Pattern:
Pros: Flexible planning, efficient execution
Cons: Slightly higher complexity
Track these metrics for every agent:
Completion Rate:
Retries Per Task:
pass@1 / pass@3:
Cost Per Successful Task:
Example Metrics Dashboard:
Agent: code-reviewer
Period: Last 30 days
Completion Rate: 88% (44/50 tasks)
Retries Per Task: 1.2
pass@1: 72%
pass@3: 92%
Cost Per Task: $0.08 avg
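A dashboard like this could be derived from per-task records. The sketch below rests on assumed definitions, since none are given here: pass@k is taken to mean the share of tasks that succeeded within k attempts, and cost per successful task divides total spend (failures included) by the number of successes. `TaskRun` and `summarize` are hypothetical names, and the sample records are invented.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    attempts: int    # total attempts made (1 means no retries)
    succeeded: bool  # whether the task eventually succeeded
    cost_usd: float  # total spend across all attempts


def summarize(runs: list) -> dict:
    if not runs:
        raise ValueError("no runs to summarize")
    total = len(runs)
    wins = [r for r in runs if r.succeeded]
    spend = sum(r.cost_usd for r in runs)
    return {
        "completion_rate": len(wins) / total,
        "retries_per_task": sum(r.attempts - 1 for r in runs) / total,
        # Assumed definition: succeeded within k attempts.
        "pass@1": sum(r.attempts == 1 for r in wins) / total,
        "pass@3": sum(r.attempts <= 3 for r in wins) / total,
        # Failed runs still cost money, so they stay in the numerator.
        "cost_per_successful_task": spend / len(wins) if wins else float("inf"),
    }


# Invented sample data for illustration only.
runs = [
    TaskRun(attempts=1, succeeded=True, cost_usd=0.05),
    TaskRun(attempts=2, succeeded=True, cost_usd=0.10),
    TaskRun(attempts=4, succeeded=True, cost_usd=0.20),
    TaskRun(attempts=3, succeeded=False, cost_usd=0.15),
]
metrics = summarize(runs)
print(metrics["completion_rate"], metrics["pass@1"], metrics["pass@3"])  # 0.75 0.25 0.5
```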
Overpowered agents: Agent has Write, Edit, Bash, Task access when it only needs Read + Grep. Start restrictive.
No observation contract: Tool responses are raw text. Downstream parsing is brittle and breaks on edge cases.
Unlimited retries: Agent retries failed operation 20 times. Use circuit breaker (3 identical failures = stop).
No context budget: Agent consumes 200K tokens on simple task. Budget forces efficiency.
Missing metrics: Can't tell if agent is improving or degrading over time. Track pass@1, cost, completion rate.