andrem-sec/agent-harness-construction

Name: agent-harness-construction
Author: andrem-sec

.claude/skills/workflow/agent-harness-construction/SKILL.md

npx skillsauth add andrem-sec/psc-comet agent-harness-construction

Agent Harness Construction Skill

Framework for designing quality agents. Defines contracts, budgets, and measurement before implementation.

What Claude Gets Wrong Without This Skill

Without systematic agent design, agents:

Have unclear tool boundaries (too permissive or too restrictive)
Return inconsistent response formats (breaks downstream parsing)
Fail silently on errors (no recovery strategy)
Consume unbounded context (budget overruns)
Lack measurable success criteria (can't tell if agent is improving)

Agent harness construction ensures agents are well-specified before deployment.

Four Quality Dimensions

1. Action Space

Defines: What tools can this agent use? At what granularity?

Granularity: Micro (single file/command, high-risk), Medium (edit/read loops, standard dev), Macro (Task/orchestration, complex workflows).

Tool Allowlist Pattern:

tools:
  - Read
  - Grep
  - Glob
disallowedTools:
  - Write
  - Edit
  - Bash

Rule: Start restrictive, expand only when justified. Removing permissions later breaks existing workflows.
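One way a harness might enforce the allowlist above, sketched in Python. The `ActionSpace` class and its `is_allowed` method are hypothetical names introduced for illustration; they are not part of Claude Code or of this skill. The sketch assumes deny rules take precedence and that any tool not explicitly listed is rejected, which matches the start-restrictive rule.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ActionSpace:
    """Hypothetical container for an agent's tool permissions."""

    tools: frozenset = field(default_factory=frozenset)
    disallowed_tools: frozenset = field(default_factory=frozenset)

    def is_allowed(self, tool: str) -> bool:
        # Deny wins over allow; unlisted tools are rejected by default.
        if tool in self.disallowed_tools:
            return False
        return tool in self.tools


# Mirrors the read-only allowlist shown above.
read_only = ActionSpace(
    tools=frozenset({"Read", "Grep", "Glob"}),
    disallowed_tools=frozenset({"Write", "Edit", "Bash"}),
)

for name in ("Grep", "Bash", "Task"):
    print(name, read_only.is_allowed(name))
```

Under these assumptions `Task` is rejected even though it is not in `disallowedTools`, because it was never added to `tools`.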

2. Observation Contract

Defines: What fields must every tool response include?

Required Fields:

status: SUCCESS | PARTIAL | FAILURE
summary: One-line description of what happened
next_actions: Array of suggested follow-ups
artifacts: Paths to files created/modified

Why: Enables reliable parsing, orchestrator coordination, and chaining without re-planning. Anti-pattern: raw output with no structure.
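The required fields could be checked at the harness boundary along these lines. This is a minimal sketch, assuming tool responses arrive as plain dictionaries; `Observation`, `Status`, and `parse_observation` are hypothetical names, not an API defined by the skill.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(str, Enum):
    SUCCESS = "SUCCESS"
    PARTIAL = "PARTIAL"
    FAILURE = "FAILURE"


@dataclass
class Observation:
    status: Status
    summary: str
    next_actions: list = field(default_factory=list)
    artifacts: list = field(default_factory=list)


REQUIRED_FIELDS = ("status", "summary", "next_actions", "artifacts")


def parse_observation(raw: dict) -> Observation:
    """Validate a raw tool response (assumed dict shape) against the contract."""
    missing = [name for name in REQUIRED_FIELDS if name not in raw]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if "\n" in raw["summary"]:
        raise ValueError("summary must be a single line")
    return Observation(
        status=Status(raw["status"]),  # raises ValueError on an unknown status
        summary=raw["summary"],
        next_actions=list(raw["next_actions"]),
        artifacts=list(raw["artifacts"]),
    )


obs = parse_observation(
    {
        "status": "PARTIAL",
        "summary": "Edited 2 of 3 files",
        "next_actions": ["retry c.ts"],
        "artifacts": ["a.ts", "b.ts"],
    }
)
```

Rejecting malformed responses early keeps the raw-output anti-pattern from leaking into downstream parsing.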

3. Recovery Contract

Defines: How does the agent handle errors?

Strategies: Retry with backoff (transient errors, max 3), escalate to human (ambiguous/security, max 2 auto-attempts), graceful degradation (optional features unavailable), circuit breaker (3 identical failures = stop).

Recovery Contract Template:

recovery:
  transient_errors:
    max_retries: 3
    backoff: [1, 5, 15]
  ambiguous_errors:
    escalate_after: 2
  circuit_breaker:
    identical_failures: 3
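The retry and circuit-breaker parts of this template might be applied as follows. This is a sketch under stated assumptions: the backoff values are treated as seconds (the template does not give a unit), "identical" is taken to mean the same error message on consecutive attempts, and `TransientError`, `CircuitOpen`, and `run_with_recovery` are hypothetical names. Escalation to a human is not modelled.

```python
import time
from typing import Callable, Optional, Sequence


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, rate limits)."""


class CircuitOpen(Exception):
    """Raised when the same failure repeats too many times in a row."""


def run_with_recovery(
    operation: Callable[[], str],
    backoff: Sequence[float] = (1, 5, 15),  # assumed to be seconds
    identical_failures: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry transient errors with backoff; stop on repeated identical failures."""
    last_message: Optional[str] = None
    streak = 0
    # One initial attempt plus one retry per backoff entry (max_retries: 3).
    for attempt in range(len(backoff) + 1):
        try:
            return operation()
        except TransientError as exc:
            message = str(exc)
            streak = streak + 1 if message == last_message else 1
            last_message = message
            if streak >= identical_failures:
                raise CircuitOpen(message) from exc
            if attempt == len(backoff):
                raise  # retries exhausted
            sleep(backoff[attempt])
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter lets tests replace the real delay with a recorder, so the backoff schedule can be checked without waiting.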

4. Context Budget

Defines: Maximum tokens this agent can consume.

Budget Calculation:

Agent prompt: ~2,000 tokens (instructions, examples)
Tools: ~500 tokens per tool schema × N tools
Working context: File reads, conversation history
Output: Agent responses

Example:

Agent with 10 tools: 2,000 + (500 × 10) = 7,000 base tokens
50 turns × 500 tokens/turn = 25,000 tokens working context
Total: 32,000 tokens (~$0.10 per agent session at Sonnet pricing)
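The arithmetic in this example can be reproduced with a few lines of Python. The function name `estimate_budget` is a hypothetical helper; the per-prompt and per-tool figures are the rough estimates quoted above, not measured values.

```python
PROMPT_TOKENS = 2_000    # agent prompt estimate from the text
TOKENS_PER_TOOL = 500    # per tool schema estimate from the text


def estimate_budget(n_tools: int, turns: int, tokens_per_turn: int) -> dict:
    """Return base, working, and total token estimates for one session."""
    base = PROMPT_TOKENS + TOKENS_PER_TOOL * n_tools
    working = turns * tokens_per_turn
    return {"base": base, "working": working, "total": base + working}


estimate = estimate_budget(n_tools=10, turns=50, tokens_per_turn=500)
print(estimate)  # {'base': 7000, 'working': 25000, 'total': 32000}
```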

Budget Limits:

| Agent Type | Token Budget | Use Case |
|------------|--------------|----------|
| Micro (researcher) | 10K-20K | Quick searches, single-file analysis |
| Standard (planner, code-reviewer) | 20K-50K | Multi-file review, planning |
| Complex (orchestrator, architect) | 50K-100K | System-wide analysis, coordination |
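A harness could compare running usage against the upper bound of each tier. The sketch below assumes the upper end of each range is the hard limit; the dictionary keys and the `over_budget` function are hypothetical.

```python
# Upper token limits per agent type, taken from the Budget Limits table.
BUDGET_LIMITS = {"micro": 20_000, "standard": 50_000, "complex": 100_000}


def over_budget(agent_type: str, tokens_used: int) -> bool:
    """Return True once usage exceeds the tier's upper limit."""
    return tokens_used > BUDGET_LIMITS[agent_type]


# The 32,000-token session from the example fits a standard agent, not a micro one.
print(over_budget("standard", 32_000), over_budget("micro", 32_000))
```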

If budget exceeded:

Compact mid-session (strategic-compact skill)
Split into multiple agents (orchestrator pattern)
Reduce tool schema size (use macro-tools)

Three Architecture Patterns

ReAct (Reasoning and Acting)

Best For: Exploratory tasks, unclear solution paths, research

Pattern:

Reason: "I need to find the authentication logic"
Act: Grep for "auth" across codebase
Observe: Found in src/auth.ts
Reason: "Now I should read that file"
Act: Read src/auth.ts ...

Pros: Flexible, handles ambiguity, self-correcting
Cons: Higher token cost (reasoning overhead), slower
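The reason, act, observe cycle can be sketched as a small loop. Here the model is replaced by a scripted `policy` function so the example runs on its own; `react_loop`, `scripted_policy`, and the stub tools are all hypothetical stand-ins, not a real agent runtime.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A step is (thought, tool_name, tool_input); tool_name None means "finished".
Step = Tuple[str, Optional[str], str]


def react_loop(
    policy: Callable[[List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_turns: int = 10,
) -> List[str]:
    """Alternate reason -> act -> observe until the policy stops or turns run out."""
    transcript: List[str] = []
    for _ in range(max_turns):
        thought, tool_name, tool_input = policy(transcript)
        transcript.append(f"Reason: {thought}")
        if tool_name is None:
            break
        transcript.append(f"Act: {tool_name} {tool_input}")
        observation = tools[tool_name](tool_input)
        transcript.append(f"Observe: {observation}")
    return transcript


def scripted_policy(transcript: List[str]) -> Step:
    # Stands in for the model: picks the next step from what has been seen so far.
    if not transcript:
        return ("I need to find the authentication logic", "Grep", "auth")
    if len(transcript) == 3:
        return ("Now I should read that file", "Read", "src/auth.ts")
    return ("I have what I need", None, "")


stub_tools = {
    "Grep": lambda query: "Found in src/auth.ts",
    "Read": lambda path: f"(contents of {path})",
}

transcript = react_loop(scripted_policy, stub_tools)
print("\n".join(transcript))
```

The `max_turns` cap is what keeps the token overhead of this pattern bounded.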

Function-Calling (Structured Deterministic)

Best For: Well-defined tasks, repetitive operations, production workloads

Pattern:

User: "Add user validation"
Agent: [calls validate_user_input(field="email")]
System: Returns validation code
Done

Pros: Fast, predictable, low token cost
Cons: Rigid, fails on ambiguous inputs

Hybrid (ReAct Planning + Function Execution)

Best For: Most agent tasks (recommended default)

Pattern:

Reason (ReAct): "Task requires editing 3 files in sequence"
Plan: [edit_file("a.ts"), edit_file("b.ts"), edit_file("c.ts")]
Execute (Function-Calling): Run plan with typed tool calls
Observe: All edits succeeded
Reason: Verify tests pass
Execute: run_tests()

Pros: Flexible planning, efficient execution
Cons: Slightly higher complexity
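A rough Python sketch of the hybrid flow: a planning step produces a list of typed calls once, and a deterministic executor runs them in order. `plan_edits` stands in for the ReAct reasoning step, and `execute_plan`, `edit_file`, and `run_tests` are hypothetical stubs rather than real tools.

```python
from typing import Callable, Dict, List, Tuple

Call = Tuple[str, str]  # (tool_name, argument)


def plan_edits(files: List[str]) -> List[Call]:
    # Stands in for the reasoning step that decides which files to edit.
    return [("edit_file", path) for path in files]


def execute_plan(plan: List[Call], tools: Dict[str, Callable[[str], bool]]) -> List[Call]:
    """Run each planned call in order; return the calls that failed."""
    failures = []
    for tool_name, argument in plan:
        if not tools[tool_name](argument):
            failures.append((tool_name, argument))
    return failures


edited: List[str] = []


def edit_file(path: str) -> bool:
    edited.append(path)  # stub: record the edit instead of touching disk
    return True


def run_tests(_: str) -> bool:
    return True  # stub: pretend the test suite passed


tools = {"edit_file": edit_file, "run_tests": run_tests}

plan = plan_edits(["a.ts", "b.ts", "c.ts"])
failures = execute_plan(plan, tools)
# Verify only after every planned edit has succeeded.
tests_ok = tools["run_tests"]("") if not failures else False
print(edited, failures, tests_ok)
```

Separating the plan from its execution means the reasoning cost is paid once per task rather than once per tool call.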

Success Metrics

Track these metrics for every agent:

Completion Rate:

Tasks completed successfully / tasks attempted
Target: ≥85% for production agents

Retries Per Task:

Average retry attempts before success
Target: ≤1.5 retries per task

pass@1 / pass@3:

pass@1: Success on first attempt
pass@3: Success in at least one of three attempts
Targets: pass@1 ≥70%, pass@3 ≥90%

Cost Per Successful Task:

Total tokens consumed / successful completions
Track over time to detect regressions

Example Metrics Dashboard:

Agent: code-reviewer
Period: Last 30 days
Completion Rate: 88% (44/50 tasks)
Retries Per Task: 1.2
pass@1: 72%
pass@3: 92%
Cost Per Successful Task: $0.08 avg
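The metrics above could be computed from per-task attempt logs roughly as follows. This sketch assumes each task records the outcome of every attempt in order, counts a task as completed if any attempt succeeded, and reports cost in tokens. `TaskRecord` and `summarize` are hypothetical names, and the sample records are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskRecord:
    attempts: List[bool]  # outcome of each attempt, in order (at least one)
    tokens: int           # total tokens consumed across all attempts


def summarize(records: List[TaskRecord]) -> dict:
    """Compute completion rate, retries, pass@1, pass@3, and cost per success."""
    successes = [r for r in records if any(r.attempts)]
    if not successes:
        raise ValueError("need at least one successful task")
    n = len(records)
    # Retries before success = index of the first successful attempt.
    retries = [r.attempts.index(True) for r in successes]
    return {
        "completion_rate": len(successes) / n,
        "retries_per_task": sum(retries) / len(successes),
        "pass@1": sum(r.attempts[0] for r in records) / n,
        "pass@3": sum(any(r.attempts[:3]) for r in records) / n,
        "tokens_per_success": sum(r.tokens for r in records) / len(successes),
    }


# Illustrative sample data, not real measurements.
sample = [
    TaskRecord([True], 1_000),
    TaskRecord([True], 1_000),
    TaskRecord([False, True], 2_000),
    TaskRecord([False, False, False], 2_000),
]
metrics = summarize(sample)
print(metrics)
```

Tokens spent on failed tasks are included in the numerator of `tokens_per_success`, so a rising failure rate shows up as a cost regression.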

Anti-Patterns

Overpowered agents: Agent has Write, Edit, Bash, Task access when it only needs Read + Grep. Start restrictive.

No observation contract: Tool responses are raw text. Downstream parsing is brittle and breaks on edge cases.

Unlimited retries: The agent retries a failed operation 20 times. Use a circuit breaker (3 identical failures = stop).

No context budget: The agent consumes 200K tokens on a simple task. A budget forces efficiency.

Missing metrics: Can't tell if agent is improving or degrading over time. Track pass@1, cost, completion rate.

Mandatory Checklist

Verify action space defined with tool allowlist and granularity specified
Verify observation contract includes status, summary, next_actions, artifacts fields
Verify recovery contract specifies retry strategy, escalation conditions, circuit breaker threshold
Verify context budget calculated (agent prompt + tools + working context) and limits set
Verify architecture pattern selected (ReAct, function-calling, or Hybrid) with justification
Verify success metrics defined (completion rate, retries, pass@k, cost)
Verify metrics tracked over time to detect regressions
Verify circuit breaker configured (3 identical failures recommended)

Framework for designing quality agents with proper action space and contracts

1 star

development

Updated Apr 16, 2026

npx skillsauth add andrem-sec/psc-comet agent-harness-construction

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 24, 2026, 8:49 PM (1.7s, 1 file scanned)

SKILL.md

name: agent-harness-construction
description: Track completion rate, retries, pass@k, cost per task
version: 0.1.0
level: 3
- name: Define Success Metrics

Related Skills

andrem-sec/batch (data-ai)
Parallel agent swarm: decomposes work into independent units, spawns isolated workers, tracks PRs via fan-in.
1 · SKILL.md · Updated Apr 16, 2026

andrem-sec/animation-safe (testing)
Audit animations and transitions for motion accessibility, performance safety, and design intent. Enforces prefers-reduced-motion compliance and blocks layout-triggering transitions.
1 · SKILL.md · Updated Apr 16, 2026

andrem-sec/ai-regression-testing (testing)
Test specifically for AI-introduced regressions that repeat without tests.
1 · SKILL.md · Updated Apr 16, 2026

andrem-sec/agentic-engineering (development)
Framework for decomposing agent-driven tasks into independently verifiable units.
1 · SKILL.md · Updated Apr 16, 2026

Install manually

# Clone the repo
git clone https://github.com/andrem-sec/psc-comet.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r psc-comet/.claude/skills/workflow/agent-harness-construction ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

andrem-sec/psc-comet

1 star

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT