Harness Engineering

Harness engineering designs the control system around an agent: what it may edit, how it receives feedback, where it writes state, how failures recover, and who can approve irreversible actions. The harness is the difference between a helpful agent session and an autonomous loop that can run for days without corrupting its objective.

When to Activate

Activate this skill when:

Building autonomous research or experimentation loops
Designing an agent environment with locked metrics and editable code or content
Creating PR-producing or background agents
Evaluating whether an agent can safely run without frequent human prompts
Adding novelty, ablation, pruning, rollback, or durable logging to an agent workflow
Preventing agents from gaming benchmarks, weakening rubrics, or losing state across compaction

Do not activate this skill for adjacent work owned by other skills:

General quality gates, regression suites, or outcome metrics without autonomous control surfaces: evaluation.
Tool schemas, response formats, and recovery errors for harness tools: tool-design.
Project-level task-model fit, pipeline shape, and cost planning: project-development.
Remote sandbox, warm-pool, and hosted session infrastructure: hosted-agents.

Core Concepts

Harness Boundary

Separate the agent from the environment it operates inside. The agent proposes actions; the harness defines allowed surfaces, feedback, persistence, and promotion rules.

Use four surface classes:

| Surface | Examples | Rule |
| --- | --- | --- |
| Locked | Eval metric, rubric, validation script, merge policy | Agent may read and propose changes, but cannot score itself with modified rules |
| Editable | Skill draft, experiment file, prompt, config under test | Agent may mutate during the loop |
| Append-only | Results log, research thread, rejected ideas | Agent may append, not rewrite |
| Human-controlled | Merge, production deploy, credentials, destructive operations | Requires explicit human approval |
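The four surface classes can be enforced with a small path-based policy. The following is a minimal sketch, not a prescribed implementation: the directory patterns (`eval/*`, `drafts/*`, and so on), the operation names, and the choice to treat unknown paths as locked are all assumptions made for illustration.

```python
from enum import Enum
from fnmatch import fnmatch


class Surface(Enum):
    LOCKED = "locked"
    EDITABLE = "editable"
    APPEND_ONLY = "append-only"
    HUMAN_CONTROLLED = "human-controlled"


# Hypothetical path patterns; first match wins. Adapt to the real repository layout.
SURFACE_RULES = [
    ("eval/*", Surface.LOCKED),
    ("rubrics/*", Surface.LOCKED),
    ("logs/*", Surface.APPEND_ONLY),
    ("drafts/*", Surface.EDITABLE),
    ("deploy/*", Surface.HUMAN_CONTROLLED),
]

# Operations the agent may perform on each surface without human approval.
ALLOWED_OPS = {
    Surface.LOCKED: {"read"},
    Surface.EDITABLE: {"read", "write"},
    Surface.APPEND_ONLY: {"read", "append"},
    Surface.HUMAN_CONTROLLED: set(),
}


def classify(path: str) -> Surface:
    """Return the surface class for a path; unknown paths default to LOCKED."""
    for pattern, surface in SURFACE_RULES:
        if fnmatch(path, pattern):
            return surface
    return Surface.LOCKED


def is_allowed(path: str, op: str) -> bool:
    """True if the agent may perform `op` ("read", "write", "append") on `path`."""
    return op in ALLOWED_OPS[classify(path)]
```

Defaulting unknown paths to the locked class is a deliberately conservative choice: a surface the harness author forgot to classify stays read-only rather than silently editable.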

Tight Feedback Loops

Autonomy works when feedback is fast, unambiguous, and hard to game. Karpathy's autoresearch is the minimal pattern: one editable file, one locked evaluation file, fixed wall-clock budget, one scalar metric, git rollback, and a durable results log. The lesson is not that every harness needs one metric; it is that ambiguous feedback creates ambiguous autonomy.

For open-ended research-to-skill work, replace the scalar metric with locked rubrics, deterministic structure checks, source traceability, and human review thresholds.

Durable State

Long-running agents must externalize state. Store plans, source queues, results, failures, and handoffs in files so future agents can resume without relying on chat history. Prime Intellect's autonomous nanoGPT work showed the value of durable scratchpads and THREAD.md-style logs for recovery, monitoring, and audit.

Use append-only logs for:

What was tried
What improved or failed
Why a candidate was kept, discarded, or routed to review
Which upstream sources were checked
What the next agent should do
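One way to keep such a log is a JSONL file that is only ever opened in append mode. This is a sketch under assumptions: the field names (`attempt`, `outcome`, `reason`, `next_step`) and the function names are hypothetical, and append mode alone does not stop another process from rewriting the file, so file permissions or review still matter.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_result(log_path: Path, attempt: str, outcome: str, reason: str, next_step: str) -> dict:
    """Append one record to a JSONL log; the file is opened in append mode and never rewritten."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "attempt": attempt,
        "outcome": outcome,  # e.g. "kept", "discarded", "review"
        "reason": reason,
        "next_step": next_step,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def read_results(log_path: Path) -> list[dict]:
    """Load every record so a resuming agent can rebuild state without chat history."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

A resuming agent would call `read_results` first and treat the last record's `next_step` as its starting point.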

Search Discipline

Agents tend to exploit the nearest surface, stack complexity, and prune too rarely. Add explicit search rules:

Refresh upstream sources on a schedule.
Require novelty checks before spending large budgets.
Preserve rejected attempts to avoid rediscovery.
Run leave-one-out pruning when a stack has multiple additions.
Reward simplification when quality is equal.
Use separate verification before promotion.
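The leave-one-out pruning rule can be sketched as follows. The `evaluate` callback is hypothetical and assumed to return a higher-is-better score for a given stack of additions; in a real harness it would call the locked evaluator.

```python
from typing import Callable, Sequence


def leave_one_out_prune(
    additions: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
    tolerance: float = 0.0,
) -> list[str]:
    """Drop each addition whose removal keeps the score within `tolerance` of the best seen.

    `evaluate` is a hypothetical callback returning a higher-is-better score for a stack.
    """
    kept = list(additions)
    baseline = evaluate(kept)
    for item in list(additions):
        candidate = [a for a in kept if a != item]
        score = evaluate(candidate)
        if score >= baseline - tolerance:
            # Equal or better without it: prefer the simpler stack.
            kept = candidate
            # Track the best score so repeated removals cannot drift downward.
            baseline = max(baseline, score)
    return kept
```

Each removal is tested against the current pruned stack, so the result depends on the order of `additions`; this sketch makes no attempt to search over removal orders.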

Mechanism Registry

For research-to-skill systems, track accepted mechanisms separately from prose. A mechanism record should include a stable mechanism_id, owning_skill, status, activation scenario, behavior change, evidence, and failure modes. Novelty gates should compare against this registry before using broader corpus overlap, because keyword overlap catches stale phrasing while mechanism comparison catches real duplication.
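A mechanism record and a registry-first novelty gate might look like the sketch below. The field names follow the list above; the `novelty_gate` function, its return strings, and the `corpus_overlap` callback are hypothetical. Matching on a normalized `behavior_change` string is a deliberately simple stand-in for real mechanism comparison, which would likely need a reviewer or a stronger comparison method.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Mechanism:
    mechanism_id: str
    owning_skill: str
    status: str  # e.g. "accepted", "rejected", "proposed"
    activation_scenario: str
    behavior_change: str
    evidence: tuple[str, ...] = ()
    failure_modes: tuple[str, ...] = ()


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial rephrasing does not hide a match."""
    return " ".join(text.lower().split())


def novelty_gate(
    candidate: Mechanism,
    registry: list[Mechanism],
    corpus_overlap: Callable[[Mechanism], bool],
) -> str:
    """Check the registry first; run the broader corpus check only if no mechanism matches."""
    key = _normalize(candidate.behavior_change)
    for known in registry:
        if known.status == "accepted" and _normalize(known.behavior_change) == key:
            return f"duplicate of {known.mechanism_id}"
    # Hypothetical fallback: corpus_overlap returns True when prose overlap looks suspicious.
    return "needs review" if corpus_overlap(candidate) else "novel"
```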

Governance

Autonomous agents may prepare PRs, but governance must be explicit. They can draft changes, run checks, and write PR summaries. They should not merge, deploy, or push without human approval unless the user has explicitly granted that permission for the specific action.
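The "specific action" requirement can be made concrete with grants that are consumed on use. This is a minimal sketch: the `Governance` class, the action names, and the target strings are assumptions, and a real harness would also need to authenticate who issued the grant.

```python
from dataclasses import dataclass, field

# Actions the text above says need human approval; the string names are illustrative.
GATED_ACTIONS = {"merge", "deploy", "push"}


class ApprovalRequired(Exception):
    """Raised when the agent attempts a gated action without a matching grant."""


@dataclass
class Governance:
    # Each grant is an (action, target) pair approved by a human for that specific action.
    grants: set[tuple[str, str]] = field(default_factory=set)

    def grant(self, action: str, target: str) -> None:
        self.grants.add((action, target))

    def authorize(self, action: str, target: str) -> None:
        """Allow ungated actions; require an exact, one-time grant for gated ones."""
        if action not in GATED_ACTIONS:
            return
        if (action, target) not in self.grants:
            raise ApprovalRequired(f"{action} on {target} needs human approval")
        # A grant covers one action on one target, not a class of actions.
        self.grants.discard((action, target))
```

Consuming the grant after one use keeps an earlier approval from silently authorizing a later, different merge or deploy.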

Detailed Topics

Autoresearch-Style Loop

Use this pattern when optimizing an artifact against a stable evaluator:

read locked context -> choose hypothesis -> edit allowed surface -> commit/checkpoint
-> run evaluator -> log result -> keep if better -> discard or rollback if worse
-> repeat

Required properties:

The evaluator is outside the editable surface.
The feedback cadence is fixed enough to compare attempts.
Failed attempts leave an audit trail.
Rollback is cheap.
The agent has a policy for crashes and timeouts.
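The loop and its required properties can be sketched as below. The callbacks `apply_edit`, `evaluate`, and `rollback` are hypothetical stand-ins for editing the allowed surface, running the locked evaluator, and restoring the last checkpoint (for example with git). The sketch treats any evaluator exception as a crash and leaves timeout handling out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    hypothesis: str
    score: Optional[float]  # None when the evaluator crashed
    status: str             # "keep", "discard", or "crash"


def run_loop(
    hypotheses: list[str],
    apply_edit: Callable[[str], None],
    evaluate: Callable[[], float],
    rollback: Callable[[], None],
    baseline: float,
) -> tuple[float, list[Attempt]]:
    """Try each hypothesis once; keep improvements, roll back everything else."""
    best = baseline
    log: list[Attempt] = []
    for hypothesis in hypotheses:
        apply_edit(hypothesis)  # edit the allowed surface and checkpoint
        try:
            score = evaluate()  # locked evaluator, outside the editable surface
        except Exception:
            rollback()
            log.append(Attempt(hypothesis, None, "crash"))
            continue
        if score > best:
            best = score
            log.append(Attempt(hypothesis, score, "keep"))
        else:
            rollback()
            log.append(Attempt(hypothesis, score, "discard"))
    return best, log
```

Every branch appends to `log`, so discarded and crashed attempts leave the same audit trail as kept ones.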

Research-To-Skill Loop

Use this pattern when sources become skill changes:

discover -> retrieve -> gate -> score -> extract mechanism
-> map to existing or new skill -> draft proposal -> validate structure
-> prepare PR -> human review

The locked evaluator is a combination of source rubrics, skill-change rubrics, structure checks, and reviewer approval. The editable artifact is the proposed skill delta.

Metric Gaming Resistance

Assume an optimizing agent will learn the harness. Guard against:

Editing evaluation code or rubrics and then using the new version for self-approval
Adding verbose content that pleases a judge but harms skill activation
Citing unretrieved sources
Optimizing aggregate scores while failing a critical dimension
Omitting failed results from the log

Mitigation: lock rubrics per run, report per-dimension scores, require source retrieval evidence, preserve rejected attempts, and route governance changes to human review.
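Two of these mitigations, locking the rubric per run and reporting per-dimension scores, can be combined in one promotion gate. The following is a sketch under assumptions: the rubric is treated as plain text fingerprinted with SHA-256 at run start, and the dimension names and minimums are supplied by the harness author.

```python
import hashlib


def fingerprint(rubric_text: str) -> str:
    """SHA-256 of the rubric content, recorded when the run starts."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()


def gate(
    scores: dict[str, float],
    minimums: dict[str, float],
    rubric_text: str,
    locked_fingerprint: str,
) -> tuple[bool, list[str]]:
    """Pass only if the rubric is unchanged and every dimension meets its minimum."""
    if fingerprint(rubric_text) != locked_fingerprint:
        return False, ["rubric changed during run"]
    # A missing dimension counts as 0.0 so it cannot be skipped silently.
    failures = [
        f"{name}: {scores.get(name, 0.0)} < {floor}"
        for name, floor in minimums.items()
        if scores.get(name, 0.0) < floor
    ]
    return not failures, failures
```

Because the gate checks each dimension against its own floor, a high aggregate score cannot hide a failure on a critical dimension.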

Monitoring Agents

Use monitoring agents for long runs, but restrict them to read-only reporting unless explicitly tasked otherwise. Monitoring output should report:

Best current candidate
Active jobs or drafts
Last upstream refresh
Failed or stale loops
Disagreements between logs and claimed state
Next action and blocker
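Checking for disagreements between logs and claimed state can be a read-only comparison. This sketch assumes hypothetical record fields (`attempt`, `outcome`, `score`) in the results log and a claimed-state dictionary with `best_score` and `completed` keys.

```python
def find_disagreements(claimed: dict, log_rows: list[dict]) -> list[str]:
    """Compare a claimed state summary with what the results log supports."""
    issues = []
    kept = [row for row in log_rows if row.get("outcome") == "kept"]
    logged_best = max((row["score"] for row in kept), default=None)
    if claimed.get("best_score") != logged_best:
        issues.append(
            f"claimed best_score {claimed.get('best_score')} but log shows {logged_best}"
        )
    logged_ids = {row["attempt"] for row in log_rows}
    for attempt in claimed.get("completed", []):
        if attempt not in logged_ids:
            issues.append(f"attempt {attempt} claimed complete but missing from log")
    return issues
```

The function only reads its inputs and returns messages, which matches the read-only reporting role described above.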

Practical Guidance

Harness Design Checklist

Define the objective in one sentence.
Identify locked, editable, append-only, and human-controlled surfaces.
Choose the feedback mechanism: scalar metric, rubric, deterministic tests, human review, or combination.
Define keep, discard, crash, timeout, and review states.
Create a durable thread log before the loop starts.
Add source refresh, mechanism-registry novelty, and pruning rules for long-running loops.
Define what the agent may do without asking and what requires approval.
Validate the harness on one known good and one known bad artifact.
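The last checklist item can be run as a quick self-test before the loop starts. This is a minimal sketch: `evaluate` is a hypothetical callback wrapping the locked evaluator, and a single pass threshold is assumed.

```python
from typing import Any, Callable


def validate_harness(
    evaluate: Callable[[Any], float],
    known_good: Any,
    known_bad: Any,
    threshold: float,
) -> list[str]:
    """Return problems found when the evaluator is run on reference artifacts."""
    problems = []
    if evaluate(known_good) < threshold:
        problems.append("known-good artifact rejected")
    if evaluate(known_bad) >= threshold:
        problems.append("known-bad artifact accepted")
    return problems
```

An empty result means the evaluator separates the two reference artifacts; any entry suggests fixing the harness before granting autonomy.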

File Layout

research-run/
  THREAD.md
  sources/
    queue.md
    evaluations/
  proposals/
  logs/
    results.tsv
    rejected.md
  drafts/

Use TSV or JSONL for append-only machine-readable logs. Use Markdown for handoffs and reviewer-facing summaries.

Examples

Example 1: Locked metric

An agent optimizes train.py, but prepare.py owns data loading and evaluation. The agent can edit the model but cannot change the metric. Failed experiments are logged and rolled back.

Example 2: Locked rubric

An agent evaluates a new Anthropic or OpenAI engineering post, but the source curation rubric is locked for the run. If the source passes, the agent drafts a skill proposal. It cannot lower the rubric threshold to admit the source.

Example 3: Auto-PR without auto-merge

An agent prepares a branch and PR body after passing source, skill, and structure checks. The PR states unresolved risks and waits for human merge approval.

Guidelines

Lock evaluators before starting the loop.
Keep editable surfaces narrow enough for reliable diffs.
Write durable logs before context compaction can erase state.
Report per-dimension scores instead of only aggregate scores.
Require source retrieval before citation.
Add novelty gates for broad search and pruning gates for complex stacks.
Prefer simplification when quality is equal.
Separate PR preparation from merge authority.
Revalidate harness changes with old and new evaluators.
Treat stopped autonomous loops as harness failures, not agent personality quirks.

Gotchas

Mutable evaluator: If the agent can edit the metric, it may optimize the benchmark instead of the task. Keep rubrics and eval code locked during the run.
Chat-only memory: Long runs fail after compaction when plans live only in conversation history. Write thread logs and result files from the start.
No discard record: Without rejected-attempt logs, agents repeat failed ideas. Preserve failures with enough detail to avoid rediscovery.
Complexity accretion: Agents stack changes and rarely remove them. Require pruning rounds and reward equal-quality simplification.
Premature novelty claims: Agents label recombinations as novel. Compare against existing repo skills, source queue, and rejected logs before claiming novelty.
Monitor misreporting: Monitoring agents can summarize stale or inconsistent state. Require them to cite the files or logs behind claims.
Human approval ambiguity: "Prepare a PR" is not "merge a PR." Make approval boundaries explicit in the harness.
Volatile source drift: Fast-moving lab claims age quickly. Put dated evidence in references and schedule revalidation.

Integration

This skill connects to:

evaluation - Rubrics and quality gates provide the locked feedback surface
advanced-evaluation - Pairwise comparison and bias mitigation improve proposal review
filesystem-context - Durable logs, scratchpads, and thread files preserve state
multi-agent-patterns - Researcher, verifier, monitor, and writer agents need isolated contexts
tool-design - Harness tools must expose clear contracts and recovery errors
project-development - File-based pipelines and task-model fit analysis keep loops simple
hosted-agents - Background execution needs sandbox, snapshot, and approval boundaries

References

Internal references:

researcher/README.md - Read when implementing the repo-native research-to-skill operating system
researcher/rubrics/harness-change.md - Read when evaluating changes to an agent harness
researcher/runbooks/autonomous-research-loop.md - Read when running a source-to-skill loop

External resources:

Karpathy autoresearch - Constrained autonomous experiment loop with locked evaluation
Prime Intellect autonomous nanoGPT speedrun - Durable scratchpads, handoffs, monitoring, and autonomy failure modes
AlphaEvolve and FunSearch - LLM-generated candidates paired with systematic evaluators
HELM and LM Evaluation Harness - Transparent, reproducible evaluation infrastructure

Skill Metadata

Created: 2026-05-14
Last Updated: 2026-05-15
Author: Agent Skills for Context Engineering Contributors
Version: 1.1.0
