skills/harness-engineering/SKILL.md
---
name: harness-engineering
description: This skill should be used when designing autonomous agent harnesses: research loops, evaluation scaffolds, locked and editable surfaces, durable logs, novelty gates, pruning, rollback, PR preparation, and human approval boundaries.
---

# Harness Engineering
npx skillsauth add shaneholloman/skills-context-engineering skills/harness-engineering

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Harness engineering designs the control system around an agent: what it may edit, how it receives feedback, where it writes state, how failures recover, and who can approve irreversible actions. The harness is the difference between a helpful agent session and an autonomous loop that can run for days without corrupting its objective.
Activate this skill when:
Do not activate this skill for adjacent work owned by other skills:

- evaluation
- tool-design
- project-development
- hosted-agents

Separate the agent from the environment it operates inside. The agent proposes actions; the harness defines allowed surfaces, feedback, persistence, and promotion rules.
Use four surface classes:
| Surface | Examples | Rule |
| --- | --- | --- |
| Locked | Eval metric, rubric, validation script, merge policy | Agent may read and propose changes, but cannot score itself with modified rules |
| Editable | Skill draft, experiment file, prompt, config under test | Agent may mutate during the loop |
| Append-only | Results log, research thread, rejected ideas | Agent may append, not rewrite |
| Human-controlled | Merge, production deploy, credentials, destructive operations | Requires explicit human approval |
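The four surface classes above can be enforced with a small path-based policy check. The following is a minimal sketch, not a definitive implementation: the path patterns in `SURFACE_RULES`, the operation names, and the choice to treat unknown paths as locked are all assumptions made for illustration.

```python
from enum import Enum
from fnmatch import fnmatch


class Surface(Enum):
    LOCKED = "locked"
    EDITABLE = "editable"
    APPEND_ONLY = "append-only"
    HUMAN_CONTROLLED = "human-controlled"


# Hypothetical path patterns; the first matching pattern wins.
SURFACE_RULES = [
    ("evaluations/*", Surface.LOCKED),
    ("logs/*", Surface.APPEND_ONLY),
    ("drafts/*", Surface.EDITABLE),
    (".github/*", Surface.HUMAN_CONTROLLED),
]

# Operations the agent may perform on its own for each surface class.
ALLOWED_OPS = {
    Surface.LOCKED: {"read"},
    Surface.EDITABLE: {"read", "write", "append"},
    Surface.APPEND_ONLY: {"read", "append"},
    Surface.HUMAN_CONTROLLED: set(),  # every operation needs human approval
}


def classify(path: str) -> Surface:
    """Map a path to its surface class; unknown paths default to LOCKED."""
    for pattern, surface in SURFACE_RULES:
        if fnmatch(path, pattern):
            return surface
    return Surface.LOCKED


def is_allowed(path: str, op: str) -> bool:
    """Return True if the agent may perform `op` on `path` without approval."""
    return op in ALLOWED_OPS[classify(path)]
```

Defaulting unknown paths to locked means a new directory is protected until someone deliberately classifies it.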
Autonomy works when feedback is fast, unambiguous, and hard to game. Karpathy's autoresearch is the minimal pattern: one editable file, one locked evaluation file, fixed wall-clock budget, one scalar metric, git rollback, and a durable results log. The lesson is not that every harness needs one metric; it is that ambiguous feedback creates ambiguous autonomy.
For open-ended research-to-skill work, replace the scalar metric with locked rubrics, deterministic structure checks, source traceability, and human review thresholds.
Long-running agents must externalize state. Store plans, source queues, results, failures, and handoffs in files so future agents can resume without relying on chat history. Prime Intellect's autonomous nanoGPT work showed the value of durable scratchpads and THREAD.md-style logs for recovery, monitoring, and audit.
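As a rough illustration of externalized state, the sketch below saves and reloads a run's working state from a JSON file so that a later agent can resume from disk rather than from chat history. The file layout and the keys (`plan`, `source_queue`, `completed`, `failures`) are assumptions for this example, not a format taken from the cited projects.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Write state atomically so a crash never leaves a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)  # atomic rename on the same filesystem


def load_state(path: Path) -> dict:
    """Return the saved state, or a fresh one if no run has started yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    # Hypothetical default shape for a new run.
    return {"plan": [], "source_queue": [], "completed": [], "failures": []}
```

A resuming agent would call `load_state` first, continue from the head of `source_queue`, and call `save_state` after each completed step.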
Use append-only logs for:
Agents tend to exploit the nearest surface, stack complexity, and under-run pruning. Add explicit search rules:
For research-to-skill systems, track accepted mechanisms separately from prose. A mechanism record should include a stable mechanism_id, owning_skill, status, activation scenario, behavior change, evidence, and failure modes. Novelty gates should compare against this registry before using broader corpus overlap, because keyword overlap catches stale phrasing while mechanism comparison catches real duplication.
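The mechanism record described above maps naturally onto a small data structure. This sketch uses the field names from the text; the `"accepted"` status value and the comparison in `find_duplicate` (normalized exact match on activation scenario and behavior change) are simplifying assumptions, and a real novelty gate would likely need a richer comparison.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MechanismRecord:
    mechanism_id: str
    owning_skill: str
    status: str
    activation_scenario: str
    behavior_change: str
    evidence: tuple[str, ...] = ()
    failure_modes: tuple[str, ...] = ()


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so phrasing noise is ignored."""
    return " ".join(text.lower().split())


def _key(rec: MechanismRecord) -> tuple[str, str]:
    return (_normalize(rec.activation_scenario), _normalize(rec.behavior_change))


def find_duplicate(
    candidate: MechanismRecord, registry: list[MechanismRecord]
) -> Optional[MechanismRecord]:
    """Return the accepted record that the candidate duplicates, if any."""
    for rec in registry:
        if rec.status == "accepted" and _key(rec) == _key(candidate):
            return rec
    return None
```

The gate compares what the mechanism does and when it activates, not the surrounding prose, so a reworded duplicate is still caught.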
Autonomous agents may prepare PRs, but governance must be explicit. They can draft changes, run checks, and write PR summaries. They should not merge, deploy, or push without human approval unless the user has explicitly granted that permission for the specific action.
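One way to make this boundary explicit is a default-deny authorization check. The sketch below is illustrative only: the action names and the `Governance` class are hypothetical, and a real harness would also record who granted each permission.

```python
from dataclasses import dataclass, field

# Hypothetical action names based on the paragraph above.
AUTONOMOUS_ACTIONS = {"draft_change", "run_checks", "write_pr_summary"}
GATED_ACTIONS = {"merge", "deploy", "push"}


@dataclass
class Governance:
    granted: set[str] = field(default_factory=set)

    def grant(self, action: str) -> None:
        """Record an explicit human grant for one specific gated action."""
        if action not in GATED_ACTIONS:
            raise ValueError(f"{action!r} is not a gated action")
        self.granted.add(action)

    def authorize(self, action: str) -> bool:
        """Allow autonomous actions, granted gated actions, and nothing else."""
        if action in AUTONOMOUS_ACTIONS:
            return True
        if action in GATED_ACTIONS:
            return action in self.granted
        return False  # unknown actions are denied by default
```

Granting `push` does not imply `merge`; each gated action needs its own grant.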
Use this pattern when optimizing an artifact against a stable evaluator:
read locked context -> choose hypothesis -> edit allowed surface -> commit/checkpoint
-> run evaluator -> log result -> keep if better -> discard or rollback if worse
-> repeat
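The loop above can be sketched in a few lines. This version keeps the artifact in memory and treats the current best as the checkpoint, so "rollback" is simply discarding the candidate; a real harness would more likely use git commits. The `propose` and `evaluate` callables are hypothetical stand-ins, and the sketch assumes a higher score is better.

```python
from typing import Callable


def optimize(
    artifact: str,
    propose: Callable[[str, int], str],
    evaluate: Callable[[str], float],
    steps: int,
) -> tuple[str, float, list[dict]]:
    """Run a keep-if-better loop and return (best, best_score, log)."""
    best = artifact
    best_score = evaluate(best)
    log = [{"step": 0, "score": best_score, "kept": True}]
    for step in range(1, steps + 1):
        candidate = propose(best, step)      # edit the allowed surface
        score = evaluate(candidate)          # evaluator is outside agent control
        kept = score > best_score
        log.append({"step": step, "score": score, "kept": kept})
        if kept:
            best, best_score = candidate, score
        # Otherwise the candidate is dropped and `best` remains the checkpoint.
    return best, best_score, log


# Toy usage: the "metric" counts the letter "a"; odd steps add "a", even steps add "b".
best, score, log = optimize(
    "",
    propose=lambda text, step: text + ("a" if step % 2 else "b"),
    evaluate=lambda text: float(text.count("a")),
    steps=4,
)
```

Every attempt is logged, including the discarded ones, so the record of failed ideas survives the rollback.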
Required properties:
Use this pattern when sources become skill changes:
discover -> retrieve -> gate -> score -> extract mechanism
-> map to existing or new skill -> draft proposal -> validate structure
-> prepare PR -> human review
The locked evaluator is a combination of source rubrics, skill-change rubrics, structure checks, and reviewer approval. The editable artifact is the proposed skill delta.
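A composite evaluator of this kind can be expressed as a pair of gates: one that decides whether a proposal is ready for a PR, and a stricter one that also requires reviewer approval. The field names and the 0.7 thresholds below are assumptions chosen for illustration; actual rubrics would define their own scales.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    source_score: float        # from the locked source rubric
    skill_change_score: float  # from the locked skill-change rubric
    structure_ok: bool         # deterministic structure checks
    reviewer_approved: bool    # set by a human, never by the agent


def ready_for_pr(ev: Evaluation, source_min: float = 0.7, skill_min: float = 0.7) -> bool:
    """All automated checks must pass before the agent prepares a PR."""
    return (
        ev.structure_ok
        and ev.source_score >= source_min
        and ev.skill_change_score >= skill_min
    )


def ready_to_merge(ev: Evaluation) -> bool:
    """Merging additionally requires explicit reviewer approval."""
    return ready_for_pr(ev) and ev.reviewer_approved
```

Keeping the per-dimension scores on the record, instead of a single blended number, makes it harder for a strong score on one dimension to hide a weak one.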
Assume an optimizing agent will learn the harness. Guard against:
Mitigation: lock rubrics per run, report per-dimension scores, require source retrieval evidence, preserve rejected attempts, and route governance changes to human review.
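Locking rubrics per run can be approximated by hashing the locked files when the run starts and checking the hashes before each score is accepted. This is a sketch under that assumption; the function names are hypothetical, and hashing only detects changes, it does not prevent them.

```python
import hashlib
from pathlib import Path


def snapshot(paths: list[str]) -> dict[str, str]:
    """Record a SHA-256 digest for each locked file at the start of a run."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def changed_since(lock: dict[str, str]) -> list[str]:
    """Return locked files that were modified or deleted after the snapshot."""
    changed = []
    for path, digest in lock.items():
        p = Path(path)
        if not p.exists() or hashlib.sha256(p.read_bytes()).hexdigest() != digest:
            changed.append(path)
    return sorted(changed)
```

If `changed_since` returns anything, the harness can reject the run's scores and route the change to human review.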
Use monitoring agents for long runs, but restrict them to read-only reporting unless explicitly tasked otherwise. Monitoring output should report:
research-run/
  THREAD.md
  sources/
    queue.md
  evaluations/
  proposals/
  logs/
    results.tsv
    rejected.md
  drafts/
Use TSV or JSONL for append-only machine-readable logs. Use Markdown for handoffs and reviewer-facing summaries.
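A JSONL results log needs little more than an append-mode writer and a line-by-line reader. The sketch below is one possible shape; the `ts` field and the function names are assumptions. Opening the file in append mode only keeps this writer from rewriting history, so a harness that needs stronger guarantees would also restrict file permissions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_result(log_path: Path, record: dict) -> None:
    """Append one timestamped JSON object as a single line."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    entry = {"ts": datetime.now(timezone.utc).isoformat(), **record}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, sort_keys=True) + "\n")


def read_results(log_path: Path) -> list[dict]:
    """Load every logged entry in the order it was written."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

One object per line means a monitoring agent can tail the file without parsing the whole log.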
Example 1: Locked metric
An agent optimizes train.py, but prepare.py owns data loading and evaluation. The agent can edit the model but cannot change the metric. Failed experiments are logged and rolled back.
Example 2: Locked rubric
An agent evaluates a new Anthropic or OpenAI engineering post, but the source curation rubric is locked for the run. If the source passes, the agent drafts a skill proposal. It cannot lower the rubric threshold to admit the source.
Example 3: Auto-PR without auto-merge
An agent prepares a branch and PR body after passing source, skill, and structure checks. The PR states unresolved risks and waits for human merge approval.
This skill connects to:
Internal references:
- researcher/README.md - Read when implementing the repo-native research-to-skill operating system
- researcher/rubrics/harness-change.md - Read when evaluating changes to an agent harness
- researcher/runbooks/autonomous-research-loop.md - Read when running a source-to-skill loop

External resources:

- autoresearch - Constrained autonomous experiment loop with locked evaluation

Created: 2026-05-14
Last Updated: 2026-05-15
Author: Agent Skills for Context Engineering Contributors
Version: 1.1.0