skills/harness-engineering-copilot/SKILL.md
Strategies, workflows, patterns, and best practices for maximizing harness engineering through GitHub Copilot agent customization. Focuses on how to structure context layering, design multi-agent constraint enforcement, implement checklist-driven audit loops, manage prompt and embed budgets, define artifact ownership, scale customization across repos, and evolve harnesses over time. For Copilot-specific syntax, file formats, and configuration rules, defer to the `agent-customization` skill. Triggers: "copilot harness", "copilot harness strategy", "agent-first copilot", "copilot harness patterns", "copilot context layering", "copilot agent fleet", "copilot entropy management", "copilot harness maturity", "scale copilot harness", "copilot customization strategy", "maximize copilot harness", "copilot harness checklist", "agent prompt budget", "instruction compression", "multi-agent audit", "agent variant", "runtime-specific tools", "custom agent runtime".
Strategies and patterns for building high-leverage harnesses through Copilot's customization system. A harness is the scaffolding — context layering, constraint enforcement, entropy management, and feedback loops — that makes agents reliably productive.
Related skills:
- harness-engineering: generic methodology (tool-agnostic).
- agent-customization: Copilot file formats, syntax, YAML frontmatter, tool aliases, and configuration rules. Defer all "how do I write this file" questions there.

Sources: OpenAI, "Harness Engineering" | Martin Fowler, "Harness Engineering"
Use this skill in two modes:
Context is a scarce resource. The harness must deliver the right context at the right time — not dump everything into every interaction.
| Tier | Copilot primitive | When it loads | Budget | What belongs here |
|---|---|---|---|---|
| Always-on | copilot-instructions.md, path-scoped .instructions.md | Every interaction (auto) | ~100 lines repo-wide; ~50 lines per path scope | Architecture map, boundary rules, pointer references |
| On-demand | Agent skills (SKILL.md) | When Copilot detects relevance | <500 lines body + unlimited references | Deep domain knowledge, schemas, runbooks |
| Task-scoped | Prompt files (.prompt.md), agent prompts | When user explicitly invokes | No hard limit (30K chars for agents) | Step-by-step workflows, checklists, batch operations |
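The budgets in the table above can be checked mechanically. The following is a minimal sketch, assuming budgets are measured as raw line or character counts; `BUDGETS`, `measure`, and `over_budget` are hypothetical names for illustration, not part of any Copilot tooling.

```python
# Hypothetical budget check; tier keys are illustrative labels and the
# limits mirror the approximate budgets in the table above.
BUDGETS = {
    "repo-instructions": ("lines", 100),   # copilot-instructions.md
    "path-instructions": ("lines", 50),    # path-scoped .instructions.md
    "skill-body": ("lines", 500),          # SKILL.md body
    "agent-prompt": ("chars", 30_000),     # agent prompt body
}


def measure(text: str, unit: str) -> int:
    """Return the size of text in the given unit ("lines" or "chars")."""
    return len(text.splitlines()) if unit == "lines" else len(text)


def over_budget(kind: str, text: str) -> int:
    """Return how far text exceeds its budget (0 when within budget)."""
    unit, limit = BUDGETS[kind]
    return max(0, measure(text, unit) - limit)
```

A check like this could run in CI so that an overloaded Tier 1 file is reported as a number rather than discovered during review.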
Use the agent-customization skill's decision flow to decide which primitive owns which layer instead of choosing files ad hoc.
| Context layer | Preferred primitive | Why this fits | agent-customization check |
|---|---|---|---|
| Always-on | Workspace instructions | Auto-loaded context should only hold stable routing rules and boundary constraints | Keep applyTo narrow; avoid applyTo: "**" unless the rule truly applies everywhere |
| Always-on, path-specific | File instructions | Architectural rules should follow folder or module boundaries | One instruction file per boundary, with explicit applyTo globs matching repo structure |
| On-demand | Skills | Deep reference material should load only when the task is relevant | Put discovery phrases in description; keep body lean and offload depth to references/assets |
| Task-scoped | Prompts | Repeatable batch workflows should be explicit, not always resident in context | Use prompts for single focused tasks with parameters |
| Task-scoped with isolation | Custom agents | Multi-stage work or restricted tool use needs isolated context and an enforcement contract | Use an agent when you need delegation, context isolation, or different tool boundaries per stage |
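The "keep applyTo narrow" check from the table above lends itself to a small lint. This sketch assumes the instruction file starts with a `---` frontmatter block, that `applyTo` is written on one line, and that multiple globs are comma-separated; `BROAD_GLOBS`, `apply_to_globs`, and `is_too_broad` are hypothetical names.

```python
import re

# Assumed set of globs that effectively match the whole repository.
BROAD_GLOBS = {"**", "**/*", "*"}


def apply_to_globs(markdown: str) -> list[str]:
    """Extract applyTo globs from a leading '---' frontmatter block."""
    match = re.match(r"---\n(.*?)\n---", markdown, re.DOTALL)
    if not match:
        return []
    for line in match.group(1).splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "applyTo":
            cleaned = value.strip().strip("\"'")
            return [g.strip() for g in cleaned.split(",") if g.strip()]
    return []


def is_too_broad(markdown: str) -> bool:
    """True when any applyTo glob applies to the whole repo."""
    return any(g in BROAD_GLOBS for g in apply_to_globs(markdown))
```

A flagged file is not necessarily wrong; the lint only forces the author to confirm that the rule truly applies everywhere.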
Before adding a new file, ask the agent-customization questions in this order:
4. … description, or are you relying on file names and hope?

If you cannot answer those four questions clearly, the layer boundary is still underspecified.
- If copilot-instructions.md exceeds ~100 lines, you're overloading Tier 1.
- Use applyTo glob patterns that mirror your module/layer structure. When an agent edits src/server/**, it should receive server-layer rules, not frontend rules.
- Offload depth to references/ folders. The SKILL.md body stays lean; the agent loads references/schema.md only when actually needed.
- If a rule is in copilot-instructions.md, don't repeat it in a skill. Use cross-references instead.
- If a pointer chain runs AGENTS.md → docs/index.md → docs/design-docs/index.md → docs/design-docs/auth.md, that's too deep. Flatten to AGENTS.md → docs/design-docs/auth.md.
- Treat description as part of the harness. A skill or agent that cannot be discovered by its description is effectively missing from the harness, even if the file exists.

Stale context is worse than no context: agents confidently follow outdated rules.
- If an instruction references docs/RELIABILITY.md and the file doesn't exist, fail the build.
- If backend.instructions.md hasn't changed but src/server/ has had 50 commits, something is probably stale.

A single omniscient agent is an anti-pattern. Design an agent fleet (also called an agent squad or agent swarm) where each agent has a narrow responsibility and minimal tool access.
| Role | Purpose | Tool profile | Invocation |
|---|---|---|---|
| Observer | Detect violations, audit quality, report findings | read + search only | On-demand or scheduled |
| Actor | Implement changes within enforced boundaries | read + edit + search + execute | Task-driven |
| Maintainer | Fix drift, update docs, refactor toward golden principles | read + edit + search | Scheduled/periodic |
Step 1: Identify enforcement surfaces. List every invariant you want to enforce (dependency direction, naming, logging, test coverage, doc freshness). Each surface maps to an Observer agent.
Step 2: Define tool boundaries. For each agent, ask: What's the minimum tool set needed? An architecture reviewer needs read + search, never edit. A test generator needs execute to run tests. Over-provisioning tools undermines the harness.
Step 3: Write agent prompts as enforcement contracts. The agent's markdown body is its enforcement contract — a precise specification of what to check, what to flag, and how to remediate. Structure every enforcement agent prompt as:
1. SCOPE: What files/modules to examine
2. RULES: Numbered invariants to check (reference docs/ for details)
3. PROCESS: Step-by-step verification procedure
4. OUTPUT: Exact format for findings (file, line, violation, remediation)
5. BOUNDARIES: What the agent must NOT do
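Because the contract structure above is fixed, its presence can be verified before an agent file is merged. This is a minimal sketch, assuming each section is introduced by its upper-case name at the start of a line (optionally numbered or written as a markdown heading); `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names.

```python
import re

# Section names come from the structure above; the label layout is assumed.
REQUIRED_SECTIONS = ("SCOPE", "RULES", "PROCESS", "OUTPUT", "BOUNDARIES")

# Matches "1. SCOPE:", "## SCOPE", or "SCOPE:" at the start of a line.
SECTION_LABEL = re.compile(
    r"^[ \t]*(?:\d+\.[ \t]*|#+[ \t]*)?([A-Z]+)(?::|[ \t]*$)", re.MULTILINE
)


def missing_sections(prompt_body: str) -> list[str]:
    """Return required section names that have no label in the prompt body."""
    found = set(SECTION_LABEL.findall(prompt_body))
    return [name for name in REQUIRED_SECTIONS if name not in found]
```

An empty result means the prompt has the five sections; it says nothing about whether their content is precise enough to enforce.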
Step 4: Chain agents for complex workflows. Use subagent invocation (agent tool alias) to compose:
This mirrors the "Ralph Wiggum Loop" pattern: agents review each other's work in a feedback loop until all reviewers are satisfied.
| Pattern | Strategy | When to use |
|---|---|---|
| Read-only observer | tools: ["read", "search"] | Auditing, reviewing, scanning |
| Edit-no-execute | tools: ["read", "edit", "search"] | Documentation updates, config changes, refactoring |
| Full actor | tools: ["read", "edit", "search", "execute"] | Implementation requiring test runs or builds |
| Scoped MCP | tools: ["read", "search", "playwright/screenshot"] | Agents that need one specific external capability |
| Subagent-only | tools: ["read", "search", "agent"] | Orchestrators that delegate all work |
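The tool profiles above can double as an allowlist for a least-privilege audit. In this sketch the role keys are hypothetical labels chosen for illustration; the tool sets are copied from the table, and the scoped-MCP pattern is left out because its extra tool varies per agent.

```python
# Tool profiles taken from the table above; role keys are hypothetical labels.
ALLOWED_TOOLS = {
    "observer": {"read", "search"},
    "edit-no-execute": {"read", "edit", "search"},
    "actor": {"read", "edit", "search", "execute"},
    "orchestrator": {"read", "search", "agent"},
}


def excess_tools(role: str, tools: list[str]) -> list[str]:
    """Return tools granted beyond the role's profile, sorted for stable output."""
    return sorted(set(tools) - ALLOWED_TOOLS[role])
```

For example, `excess_tools("observer", ["read", "search", "edit"])` reports `edit`, which is exactly the over-provisioning Step 2 warns against.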
Use runtime to describe the Copilot execution environment of the agent file: VS Code, GitHub Copilot coding agent, CLI-backed coding flows, background agents, cloud agents, or an SDK-hosted flow that adopts one of those schemas. Use platform for OS scope such as Windows, WSL/Linux, or macOS.
When the same specialist role needs different frontmatter, tool namespaces, or delegation mechanics across runtimes, create an agent variant: a runtime-specific .agent.md wrapper over shared instructions.
| Fact | Design implication |
|---|---|
| target is officially vscode or github-copilot | Scope each variant to the runtime family it is meant for instead of assuming one file behaves identically everywhere |
| tools defaults to all tools when omitted; tools: [] disables all tools | Omit tools only when you genuinely want broad capability; otherwise whitelist aggressively |
| Unrecognized tool names are ignored | A mixed-runtime tool list can fail silently; audit tool names per runtime rather than trusting parse success |
| disable-model-invocation is the canonical control and infer is retired | Use the new field when deciding whether a variant is user-selectable, subagent-only, or both |
| VS Code exposes agents, argument-hint, handoffs, and richer tool discovery | Keep VS Code orchestration and guided transitions in the VS Code variant instead of leaking them into shared instructions |
| GitHub Copilot runtime supports mcp-servers and namespaced MCP tools | Put runtime-specific MCP wiring in the github-copilot variant, not in the shared behavior contract |
Split each specialized agent into two layers:
- … target, tools, agents, handoffs, mcp-servers, and any runtime-specific tool references.

The shared layer should say what capability is required, not which runtime-specific tool name to call. Write "read the file", "run tests", or "delegate to the reviewer" rather than embedding a runtime-specific token.
Use the official target values in the variant frontmatter: vscode or github-copilot.
| Runtime family | Frontmatter focus | Tool assignment rule | Delegation model |
|---|---|---|---|
| VS Code | target: vscode, optional agents, argument-hint, handoffs, model | Use VS Code tool names, toolsets, extension tools, or MCP namespaced tools. If agents is specified, include the agent tool. | Explicit subagent wiring via agents |
| GitHub Copilot | target: github-copilot, optional mcp-servers, disable-model-invocation, user-invocable | Prefer official aliases such as read, edit, search, execute, agent, plus namespaced MCP tools like playwright/* or github/* | Delegation through model invocation and custom-agent/task tooling |
| Background or cloud agent flow | Usually inherits the VS Code custom-agent model | Reuse a VS Code-oriented variant unless the host removes or overrides tools | Same as host runtime |
| SDK-hosted flow | Treat as host-defined until proven otherwise | Do not assume .agent.md fields or tool names map 1:1; align the variant to the runtime actually consuming it | Host-dependent |
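Several of the frontmatter facts above can be turned into a per-variant audit. This sketch assumes the frontmatter has already been parsed into a plain dictionary by whatever YAML loader the repo uses; `audit_variant` and its warning strings are hypothetical, and the rules encode only what the tables above state.

```python
KNOWN_TARGETS = {"vscode", "github-copilot"}


def audit_variant(frontmatter: dict) -> list[str]:
    """Return warnings for one parsed .agent.md frontmatter dictionary."""
    warnings = []
    target = frontmatter.get("target")
    if target is not None and target not in KNOWN_TARGETS:
        warnings.append(f"unknown target: {target}")
    if "tools" not in frontmatter:
        # Omitting tools grants every tool, so make the omission visible.
        warnings.append("tools omitted: agent receives all tools")
    if "infer" in frontmatter:
        warnings.append("infer is retired: use disable-model-invocation")
    if (
        "agents" in frontmatter
        and "tools" in frontmatter
        and "agent" not in frontmatter["tools"]
    ):
        warnings.append("agents is set but the agent tool is missing")
    return warnings
```

The audit deliberately does not validate individual tool names, since the valid set depends on the runtime that consumes the file.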
- … github-copilot variant silently disappears.
- … tools was omitted unintentionally.
- … disable-model-invocation was not set.

Agent prompts in a harness serve a different purpose than general-purpose prompts. They are mechanical contracts, not creative guidance.
- … src/server/ has a corresponding test in tests/server/" is enforceable.
- Point to ARCHITECTURE.md or docs/conventions/ for rule details rather than reproducing them. This prevents the prompt from going stale independently.

Harnesses usually fail in the operational details, not in the high-level design. Convert repo-specific review checklists into a reusable audit protocol with explicit passes and measurable outputs.
| Pass | Question | Typical checks |
|---|---|---|
| Structural contract | Does each instruction or agent file contain the sections the workflow depends on? | Required sections present, numbered rules complete, one shared protocol block instead of duplicates, output contracts match referenced artifacts |
| Budget discipline | Is every prompt body below the runtime limit with safety margin? | Soft ceiling below hard max, frontmatter excluded from measurement, embedded agent body re-measured after every instruction change |
| Consistency | Do names, signals, paths, and schemas line up across files? | Exact agent names, exact delegation targets, one signal vocabulary, one path convention, no deprecated frontmatter keys |
| Ownership | Does every writable artifact have exactly one owner? | Planner-only artifacts stay planner-owned, shared progress artifacts are append-only, reviewers observe rather than overwrite |
| Distribution | Will the runtime load this deterministically? | Build-time embedding or bundling, no runtime instruction reads for CLI-only bodies, official plugin schema only |
Every audit pass should report findings in a fixed shape:
| Field | Meaning |
|---|---|
| Artifact | File, folder, or workflow surface being checked |
| Invariant | Exact rule that failed |
| Evidence | Concrete mismatch: missing section, wrong name, over-budget body, conflicting owner |
| Remediation | Smallest change that restores consistency |
| Re-run trigger | What future edit should force this pass to run again |
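A fixed finding shape is easiest to keep fixed when it is a type. The following is a minimal sketch in which the field names follow the table above; the `Finding` class and `to_row` helper are hypothetical, and the sample values in the usage note are invented for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Finding:
    """One audit finding; fields mirror the table above."""
    artifact: str
    invariant: str
    evidence: str
    remediation: str
    rerun_trigger: str


def to_row(finding: Finding) -> str:
    """Render a finding as a single markdown table row."""
    return "| " + " | ".join(asdict(finding).values()) + " |"
```

For instance, `to_row(Finding("agents/reviewer.agent.md", "body under budget", "body over limit", "move examples to references/", "instruction body changed"))` yields one row that any audit pass can append to a shared report.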
Prompt compression is not cosmetic. It preserves budget for task context and reduces drift.
Always remove:
Always preserve:
For every auditable harness, keep a small measurement set close at hand:
Every agent-generated line of code can introduce drift. Entropy management is the discipline of detecting and correcting drift before it compounds.
Define a small set (5–10) of non-negotiable, mechanically verifiable rules:
| # | Principle | Verification method |
|---|---|---|
| 1 | Shared utilities over hand-rolled helpers | Lint: flag duplicate utility patterns |
| 2 | Parse data at boundaries, never YOLO-probe | Lint: detect untyped API calls |
| 3 | Structured logging with correlation IDs | Lint: flag console.log / raw print |
| 4 | One module = one domain, no cross-domain imports | Structural test: import graph validation |
| 5 | Every public API has a test | Coverage check: map exports → test files |
Key insight: Golden principles must be verifiable by linters, structural tests, or agents — not just documented. If you can't automate the check, it's a guideline, not a golden principle.
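As an illustration of a mechanically verifiable principle, principle 3 can be approximated with a line scan. This sketch assumes that raw logging means `console.log(` in JavaScript or TypeScript and a bare `print(` in Python; `RAW_LOG` and `raw_logging_lines` are hypothetical names, and a real linter plugin would be more precise than a regular expression.

```python
import re

# Assumed patterns for principle 3: console.log( and a bare print( call.
# The lookbehind skips names such as pprint( and method calls such as x.print(.
RAW_LOG = re.compile(r"\bconsole\.log\(|(?<![\w.])print\(")


def raw_logging_lines(source: str) -> list[int]:
    """Return 1-based line numbers that contain raw logging calls."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if RAW_LOG.search(line)
    ]
```

The returned line numbers map directly onto the file, line, violation, remediation output format used by enforcement agents.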
| Frequency | What to check | Agent type |
|---|---|---|
| Per-commit (CI) | Lint rules, structural tests, doc cross-references | Deterministic (linters/tests) |
| Daily | Doc freshness, quality score drift, stale TODOs | Maintainer agent |
| Weekly | Golden principle deviations, pattern duplication, tech debt inventory | Maintainer agent, batch prompt |
| Per-sprint | Full quality scoring across all domains/layers | Observer agent + human review |
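The per-commit doc cross-reference check in the table above could be sketched as follows. It assumes references are written as plain relative paths under `docs/` ending in `.md`; `DOC_REF` and `broken_references` are hypothetical names, and a CI job would fail the build when the returned list is non-empty.

```python
import re
from pathlib import Path

# Assumes references look like docs/<path>.md, relative to the repo root.
DOC_REF = re.compile(r"\bdocs/[\w./-]+\.md\b")


def broken_references(instruction_text: str, repo_root: Path) -> list[str]:
    """Return referenced docs/ paths that do not exist under repo_root."""
    refs = sorted(set(DOC_REF.findall(instruction_text)))
    return [ref for ref in refs if not (repo_root / ref).is_file()]
```

Running this over every instruction, skill, and agent file keeps pointers from outliving the documents they point to.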
Maintain a versioned QUALITY_SCORE.md that grades each domain and layer:
| Domain | Types | Config | Service | Tests | Docs | Overall |
|--------|-------|--------|---------|-------|------|---------|
| Auth | A | A | B | B | C | B |
| Billing| A | B | B | C | D | C+ |
| Search | B | B | C | D | F | D+ |
This gives both humans and agents a map of where debt lives. A maintainer agent can read this and prioritize: "Search.Docs is F — generate missing documentation for the Search domain."
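The prioritization described above could be sketched as a small parser over the score table. This assumes the file contains a single markdown table shaped like the example, and that grades fall on the scale listed in `GRADE_ORDER`; that scale, `parse_scores`, and `worst_cells` are hypothetical.

```python
# Assumed grade scale, ordered worst to best.
GRADE_ORDER = ["F", "D", "D+", "C", "C+", "B", "B+", "A"]


def parse_scores(table: str) -> dict[str, dict[str, str]]:
    """Parse a QUALITY_SCORE.md-style table into {domain: {column: grade}}."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table.strip().splitlines()
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator
    return {row[0]: dict(zip(header[1:], row[1:])) for row in body}


def worst_cells(
    scores: dict[str, dict[str, str]], count: int = 3
) -> list[tuple[str, str, str]]:
    """Return the lowest-graded (domain, column, grade) cells, ignoring Overall."""
    cells = [
        (domain, column, grade)
        for domain, columns in scores.items()
        for column, grade in columns.items()
        if column != "Overall"
    ]
    return sorted(cells, key=lambda cell: GRADE_ORDER.index(cell[2]))[:count]
```

Applied to the example table, the first cell returned is Search / Docs / F, which matches the maintainer task quoted above.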
- … console.log, flag the outliers.
- If src/billing/ changed 30 times this month but docs/billing.md changed 0 times, the docs are likely stale.
- … auth/ in billing/ violates boundary: move shared types to shared/types/", the agent can act on it directly. The error message is the agent's instruction.

Agents can only verify what they can observe. Extending the agent's senses beyond static files dramatically increases harness leverage.
| Layer | What the agent can see | Copilot mechanism | Harness value |
|---|---|---|---|
| Static files | Source code, docs, config | Built-in (read/search) | Baseline — always available |
| Build/test output | Compilation errors, test results, lint output | execute tool | Validates correctness |
| Browser state | DOM, screenshots, navigation, console errors | MCP (Playwright) | UI verification without manual testing |
| Runtime telemetry | Logs, metrics, traces | Custom MCP server | Performance and reliability validation |
| Repository state | Issues, PRs, CI status, branch state | MCP (GitHub) | Workflow-aware decisions |
A harness is not a one-time setup. It evolves with the codebase.
| Level | Context | Constraints | Entropy | Legibility |
|---|---|---|---|---|
| 1 — Ad-hoc | No instructions | No enforcement | No cleanup | Static files only |
| 2 — Documented | copilot-instructions.md exists | Conventions documented but not enforced | Manual cleanup | Build/test output |
| 3 — Scoped | Path-scoped instructions + skills | Linters catch some violations | Weekly batch prompts | Browser automation |
| 4 — Enforced | Full three-tier context | Structural tests + CI gates block violations | Daily maintainer agents | Runtime telemetry |
| 5 — Self-healing | Agent-maintained instructions | Agents detect and fix violations autonomously | Continuous GC with quality scoring | Full stack legibility |
- … harness-engineering skill. Rate each component 0–2.

Tie harness maintenance to concrete change events instead of vague periodic reviews.
| Change | Re-run |
|---|---|
| Instruction body changed | Structural contract, budget discipline, consistency |
| Agent frontmatter or delegation changed | Consistency, distribution |
| Signal schema changed | Consistency across every agent and instruction |
| Workflow paths changed | Structural contract, ownership, consistency |
| Plugin or bundle pipeline changed | Distribution and budget discipline |
If a harness cannot tell you which checks to rerun after a change, it is still too implicit.
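One way to make the re-run rules explicit is to store them as data. In this sketch the mapping mirrors the change and re-run table above; the event keys, `RERUN`, and `passes_to_rerun` are hypothetical names.

```python
# Mapping mirrors the change/re-run table above; keys are hypothetical event names.
RERUN = {
    "instruction-body": {"structural contract", "budget discipline", "consistency"},
    "agent-frontmatter": {"consistency", "distribution"},
    "signal-schema": {"consistency"},
    "workflow-paths": {"structural contract", "ownership", "consistency"},
    "bundle-pipeline": {"distribution", "budget discipline"},
}


def passes_to_rerun(changes: list[str]) -> list[str]:
    """Return the sorted union of audit passes triggered by the given changes."""
    triggered: set[str] = set()
    for change in changes:
        triggered |= RERUN[change]
    return sorted(triggered)
```

A pull request that touches both agent frontmatter and the bundle pipeline would then re-run budget discipline, consistency, and distribution, and nothing else.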
For organizations with multiple repos:
- … .github-private repo) that apply everywhere. Keep repo-level agents for domain-specific concerns.

| Anti-pattern | Why it fails | Better strategy |
|---|---|---|
| Monolithic instructions | Crowds out task context at 1000+ lines; rots fast | Three-tier context layering with 100-line Tier 1 |
| One omniscient agent | No tool boundaries; can't reason about everything at once | Agent fleet with Observer / Actor / Maintainer roles |
| Duplicate context | Same rule in instructions, skills, and agent prompts diverges over time | Single source of truth with cross-references |
| Verbal-only enforcement | "We always do X" isn't legible to agents | Encode in linters, tests, or observer agents |
| Big-bang cleanup | 20% of the week on "AI slop" doesn't scale | Continuous garbage collection at daily cadence |
| Static-only legibility | Agent can't verify its own UI changes or performance | Wire browser automation and observability MCP |
| Set-and-forget harness | Codebase evolves, harness doesn't | Treat instructions as code; CI-check freshness |
| Open-ended agent contracts | "Review code quality" produces inconsistent results | Precise enforcement contracts with numbered rules |
| Checklist-free maintenance | Review quality depends on memory and reviewer taste | Encode reusable audit passes with fixed outputs and triggers |