Overview

After evolving os-architect or its downstream agents, you need proof that the changes actually work. This skill dispatches os-architect in single-shot simulation mode for each test scenario and verifies artifact presence — not by reading the transcript, but by checking that expected files exist or expected content appears in output.

Evolution is verified by artifact presence, not by transcript review.

Artifact Verification Table

| Evolution Type | What to Check |
|---|---|
| Path C (Gap Fill) | SKILL.md present at expected path |
| Path B (Update) | tasks/todo/<slug>-plan.md AND tasks/todo/copilot_prompt_<slug>.md written |
| Path A+ (No-op) | No new files written; HANDOFF_BLOCK contains STATUS: complete |
| Category 3 (Lab Setup) | improvement/run-config.json written AND HANDOFF_BLOCK emitted |
| HANDOFF_BLOCK integrity | All 7 fields present: INTENT, TARGET, PATH, DISPATCH, STATUS, OUTPUTS, NEXT_ACTION |
| Confidence model | Low confidence prompt → clarifying question appears before Phase 2 audit |

Phase 1 — Resolve Test Inputs

If invoked with all, find test scenarios:

ls temp/os-evolution-verifier/scenarios/*.json 2>/dev/null | sort

If invoked with a specific file, verify it exists and is valid JSON with required fields:

python3 -c "
import json, sys
# The scenario path arrives as argv[1], so quotes in the path cannot break the Python source
with open(sys.argv[1]) as fh:
    d = json.load(fh)
required = ['id', 'name', 'path', 'prompt', 'expected_artifact', 'artifact_check']
missing = [f for f in required if f not in d]
if missing:
    print(f'SCHEMA ERROR: missing fields: {missing}'); sys.exit(1)
print(f'Scenario: {d[\"id\"]} — {d[\"name\"]}')
" "$SCENARIO_FILE"

If no scenarios found and no file given, report:

"No test scenarios found. Create scenario JSON files in temp/os-evolution-verifier/scenarios/ or run context-bundler (red-team mode) to generate them from os-architect-agent.md."

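The discovery and schema steps above can also be sketched as one Python helper that loads a whole scenario directory. This is a minimal sketch, not part of the skill: the function names `missing_fields` and `load_scenarios` are hypothetical, and only the required-field list is taken from the check above.

```python
import json
from pathlib import Path

# Required fields, copied from the schema check above.
REQUIRED_FIELDS = ["id", "name", "path", "prompt", "expected_artifact", "artifact_check"]


def missing_fields(scenario: dict) -> list[str]:
    """Return the required fields that are absent from a scenario dict."""
    return [field for field in REQUIRED_FIELDS if field not in scenario]


def load_scenarios(scenario_dir: str) -> tuple[list[dict], list[str]]:
    """Load every *.json file in scenario_dir, sorted by file name.

    Returns (valid_scenarios, errors). A file that is not valid JSON, is not
    a JSON object, or lacks required fields becomes an error string instead
    of raising, so one bad file does not abort the whole run.
    """
    valid, errors = [], []
    for path in sorted(Path(scenario_dir).glob("*.json")):
        try:
            scenario = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError) as exc:
            errors.append(f"{path.name}: unreadable or invalid JSON ({exc})")
            continue
        if not isinstance(scenario, dict):
            errors.append(f"{path.name}: top-level JSON value is not an object")
            continue
        missing = missing_fields(scenario)
        if missing:
            errors.append(f"{path.name}: SCHEMA ERROR: missing fields: {missing}")
        else:
            valid.append(scenario)
    return valid, errors
```

Calling `load_scenarios("temp/os-evolution-verifier/scenarios")` returns the usable scenarios together with a list of per-file problems; if both lists are empty, the "No test scenarios found" message applies.
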
Phase 2 — Dispatch os-architect (Single-Shot Simulation)

For each scenario, dispatch os-architect via Copilot CLI in simulation mode. The system prompt is the full content of plugins/agent-agentic-os/agents/os-architect-agent.md. The user turn is the scenario prompt.

Step 2 — Dispatch via copilot-cli-agent skill

Invoke the copilot-cli-agent skill with the following parameters:

Heartbeat check (always first):
- prompt_file: /dev/null
- context: /dev/null
- output: temp/os-evolution-verifier/heartbeat.md
- instruction: "HEARTBEAT CHECK: Respond HEARTBEAT_OK only."
- model: gpt-5-mini
Verify heartbeat before premium dispatch:
```
grep -q "HEARTBEAT_OK" temp/os-evolution-verifier/heartbeat.md || \
  { echo "HEARTBEAT FAILED — aborting test run"; exit 1; }
```
Main Dispatch (simulation of os-architect):
- prompt_file: plugins/agent-agentic-os/agents/os-architect-agent.md
- context: /dev/null
- output: temp/os-evolution-verifier/output_${SCENARIO_ID}.md
- instruction: "$SCENARIO_PROMPT"
- model: claude-sonnet-4.6
- mode: non-interactive
After dispatch, verify output before claiming complete:
```
wc -l "temp/os-evolution-verifier/output_${SCENARIO_ID}.md"
test -s "temp/os-evolution-verifier/output_${SCENARIO_ID}.md" || echo "ERROR: empty output — copilot-cli-agent dispatch failed"
```

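Wait for completion, then confirm the output file is non-empty (expect 100+ lines for a real run). The checks in Phase 3 refer to this file as $OUTPUT_FILE:

OUTPUT_FILE="temp/os-evolution-verifier/output_${SCENARIO_ID}.md"
wc -l "$OUTPUT_FILE"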

Phase 3 — Artifact Verification

Run the artifact check specified in the scenario's artifact_check field.

HANDOFF_BLOCK integrity check

# All 7 required fields must appear in output
FIELDS=("INTENT:" "TARGET:" "PATH:" "DISPATCH:" "STATUS:" "OUTPUTS:" "NEXT_ACTION:")
MISSING=()
for field in "${FIELDS[@]}"; do
  grep -q "$field" "$OUTPUT_FILE" || MISSING+=("$field")
done

if [ ${#MISSING[@]} -eq 0 ]; then
  echo "PASS: HANDOFF_BLOCK has all 7 required fields"
else
  echo "FAIL: HANDOFF_BLOCK missing: ${MISSING[*]}"
fi
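For use inside a Python harness, the same seven-field scan might look like the sketch below. It is slightly stricter than the grep loop: it assumes a field name should not be preceded by a letter or underscore, so a line such as `EXIT_STATUS: 0` does not count as `STATUS:`. That rule and the function name `missing_handoff_fields` are assumptions, not part of the skill.

```python
import re

# The seven required HANDOFF_BLOCK fields, as listed in the check above.
HANDOFF_FIELDS = ["INTENT", "TARGET", "PATH", "DISPATCH", "STATUS", "OUTPUTS", "NEXT_ACTION"]


def missing_handoff_fields(output_text: str) -> list[str]:
    """Return the field names that never appear as 'FIELD:' in output_text.

    Assumption: a match must not be preceded by a letter or underscore, which
    avoids counting longer identifiers that merely end in a field name.
    """
    missing = []
    for name in HANDOFF_FIELDS:
        pattern = rf"(?<![A-Za-z_]){re.escape(name)}:"
        if re.search(pattern, output_text) is None:
            missing.append(name)
    return missing
```

An empty return value corresponds to the PASS branch above; a non-empty list names the fields to report in the FAIL message.
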

File existence check (Path B/C)

# Check for expected artifact files written by os-evolution-planner
EXPECTED_FILE="$ARTIFACT_PATH"
if [ -f "$EXPECTED_FILE" ]; then
  echo "PASS: Artifact found at $EXPECTED_FILE"
  wc -l "$EXPECTED_FILE"
else
  echo "FAIL: Expected artifact not found: $EXPECTED_FILE"
fi

No-op check (Path A+)

# Verify STATUS: complete in HANDOFF_BLOCK and no new plan files created
grep -q "STATUS: complete" "$OUTPUT_FILE" && echo "PASS: Status is complete" || echo "FAIL: Status not complete"
PLAN_COUNT=$(find tasks/todo -name "*.md" -newer "$OUTPUT_FILE" 2>/dev/null | wc -l)
[ "$PLAN_COUNT" -eq 0 ] && echo "PASS: No new task files written" || echo "FAIL: $PLAN_COUNT unexpected task files created"
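One caveat with the `find -newer` comparison: it only counts files modified after the output file's own modification time, so task files written while the agent was still producing output may be missed. A possible alternative, sketched below under that assumption, is to snapshot the directory before dispatch and compare afterwards. The helper names `snapshot_markdown` and `new_files` are hypothetical.

```python
from pathlib import Path


def snapshot_markdown(directory: str) -> set[str]:
    """Return the set of *.md paths currently under directory (recursive).

    A missing directory yields an empty set rather than an error.
    """
    root = Path(directory)
    if not root.is_dir():
        return set()
    return {str(path) for path in root.rglob("*.md")}


def new_files(before: set[str], after: set[str]) -> list[str]:
    """Return the paths present in 'after' but not in 'before', sorted."""
    return sorted(after - before)
```

Take `before = snapshot_markdown("tasks/todo")` ahead of the dispatch and `after = snapshot_markdown("tasks/todo")` once it finishes; a Path A+ no-op run should give `new_files(before, after) == []`.
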

Confidence model check

# Low confidence prompt must produce a clarifying question before Phase 2
grep -q "Confidence: Low" "$OUTPUT_FILE" && echo "PASS: Confidence: Low detected" || echo "FAIL: Confidence field not Low"
# Line of the first question mark, and line of the first audit language ("Checking existing", "audit", "Phase 2")
CLARIFICATION_LINE=$(grep -n "?" "$OUTPUT_FILE" | head -1 | cut -d: -f1)
AUDIT_LINE=$(grep -n "Checking existing\|audit\|Phase 2" "$OUTPUT_FILE" | head -1 | cut -d: -f1)
# A question must exist (an empty CLARIFICATION_LINE is a failure) and must precede any audit output
if [ -n "$CLARIFICATION_LINE" ] && { [ -z "$AUDIT_LINE" ] || [ "$CLARIFICATION_LINE" -lt "$AUDIT_LINE" ]; }; then
  echo "PASS: Clarifying question appeared before audit"
else
  echo "FAIL: No clarifying question found, or audit started before it"
fi
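The ordering rule (first question mark strictly before the first audit language) can be expressed as a small Python predicate, which is easier to unit test than the shell version. This is a sketch: the function names are hypothetical, and the audit keywords are the three patterns used in the grep above.

```python
import re
from typing import Callable, Optional

# Same audit markers as the grep pattern above (case-sensitive).
AUDIT_PATTERN = re.compile(r"Checking existing|audit|Phase 2")


def first_line_matching(lines: list[str], predicate: Callable[[str], bool]) -> Optional[int]:
    """Return the 1-based number of the first line satisfying predicate, or None."""
    for number, line in enumerate(lines, start=1):
        if predicate(line):
            return number
    return None


def question_precedes_audit(output_text: str) -> bool:
    """True when a question mark appears on an earlier line than any audit language.

    An output with no question mark at all returns False: the low-confidence
    path requires a clarifying question to be asked.
    """
    lines = output_text.splitlines()
    question = first_line_matching(lines, lambda line: "?" in line)
    audit = first_line_matching(lines, lambda line: AUDIT_PATTERN.search(line) is not None)
    if question is None:
        return False
    return audit is None or question < audit
```

A question mark is only a rough proxy for a clarifying question, so a failing result is best confirmed against the output file itself.
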

Phase 4 — Record Result

Append to temp/os-evolution-verifier/test-report.md:

## $SCENARIO_ID — $SCENARIO_NAME

**Status**: [PASS | FAIL]
**Path**: [A / A+ / B / C]
**Prompt**: `$SCENARIO_PROMPT`
**Artifact check**: $ARTIFACT_CHECK_COMMAND
**Evidence**:

[grep or file-exists output]

**Failure mode tested**: $FAILURE_MODE
**Time**: $ELAPSED seconds
---

Phase 5 — Summary Report

After all scenarios run, write summary to temp/os-evolution-verifier/test-report.md.

Each scenario result uses the structured EVOLUTION_VERIFICATION block:

## EVOLUTION_VERIFICATION
SESSION_ID: [from HANDOFF_BLOCK TARGET field or scenario id]
SESSION_COMPLETE: [true | false — false means session still in Phase 1/2, no HANDOFF_BLOCK expected]
STATUS: [complete | intentional_pause | crashed]
PATH: [A | A+ | B | C | pending]
OUTPUTS_DECLARED: [N — count of files mentioned in HANDOFF_BLOCK OUTPUTS field]
OUTPUTS_VERIFIED: [N — count that passed artifact check]
OUTPUTS_MISSING: [list of missing file paths, or "none"]
HANDOFF_BLOCK_VALID: [true | false | N/A — N/A when SESSION_COMPLETE: false]
SCAFFOLD_VALID: [true | false | N/A]
PLAN_WRITTEN: [true | false | N/A]
DISPATCH_RAN: [true | false | N/A]
VERDICT: [PASS | PARTIAL | FAIL]
NOTES: [any file-level anomalies or ordering violations]

STATUS field values — required, disambiguates SESSION_COMPLETE: false:

| STATUS | When to use | VERDICT |
|--------|-------------|---------|
| complete | SESSION_COMPLETE: true; HANDOFF_BLOCK present and valid | PASS or PARTIAL |
| intentional_pause | SESSION_COMPLETE: false; agent asked a clarifying question or hit a documented HARD-GATE; output > 50 lines | PASS (gate behavior is correct) |
| crashed | SESSION_COMPLETE: false; output < 50 lines, no clarifying question, no HANDOFF_BLOCK, or run_agent.py returned non-zero | FAIL |

When SESSION_COMPLETE: false and STATUS: intentional_pause, HANDOFF_BLOCK_VALID must be N/A — a missing HANDOFF_BLOCK is expected behavior, not a schema violation.

When SESSION_COMPLETE: false and STATUS: crashed, VERDICT must be FAIL regardless of any other fields — a silent crash must never be reported as PARTIAL or PASS.

Use PARTIAL when some outputs are present but not all — it pinpoints exactly which workstream failed rather than collapsing everything into a binary pass/fail.
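The STATUS and VERDICT rules above could be encoded roughly as follows. This sketch makes three assumptions that the text does not settle: an incomplete session counts as an intentional pause only when it asked a question or hit a gate, produced more than 50 lines, and exited with code 0; everything else incomplete is treated as crashed; and a complete run with none of its declared outputs verified is a FAIL. The function names are hypothetical.

```python
def classify_status(session_complete: bool, line_count: int, asked_question: bool,
                    hit_hard_gate: bool, exit_code: int) -> str:
    """Map raw observations of a run to complete / intentional_pause / crashed."""
    if session_complete:
        return "complete"
    paused = (asked_question or hit_hard_gate) and line_count > 50 and exit_code == 0
    return "intentional_pause" if paused else "crashed"


def verdict_for(status: str, outputs_declared: int, outputs_verified: int) -> str:
    """Derive VERDICT from STATUS and the declared/verified output counts."""
    if status == "crashed":
        return "FAIL"      # a crash is never PARTIAL or PASS
    if status == "intentional_pause":
        return "PASS"      # gate behavior is correct
    if outputs_verified == outputs_declared:
        return "PASS"      # includes a no-op run with nothing declared
    if outputs_verified > 0:
        return "PARTIAL"   # some outputs present, but not all
    return "FAIL"          # assumption: zero verified outputs is a FAIL
```

Keeping classification and verdict separate mirrors the block layout: STATUS describes what happened, VERDICT describes whether that is acceptable.
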

Binary PASS/FAIL Contract

A run PASSES only if ALL of the following are true:

- At least 1 artifact is present at a declared OUTPUTS path
- HANDOFF_BLOCK contains all 7 required fields
- STATUS is not crashed
- EVOLUTION_VERIFICATION VERDICT is PASS (PARTIAL counts as FAIL for gating; it is logged but does not unblock the pipeline)

A run FAILS if any condition above is not met, OR if VERDICT is PARTIAL. PARTIAL means outputs are incomplete — this is a FAIL for any gating decision, even though it is logged separately for diagnostic purposes.
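As an illustration, the four gating conditions can be checked by one function that returns the unmet conditions instead of a bare boolean, so a report can say why a run was blocked. The name `gate_failures` and its parameters are hypothetical.

```python
def gate_failures(artifacts_present: int, handoff_fields_found: int,
                  status: str, verdict: str) -> list[str]:
    """Return the gating conditions a run fails; an empty list means the run PASSES."""
    failures = []
    if artifacts_present < 1:
        failures.append("no artifact present at a declared OUTPUTS path")
    if handoff_fields_found != 7:
        failures.append(f"HANDOFF_BLOCK has {handoff_fields_found} of 7 required fields")
    if status == "crashed":
        failures.append("STATUS is crashed")
    if verdict != "PASS":
        # PARTIAL lands here too: it is logged separately but still blocks the gate.
        failures.append(f"VERDICT is {verdict}, not PASS")
    return failures
```

For example, `gate_failures(1, 7, "complete", "PARTIAL")` returns a single entry about the verdict, which matches the rule that PARTIAL counts as FAIL for gating.
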

Adversarial threshold: When running WS-N failure injection scenarios (N-01 through N-06), the verifier must produce FAIL verdicts on at least 4 of 6 adversarial inputs. A verifier that passes all adversarial inputs is not operational — it is only checking the happy path.

Critical scenario requirement: N-04 (malformed run-config), N-05 (truncated plan), and N-06 (bad evals schema) MUST ALL produce FAIL verdicts. These test structural failures, not just crashes. A verifier that catches crashes (N-01/N-02/N-03) but misses structural failures (N-04/N-05/N-06) has a ceiling of 3/6 and is not detecting the important failure modes.
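The two adversarial rules (at least 4 of 6 FAIL verdicts, and N-04 through N-06 all FAIL) can be checked together. The sketch below assumes the verdicts are available as a mapping from scenario id to verdict string; that input shape and the function name are assumptions.

```python
ADVERSARIAL_IDS = ("N-01", "N-02", "N-03", "N-04", "N-05", "N-06")
STRUCTURAL_IDS = ("N-04", "N-05", "N-06")  # malformed run-config, truncated plan, bad evals schema


def adversarial_check(verdicts: dict[str, str]) -> tuple[bool, list[str]]:
    """Return (operational, problems) for a set of adversarial verdicts.

    A scenario missing from the mapping is treated as not having produced FAIL.
    """
    problems = []
    fail_count = sum(1 for sid in ADVERSARIAL_IDS if verdicts.get(sid) == "FAIL")
    if fail_count < 4:
        problems.append(f"only {fail_count} of 6 adversarial inputs produced FAIL (need at least 4)")
    for sid in STRUCTURAL_IDS:
        if verdicts.get(sid) != "FAIL":
            problems.append(f"{sid} did not produce FAIL")
    return (not problems, problems)
```

Note that the second rule is the stricter one in practice: a verifier can reach 5 of 6 and still be rejected if the one miss is a structural scenario.
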

Follow with the aggregate summary:

## Run Summary

Total: N scenarios
PASS: X
PARTIAL: Y
FAIL: Z

### Failed / Partial Tests
- TEST-N: <name> — <what specifically failed>

### Evolution Gaps Found
[For each FAIL/PARTIAL: classify as spec fix / new skill needed / new eval case]

### Recommended Actions
1. [Priority: Critical] Fix <gap> in os-architect-agent.md
2. [Priority: High] Add new eval case for <scenario>
3. [Priority: Medium] Create new skill <skill-name> for <capability>
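The counts at the top of the Run Summary are a simple tally of per-scenario verdicts. A minimal sketch, assuming the verdicts are held in a mapping from scenario id to verdict (the function name `run_summary` is hypothetical):

```python
from collections import Counter


def run_summary(verdicts: dict[str, str]) -> str:
    """Render the count section of the Run Summary from per-scenario verdicts."""
    counts = Counter(verdicts.values())  # missing verdict types count as 0
    lines = [
        "## Run Summary",
        "",
        f"Total: {len(verdicts)} scenarios",
        f"PASS: {counts['PASS']}",
        f"PARTIAL: {counts['PARTIAL']}",
        f"FAIL: {counts['FAIL']}",
    ]
    return "\n".join(lines)
```

The remaining sections (failed tests, evolution gaps, recommended actions) need judgment about what went wrong and are left to the report author.
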

Phase 6 — Persist to Experiment Log

After Phase 5 summary is written, always call os-experiment-log to persist the run:

python3 scripts/experiment_log.py append \
  --report temp/os-evolution-verifier/test-report.md \
  --triggered-by os-evolution-verifier

This is not optional. temp/ is ephemeral — if the log is not appended immediately after the run, the results are lost when the shell restarts. The experiment log is the durable record.

Scenario File Format

Test scenarios live in temp/os-evolution-verifier/scenarios/:

{
  "id": "TEST-1",
  "name": "Path C — monitoring agent gap fill",
  "category": 4,
  "path": "C",
  "prompt": "There's no skill for automatically monitoring plugin health and flagging stale evals — I want to create one.",
  "expected_artifact": "tasks/todo/copilot_prompt_",
  "artifact_check": "file_prefix",
  "expected_behavior": "os-architect classifies as Cat4 Gap Fill, runs audit, proposes Path C, dispatches os-evolution-planner to write task plan + copilot_prompt file",
  "failure_mode": "agent routes to wrong category or fails to dispatch os-evolution-planner"
}
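The document does not spell out how the `file_prefix` check is evaluated. One plausible reading for simulation mode, where the agent proposes files rather than writing them, is to look for path-like tokens in the output that begin with `expected_artifact`. The sketch below implements only that reading; the function name and the set of characters allowed in a path are assumptions.

```python
import re


def proposed_paths_with_prefix(output_text: str, prefix: str) -> list[str]:
    """Return distinct path-like tokens in output_text that start with prefix.

    Assumption: a path continues over letters, digits, '_', '.', '-' and '/'.
    A trailing period is dropped because it is usually sentence punctuation.
    The bare prefix with nothing after it does not count as a match.
    """
    pattern = re.compile(re.escape(prefix) + r"[\w.\-/]+")
    found = []
    for match in pattern.finditer(output_text):
        token = match.group(0).rstrip(".")
        if token not in found:
            found.append(token)
    return found
```

With the TEST-1 scenario above, the check would pass when `proposed_paths_with_prefix(output_text, "tasks/todo/copilot_prompt_")` returns at least one path.
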

Smoke Tests

Three fast verification cases to confirm the skill itself is working:

Smoke 1 — Heartbeat check: Run heartbeat only, confirm HEARTBEAT_OK in output. Expected: heartbeat.md non-empty, contains HEARTBEAT_OK. Time: <30s.

Smoke 2 — Single scenario dry run: Run TEST-1 (Path C gap fill). Confirm output file is >100 lines. Time: <3 min.

Smoke 3 — HANDOFF_BLOCK field scan: On an existing output file, run the 7-field grep. Confirm all 7 fields found. Time: <5s.

Gotchas

- Output files must be >100 lines: A Copilot CLI call that returns <50 lines usually means the model hit a refusal, the system prompt was too long, or the heartbeat was skipped. Always heartbeat first and always check wc -l before running artifact checks.
- Single-shot simulation ≠ real dispatch: os-architect in simulation mode cannot write files to disk (no Write tool access during simulation). Artifact checks for Path B/C test whether the agent PROPOSES the correct files in its output, not whether they exist on disk. Real file-existence checks only apply when os-architect is run with full tool access.
- HANDOFF_BLOCK field greps need the trailing colon: Use grep -q "FIELD:" not grep -q "FIELD"; otherwise partial matches on word fragments will produce false positives.
- Confidence model check is order-sensitive: The clarifying question must appear BEFORE any audit output. Line-number comparison is required; simple grep -q is insufficient.
- temp/ files are ephemeral, so distinguish a shell restart from a crash: If a run was interrupted by a shell restart and temp/copilot_output_*.md is missing, set STATUS: intentional_pause, VERDICT: PARTIAL (inconclusive), because the run never completed. If the file is present but < 50 lines AND run_agent.py returned non-zero, set STATUS: crashed, VERDICT: FAIL, because the agent halted unexpectedly. Never report a silent crash as PARTIAL.
- OUTPUTS field path normalization: HANDOFF_BLOCK OUTPUTS lists paths relative to project root. Normalize before checking (strip leading ./, resolve ~). A path mismatch between declared and actual is a schema drift signal, not a file-missing signal.
- Category 5 tests produce two sequential dispatches: When verifying Category 5 output, check that two separate PATH / TARGET pairs appear in HANDOFF_BLOCK, not one.
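The path normalization mentioned in the gotchas (strip a leading ./, resolve ~) might be sketched as below. It assumes OUTPUTS paths are POSIX-style and that an absolute path inside the project root should be reduced to its root-relative form; the function name and the `project_root` parameter are hypothetical.

```python
import posixpath


def normalize_output_path(declared: str, project_root: str) -> str:
    """Normalize a declared OUTPUTS path for comparison with actual paths.

    Expands '~', collapses './' and redundant separators, and returns a
    root-relative path when the result lies inside project_root.
    """
    path = posixpath.normpath(posixpath.expanduser(declared.strip()))
    root = posixpath.normpath(posixpath.expanduser(project_root))
    if posixpath.isabs(path) and (path == root or path.startswith(root + "/")):
        return posixpath.relpath(path, root)
    return path
```

After normalizing both the declared and the actual path, a remaining mismatch can be reported as schema drift rather than as a missing file.
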
