plugins/agent-agentic-os/skills/os-evolution-verifier/SKILL.md
Verifies that os-architect actually causes evolution — not just words. Dispatches os-architect in single-shot simulation mode for a given test scenario, then checks for real artifact presence (new files, HANDOFF_BLOCK, plan files). Reports PASS / FAIL with grep evidence. Accumulates results into a test report. Use after any changes to os-architect, os-evolution-planner, or improvement-intake-agent.
After evolving os-architect or its downstream agents, you need proof that the changes actually work. This skill dispatches os-architect in single-shot simulation mode for each test scenario and verifies artifact presence — not by reading the transcript, but by checking that expected files exist or expected content appears in output.
Evolution is verified by artifact presence, not by transcript review.
| Evolution Type | What to Check |
|---|---|
| Path C (Gap Fill) | SKILL.md present at expected path |
| Path B (Update) | tasks/todo/<slug>-plan.md AND tasks/todo/copilot_prompt_<slug>.md written |
| Path A+ (No-op) | No new files written; HANDOFF_BLOCK contains STATUS: complete |
| Category 3 (Lab Setup) | improvement/run-config.json written AND HANDOFF_BLOCK emitted |
| HANDOFF_BLOCK integrity | All 7 fields present: INTENT, TARGET, PATH, DISPATCH, STATUS, OUTPUTS, NEXT_ACTION |
| Confidence model | Low confidence prompt → clarifying question appears before Phase 2 audit |
If invoked with all, find test scenarios:
ls temp/os-evolution-verifier/scenarios/*.json 2>/dev/null | sort
If invoked with a specific file, verify it exists and is valid JSON with required fields:
python3 -c "
import json, sys
d = json.load(open('$SCENARIO_FILE'))
required = ['id', 'name', 'path', 'prompt', 'expected_artifact', 'artifact_check']
missing = [f for f in required if f not in d]
if missing:
    print(f'SCHEMA ERROR: missing fields: {missing}'); sys.exit(1)
print(f'Scenario: {d[\"id\"]} — {d[\"name\"]}')
"
If no scenarios found and no file given, report:
"No test scenarios found. Create scenario JSON files in
temp/os-evolution-verifier/scenarios/ or run the red-team-bundler to generate them from os-architect-agent.md."
For each scenario, dispatch os-architect via Copilot CLI in simulation mode.
The system prompt is the full content of plugins/agent-agentic-os/agents/os-architect-agent.md.
The user turn is the scenario prompt.
# 1. Heartbeat (free model — always first)
python3 plugins/cli-agents/skills/copilot-cli-agent/scripts/run_agent.py \
/dev/null /dev/null temp/os-evolution-verifier/heartbeat.md \
"HEARTBEAT CHECK: Respond HEARTBEAT_OK only."
# Confirm heartbeat before dispatching
grep -q "HEARTBEAT_OK" temp/os-evolution-verifier/heartbeat.md || \
{ echo "HEARTBEAT FAILED — aborting test run"; exit 1; }
# 2. Dispatch os-architect in single-shot simulation mode
OUTPUT_FILE="temp/os-evolution-verifier/output_${SCENARIO_ID}.md"
python3 plugins/cli-agents/skills/copilot-cli-agent/scripts/run_agent.py \
plugins/agent-agentic-os/agents/os-architect-agent.md \
/dev/null \
"$OUTPUT_FILE" \
"$SCENARIO_PROMPT" \
claude-sonnet-4.6
Wait for completion. Check output file is non-empty (expect 100+ lines for a real run):
wc -l "$OUTPUT_FILE"
Run the artifact check specified in the scenario's artifact_check field.
# All 7 required fields must appear in output
FIELDS=("INTENT:" "TARGET:" "PATH:" "DISPATCH:" "STATUS:" "OUTPUTS:" "NEXT_ACTION:")
MISSING=()
for field in "${FIELDS[@]}"; do
grep -q "$field" "$OUTPUT_FILE" || MISSING+=("$field")
done
if [ ${#MISSING[@]} -eq 0 ]; then
echo "PASS: HANDOFF_BLOCK has all 7 required fields"
else
echo "FAIL: HANDOFF_BLOCK missing: ${MISSING[*]}"
fi
# Check for expected artifact files written by os-evolution-planner
EXPECTED_FILE="$ARTIFACT_PATH"
if [ -f "$EXPECTED_FILE" ]; then
echo "PASS: Artifact found at $EXPECTED_FILE"
wc -l "$EXPECTED_FILE"
else
echo "FAIL: Expected artifact not found: $EXPECTED_FILE"
fi
# Verify STATUS: complete in HANDOFF_BLOCK and no new plan files created
grep -q "STATUS: complete" "$OUTPUT_FILE" && echo "PASS: Status is complete" || echo "FAIL: Status not complete"
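# Caveat: -newer compares against the mtime of $OUTPUT_FILE, so only files modified after the output was written are counted
PLAN_COUNT=$(find tasks/todo -name "*.md" -newer "$OUTPUT_FILE" 2>/dev/null | wc -l)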
[ "$PLAN_COUNT" -eq 0 ] && echo "PASS: No new task files written" || echo "FAIL: $PLAN_COUNT unexpected task files created"
# Low confidence prompt must produce a clarifying question before Phase 2
grep -q "Confidence: Low" "$OUTPUT_FILE" && echo "PASS: Confidence: Low detected" || echo "FAIL: Confidence field not Low"
# Check that Phase 2 audit was NOT started (no "Checking existing" or "audit" language before clarification)
CLARIFICATION_LINE=$(grep -n "?" "$OUTPUT_FILE" | head -1 | cut -d: -f1)
AUDIT_LINE=$(grep -n "Checking existing\|audit\|Phase 2" "$OUTPUT_FILE" | head -1 | cut -d: -f1)
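# FAIL when no question was asked at all; PASS when the question precedes any audit language
if [ -n "$CLARIFICATION_LINE" ] && { [ -z "$AUDIT_LINE" ] || [ "$CLARIFICATION_LINE" -lt "$AUDIT_LINE" ]; }; then
    echo "PASS: Clarifying question appeared before audit"
else
    echo "FAIL: Audit started before clarifying question, or no clarifying question found"
fi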
Append to temp/os-evolution-verifier/test-report.md:
## $SCENARIO_ID — $SCENARIO_NAME
**Status**: [PASS | FAIL]
**Path**: [A / A+ / B / C]
**Prompt**: `$SCENARIO_PROMPT`
**Artifact check**: $ARTIFACT_CHECK_COMMAND
**Evidence**:
[grep or file-exists output]
**Failure mode tested**: $FAILURE_MODE
**Time**: $ELAPSED seconds
---
After all scenarios run, write summary to temp/os-evolution-verifier/test-report.md.
Each scenario result uses the structured EVOLUTION_VERIFICATION block:
## EVOLUTION_VERIFICATION
SESSION_ID: [from HANDOFF_BLOCK TARGET field or scenario id]
SESSION_COMPLETE: [true | false — false means session still in Phase 1/2, no HANDOFF_BLOCK expected]
STATUS: [complete | intentional_pause | crashed]
PATH: [A | A+ | B | C | pending]
OUTPUTS_DECLARED: [N — count of files mentioned in HANDOFF_BLOCK OUTPUTS field]
OUTPUTS_VERIFIED: [N — count that passed artifact check]
OUTPUTS_MISSING: [list of missing file paths, or "none"]
HANDOFF_BLOCK_VALID: [true | false | N/A — N/A when SESSION_COMPLETE: false]
SCAFFOLD_VALID: [true | false | N/A]
PLAN_WRITTEN: [true | false | N/A]
DISPATCH_RAN: [true | false | N/A]
VERDICT: [PASS | PARTIAL | FAIL]
NOTES: [any file-level anomalies or ordering violations]
STATUS field values — required, disambiguates SESSION_COMPLETE: false:
| STATUS | When to use | VERDICT |
|--------|-------------|---------|
| complete | SESSION_COMPLETE: true; HANDOFF_BLOCK present and valid | PASS or PARTIAL |
| intentional_pause | SESSION_COMPLETE: false; agent asked a clarifying question or hit a documented HARD-GATE; output > 50 lines | PASS (gate behavior is correct) |
| crashed | SESSION_COMPLETE: false; output < 50 lines, no clarifying question, no HANDOFF_BLOCK, or run_agent.py returned non-zero | FAIL |
When SESSION_COMPLETE: false and STATUS: intentional_pause, HANDOFF_BLOCK_VALID must be N/A —
a missing HANDOFF_BLOCK is expected behavior, not a schema violation.
When SESSION_COMPLETE: false and STATUS: crashed, VERDICT must be FAIL regardless of
any other fields — a silent crash must never be reported as PARTIAL or PASS.
Use PARTIAL when some outputs are present but not all — it pinpoints exactly which workstream failed rather than collapsing everything into a binary pass/fail.
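The STATUS and VERDICT rules above can be sketched as a single decision function. This is a minimal sketch, not the skill's actual implementation: `classify_run` and its parameters are hypothetical names, and treating "HANDOFF_BLOCK present but zero declared outputs verified" as FAIL is an assumption the rules above do not state explicitly.

```python
def classify_run(line_count, has_handoff_block, asked_question, exit_code,
                 outputs_declared=0, outputs_verified=0):
    """Return (STATUS, VERDICT) for one scenario run.

    All parameter names are hypothetical; they stand for facts the verifier
    has already collected (wc -l, the 7-field grep, the run_agent.py exit code).
    """
    # A non-zero exit from run_agent.py is always a crash.
    if exit_code != 0:
        return ("crashed", "FAIL")

    if has_handoff_block:
        if outputs_verified >= outputs_declared:
            return ("complete", "PASS")
        if outputs_verified > 0:
            # Some outputs present but not all.
            return ("complete", "PARTIAL")
        # ASSUMPTION: nothing verified out of a non-empty declaration is a FAIL.
        return ("complete", "FAIL")

    # No HANDOFF_BLOCK: only a real pause (question or HARD-GATE, > 50 lines) passes.
    if asked_question and line_count > 50:
        return ("intentional_pause", "PASS")

    # Short output, no question, no HANDOFF_BLOCK: never report as PARTIAL or PASS.
    return ("crashed", "FAIL")
```

When STATUS comes back as `intentional_pause`, the caller would set HANDOFF_BLOCK_VALID to N/A rather than false, as described above.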
A run PASSES only if ALL of the following are true:
crashed
A run FAILS if any condition above is not met, or if VERDICT is PARTIAL. PARTIAL means outputs are incomplete: this is a FAIL for any gating decision, even though it is logged separately for diagnostic purposes.
Adversarial threshold: When running WS-N failure injection scenarios (N-01 through N-06), the verifier must produce FAIL verdicts on at least 4 of 6 adversarial inputs. A verifier that passes all adversarial inputs is not operational — it is only checking the happy path.
Critical scenario requirement: N-04 (malformed run-config), N-05 (truncated plan), and N-06 (bad evals schema) MUST ALL produce FAIL verdicts. These test structural failures, not just crashes. A verifier that catches crashes (N-01/N-02/N-03) but misses structural failures (N-04/N-05/N-06) has a ceiling of 3/6 and is not detecting the important failure modes.
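The gating rule and the adversarial threshold can be checked mechanically once verdicts are collected. The sketch below assumes verdicts are held in a dict keyed by scenario id; `gate_passes` and `verifier_is_operational` are hypothetical helper names, and only a literal FAIL verdict is counted toward the threshold (PARTIAL is not).

```python
# Structural-failure scenarios that must all produce FAIL.
STRUCTURAL_SCENARIOS = {"N-04", "N-05", "N-06"}


def gate_passes(verdict):
    """PARTIAL is logged separately but counts as a failure for gating."""
    return verdict == "PASS"


def verifier_is_operational(adversarial_verdicts):
    """Check the WS-N threshold: at least 4 of 6 FAIL, including N-04..N-06.

    adversarial_verdicts maps scenario ids (N-01 .. N-06) to verdict strings.
    """
    failed = {sid for sid, verdict in adversarial_verdicts.items()
              if verdict == "FAIL"}
    return len(failed) >= 4 and STRUCTURAL_SCENARIOS <= failed
```

A verifier that fails only N-01 through N-03 returns False here, which matches the 3/6 ceiling described above.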
Follow with the aggregate summary:
## Run Summary
Total: N scenarios
PASS: X
PARTIAL: Y
FAIL: Z
### Failed / Partial Tests
- TEST-N: <name> — <what specifically failed>
### Evolution Gaps Found
[For each FAIL/PARTIAL: classify as spec fix / new skill needed / new eval case]
### Recommended Actions
1. [Priority: Critical] Fix <gap> in os-architect-agent.md
2. [Priority: High] Add new eval case for <scenario>
3. [Priority: Medium] Create new skill <skill-name> for <capability>
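The count lines of the summary can be produced from the collected verdicts. This is a small sketch under the assumption that verdicts are kept in a dict of scenario id to verdict string; `render_run_summary` is a hypothetical name, and the gap classification and recommended actions still need human or agent judgment.

```python
from collections import Counter


def render_run_summary(verdicts):
    """Render the count section of the Run Summary.

    verdicts maps scenario ids to "PASS", "PARTIAL", or "FAIL".
    """
    counts = Counter(verdicts.values())  # missing keys count as 0
    lines = [
        "## Run Summary",
        f"Total: {len(verdicts)} scenarios",
        f"PASS: {counts['PASS']}",
        f"PARTIAL: {counts['PARTIAL']}",
        f"FAIL: {counts['FAIL']}",
    ]
    return "\n".join(lines)
```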
After Phase 5 summary is written, always call os-experiment-log to persist the run:
python3 scripts/experiment_log.py append \
--report temp/os-evolution-verifier/test-report.md \
--triggered-by os-evolution-verifier
This is not optional. temp/ is ephemeral — if the log is not appended immediately after
the run, the results are lost when the shell restarts. The experiment log is the durable record.
Test scenarios live in temp/os-evolution-verifier/scenarios/:
{
"id": "TEST-1",
"name": "Path C — monitoring agent gap fill",
"category": 4,
"path": "C",
"prompt": "There's no skill for automatically monitoring plugin health and flagging stale evals — I want to create one.",
"expected_artifact": "tasks/todo/copilot_prompt_",
"artifact_check": "file_prefix",
"expected_behavior": "os-architect classifies as Cat4 Gap Fill, runs audit, proposes Path C, dispatches os-evolution-planner to write task plan + copilot_prompt file",
"failure_mode": "agent routes to wrong category or fails to dispatch os-evolution-planner"
}
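The scenario above sets `artifact_check` to `file_prefix` without defining it. One plausible reading, shown here purely as an assumption, is "at least one file exists whose path starts with `expected_artifact`". `check_file_prefix` is a hypothetical helper, and an on-disk check like this only makes sense when os-architect runs with full tool access.

```python
import glob
import os


def check_file_prefix(expected_artifact, root="."):
    """Return files under root whose relative path starts with expected_artifact.

    ASSUMPTION: this interprets artifact_check == "file_prefix" as a path-prefix
    match; the scenario format itself does not define the semantics.
    """
    # Escape glob metacharacters in the prefix, then match any suffix.
    pattern = glob.escape(os.path.join(root, expected_artifact)) + "*"
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
```

An empty list would map to FAIL for the scenario; a non-empty list gives the file paths to quote as evidence in the report.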
Three fast verification cases to confirm the skill itself is working:
Smoke 1 — Heartbeat check: Run heartbeat only, confirm HEARTBEAT_OK in output.
Expected: heartbeat.md non-empty, contains HEARTBEAT_OK. Time: <30s.
Smoke 2 — Single scenario dry run: Run TEST-1 (Path C gap fill). Confirm output file
is >100 lines. Time: <3 min.
Smoke 3 — HANDOFF_BLOCK field scan: On an existing output file, run the 7-field grep. Confirm all 7 fields found. Time: <5s.
Output files must be >100 lines: A Copilot CLI call that returns <50 lines usually means
the model hit a refusal, the system prompt was too long, or the heartbeat was skipped.
Always heartbeat first and always check wc -l before running artifact checks.
Single-shot simulation ≠ real dispatch: os-architect in simulation mode cannot write files to disk (no Write tool access during simulation). Artifact checks for Path B/C test whether the agent PROPOSES the correct files in its output, not whether they exist on disk. Real file-existence checks only apply when os-architect is run with full tool access.
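In simulation mode the check therefore runs against the transcript rather than the filesystem. A minimal sketch of that idea follows; `proposed_paths` is a hypothetical helper, and the character class used to delimit a path is an assumption about how paths appear in the output.

```python
import re


def proposed_paths(output_text, prefix):
    """Return distinct path-like tokens in the output that start with prefix."""
    # ASSUMPTION: paths consist of word characters, dots, dashes, and slashes.
    pattern = re.escape(prefix) + r"[\w.\-/]*"
    # Drop a trailing sentence period that the pattern may have swallowed.
    return sorted({match.rstrip(".") for match in re.findall(pattern, output_text)})
```

For a Path B or Path C scenario, an empty result means the agent never proposed the expected file, which is the simulation-mode equivalent of a missing artifact.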
HANDOFF_BLOCK field names need the trailing colon for grep: Use grep -q "FIELD:" not grep -q "FIELD",
otherwise partial matches on word fragments will produce false positives.
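The false-positive risk is easy to reproduce. The sketch below goes one step further than the colon suffix and also anchors each field at the start of a line. That anchoring is an assumption about the HANDOFF_BLOCK layout (one `FIELD: value` per line), not something the block format is confirmed to guarantee; `missing_fields` is a hypothetical helper.

```python
import re

FIELDS = ["INTENT", "TARGET", "PATH", "DISPATCH", "STATUS", "OUTPUTS", "NEXT_ACTION"]


def missing_fields(output_text):
    """Return HANDOFF_BLOCK field names with no 'FIELD:' line in the output."""
    missing = []
    for name in FIELDS:
        # ASSUMPTION: each field starts its own line, optionally indented.
        if not re.search(rf"^[ \t]*{name}:", output_text, flags=re.MULTILINE):
            missing.append(name)
    return missing


# Prose that mentions field names without emitting a HANDOFF_BLOCK.
prose = "The STATUS of the TARGETS is unclear."
bare_match = "STATUS" in prose            # True: a bare substring check is fooled
strict_missing = missing_fields(prose)    # all 7 fields reported missing
```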
Confidence model check is order-sensitive: The clarifying question must appear BEFORE any
audit output. Line-number comparison is required; simple grep -q is insufficient.
temp/ files are ephemeral — distinguish shell restart from crash: If a run was
interrupted by a shell restart and temp/copilot_output_*.md is missing, set
STATUS: intentional_pause, VERDICT: PARTIAL (inconclusive) — the run never completed.
If the file is present but < 50 lines AND run_agent.py returned non-zero, set
STATUS: crashed, VERDICT: FAIL — the agent halted unexpectedly. Never report a
silent crash as PARTIAL.
OUTPUTS field path normalization: HANDOFF_BLOCK OUTPUTS lists paths relative to project
root. Normalize before checking (strip leading ./, resolve ~). A path mismatch between
declared and actual is a schema drift signal, not a file-missing signal.
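Normalization can be sketched with the standard library alone. `normalize_output_path` is a hypothetical helper; it assumes POSIX-style paths and that absolute or `~` paths in OUTPUTS are meant to point inside the project root.

```python
import os


def normalize_output_path(path, project_root):
    """Normalize a declared OUTPUTS path to a project-root-relative form."""
    # Resolve a leading "~" to the home directory.
    expanded = os.path.expanduser(path.strip())
    # ASSUMPTION: absolute paths refer to files inside project_root.
    if os.path.isabs(expanded):
        expanded = os.path.relpath(expanded, project_root)
    # normpath strips a leading "./" and collapses "//" and "/./".
    return os.path.normpath(expanded)
```

If the normalized declared path still differs from the path of a file that does exist, that points to schema drift in the OUTPUTS field rather than a missing artifact.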
Category 5 tests produce two sequential dispatches: When verifying Category 5 output, check that two separate PATH / TARGET pairs appear in HANDOFF_BLOCK, not one.
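Counting the pairs can be sketched as follows. `count_dispatch_pairs` is a hypothetical helper, and it assumes each `PATH:` and `TARGET:` entry sits on its own line with a non-empty value; the real HANDOFF_BLOCK layout for Category 5 may differ.

```python
import re


def count_dispatch_pairs(handoff_text):
    """Count PATH / TARGET pairs in a HANDOFF_BLOCK.

    Line-start anchoring keeps 'DISPATCH:' and prose mentions from matching.
    """
    paths = re.findall(r"^[ \t]*PATH:[ \t]*\S+", handoff_text, flags=re.MULTILINE)
    targets = re.findall(r"^[ \t]*TARGET:[ \t]*\S+", handoff_text, flags=re.MULTILINE)
    # A pair needs both halves, so the smaller count is the number of pairs.
    return min(len(paths), len(targets))
```

A Category 5 scenario would then be expected to return 2; a result of 1 suggests the second dispatch was dropped or merged into the first.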