plugins/agent-agentic-os/skills/os-experiment-log/SKILL.md
Maintains a persistent, folder-based log of all agentic-os experiment runs. Each run writes one dated file to context/experiment-log/ and updates index.md. Supports five source types: verifier (qualitative), tester (qualitative), orchestrator (numeric), planner (qualitative), survey (mixed). Handles both numeric results (eval scores, KEEP/DISCARD, delta) and qualitative results (PASS/FAIL/PARTIAL, gap analysis). Use after any experiment run to persist findings before temp/ is cleared.
The experiment log is the unified cross-cutting record for all agentic-os experiments.
One file per run, all files in context/experiment-log/, with index.md as a
queryable table of all runs.
context/experiment-log/
  index.md ← one row per run (date, source, target, verdict)
  2026-04-25-verifier-os-architect-round1.md ← from os-evolution-verifier
  2026-04-25-tester-os-architect.md ← from os-architect-tester
  2026-04-25-os-improvement-loop-os-eval-runner.md ← from os-improvement-loop
  2026-04-25-planner-0024.md ← from os-evolution-planner
  2026-04-25-survey-session.md ← from post_run_survey
Agents must check result_type in a log entry's header before parsing it:
| --source-type | Produced by | result_type | Key fields |
|---|---|---|---|
| verifier | os-evolution-verifier | qualitative | PASS/PARTIAL/FAIL counts, HANDOFF_BLOCK validity |
| tester | os-architect-tester | qualitative | AC-1–4 pass/fail per scenario |
| orchestrator | os-improvement-loop | numeric | best_score, baseline, delta, KEEP/DISCARD counts |
| planner | os-evolution-planner | qualitative | workstream count, gaps identified |
| survey | post_run_survey | mixed | friction item count, north_star metric |
Numeric entries (result_type: numeric) carry quantitative metrics suitable for trending and charting.
Qualitative entries (result_type: qualitative) carry pass/fail verdicts and gap analysis prose.
Mixed entries (result_type: mixed) carry both — agents must check which fields are present before parsing.
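The rule above can be sketched as a small guard that reads result_type first and only then touches numeric fields. This is a minimal illustration, not the skill's own parser. It assumes the header and body fields have already been loaded into plain dicts, and the helper name `numeric_fields` is hypothetical. The key names come from the table above.

```python
# Numeric keys named in the source-type table; how they are stored is an assumption.
NUMERIC_KEYS = ("best_score", "baseline", "delta", "north_star")
KNOWN_RESULT_TYPES = ("numeric", "qualitative", "mixed")


def numeric_fields(header: dict, fields: dict) -> dict:
    """Return only the numeric metrics that are safe to read for this entry."""
    result_type = header.get("result_type")
    if result_type not in KNOWN_RESULT_TYPES:
        raise ValueError(f"unknown result_type: {result_type!r}")
    if result_type == "qualitative":
        # Qualitative entries carry verdicts and prose, so expose no metrics.
        return {}
    metrics = {}
    for key in NUMERIC_KEYS:
        # Check presence before parsing; mixed entries may omit any of these.
        if key not in fields:
            continue
        try:
            metrics[key] = float(fields[key])
        except (TypeError, ValueError):
            continue  # Skip values that are present but not numeric.
    return metrics
```

A mixed survey entry with one numeric value and one prose field yields only the numeric value, and a qualitative entry yields an empty dict.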
Read the argument or invocation context to determine mode:
append --source-type TYPE: log a new run from a completed experiment
query <term>: search all files in context/experiment-log/ by keyword
summary: print aggregate stats across all runs, broken down by source type
# After os-evolution-verifier run
python3 scripts/experiment_log.py append \
--source-type verifier \
--report temp/os-evolution-verifier/test-report.md \
--session-id 2026-04-25-round1 \
--target os-architect \
--triggered-by os-evolution-verifier
# After os-architect-tester run
python3 scripts/experiment_log.py append \
--source-type tester \
--report temp/test_report_consolidated.md \
--session-id 2026-04-25-tester \
--target os-architect \
--triggered-by os-architect-tester
# After os-improvement-loop run (numeric — has score delta)
python3 scripts/experiment_log.py append \
--source-type orchestrator \
--report temp/logs/run-log.md \
--session-id 2026-04-25-os-eval-runner \
--target os-eval-runner \
--triggered-by os-improvement-loop
# After os-evolution-planner writes a task plan
python3 scripts/experiment_log.py append \
--source-type planner \
--report tasks/todo/0024-plan.md \
--session-id 0024 \
--target os-eval-runner \
--triggered-by os-evolution-planner
# After a post-run survey
python3 scripts/experiment_log.py append \
--source-type survey \
--session-id 2026-04-25-session \
--target session \
--triggered-by human
# Query by term
python3 scripts/experiment_log.py query T2-D
python3 scripts/experiment_log.py query FAIL
python3 scripts/experiment_log.py query numeric
# Aggregate summary
python3 scripts/experiment_log.py summary
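When append is called from another script rather than a shell, the command lines above can be assembled programmatically. The sketch below is one possible wrapper, not part of the skill. The function name `build_append_argv` is hypothetical, and it uses only the flags shown in the examples above. It treats --report as optional because the survey example omits it.

```python
from typing import List, Optional

# The five source types listed in this document.
SOURCE_TYPES = ("verifier", "tester", "orchestrator", "planner", "survey")


def build_append_argv(
    source_type: str,
    session_id: str,
    target: str,
    triggered_by: str,
    report: Optional[str] = None,
) -> List[str]:
    """Build the argument list for an append call without running it."""
    if source_type not in SOURCE_TYPES:
        raise ValueError(f"unsupported source type: {source_type!r}")
    argv = ["python3", "scripts/experiment_log.py", "append",
            "--source-type", source_type]
    if report is not None:
        # The survey example has no --report, so only add it when given.
        argv += ["--report", report]
    argv += ["--session-id", session_id,
             "--target", target,
             "--triggered-by", triggered_by]
    return argv
```

Passing the returned list to `subprocess.run(argv, check=True)` would raise if the script exits with a non-zero status. The list form also avoids shell quoting issues in paths.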
After append:
tail -5 context/experiment-log/index.md
Report: "Logged to context/experiment-log/<filename>. Index updated."
After query: relay matching file names and their header blocks (date, source, target, verdict).
After summary: print the per-source-type breakdown verbatim.
Each file has a YAML-like header fence followed by the full report:
---
type: verifier
result_type: qualitative
date: 2026-04-25 15:12
session_id: 2026-04-25-round1
source: os-evolution-verifier
target: os-architect
verdict: 8P/0Pa/0F of 8
---
## Experiment — 2026-04-25 15:12 | verifier | os-architect
| Field | Value |
...
[full report content]
### Actions Taken
_[fill in: spec fix, new eval, new skill]_
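A reader of these files needs the header fields before anything else. The following is a minimal sketch of reading the fence shown above, assuming it always opens on the first line with `---` and uses simple `key: value` lines. The function name `parse_header` is hypothetical, and the real script may parse the header differently.

```python
def parse_header(text: str) -> dict:
    """Parse the leading ----fenced header of a log entry into a dict of strings."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening --- fence")
    header = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return header  # Closing fence reached; the report body follows.
        # Split on the first colon only, so values like "15:12" stay intact.
        key, sep, value = line.partition(":")
        if sep:
            header[key.strip()] = value.strip()
    raise ValueError("missing closing --- fence")
```

Splitting on the first colon matters here because the date field contains a time with its own colon.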
Smoke 1 — Append verifier: Run python3 scripts/experiment_log.py append --source-type verifier.
Confirm new .md file appears in context/experiment-log/ and index.md has a new row.
Smoke 2 — Query: Run python3 scripts/experiment_log.py query PASS.
Confirm output lists at least one matching file with its header.
Smoke 3 — Summary by type: Run python3 scripts/experiment_log.py summary.
Confirm output shows [verifier], [orchestrator] etc. sections with correct run counts.
Never parse result_type: mixed with numeric-only logic: The survey source type
contains both friction prose and numeric north_star values. Always check result_type
in the file header before assuming field presence.
temp/ is ephemeral: Call append immediately after a run completes, before any
shell restart. The script exits with an error if the report file is missing rather than
appending empty data.
Actions Taken is human-filled: The script writes a placeholder. An experiment log without response actions is an audit trail, not a learning record. Fill it in before the next run.
Duplicate index rows: If append is called twice for the same session, two rows
appear in index.md. This is intentional (the file is append-only) but worth noting
when querying.
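Because duplicate rows are kept on purpose, a consumer that wants one row per run has to collapse them at read time. The sketch below shows one way to do that, assuming the index rows have already been parsed into dicts. The function name `latest_rows` is hypothetical, and which fields identify "the same session" is an assumption the caller must supply.

```python
from typing import Dict, Iterable, List, Sequence


def latest_rows(rows: Iterable[Dict[str, str]], key_fields: Sequence[str]) -> List[Dict[str, str]]:
    """Keep the last row seen for each key, ordered by first appearance of the key."""
    latest: Dict[tuple, Dict[str, str]] = {}
    for row in rows:
        key = tuple(row.get(field) for field in key_fields)
        # Later rows overwrite earlier ones; the index file itself is never modified.
        latest[key] = row
    return list(latest.values())
```

Keeping the last row reflects the most recent append. The underlying index.md stays append-only, so the full history remains available to queries that want it.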