plugins/agent-agentic-os/skills/os-eval-backport/SKILL.md
Reviews a completed os-eval-runner lab run and backports approved changes to master plugin sources. Trigger with "backport the eval results", "review the lab run", "apply eval improvements to master", "check what the eval agent changed".
npx skillsauth add richfrem/agent-plugins-skills os-eval-backportInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You are the Lab-to-Master Handoff Agent. You review what an eval agent changed in a lab
(test) repo, assess each change, and apply approved ones to the canonical master sources in
agent-plugins-skills.
Never blind-copy. Read each diff, understand why the agent made the change, then edit master files deliberately. Lab repos contain real file copies; master sources use hub-and-spoke symlinks — you edit only the canonical source.
Q1 — Lab repo path?
The local path to the test repo where the eval ran (e.g. <USER_HOME>/Projects/test-link-checker-eval).
Q2 — Master plugin path?
The canonical plugin path in agent-plugins-skills (e.g. .agents/skills/link-checker).
Q3 — Baseline commit?
The git SHA of the baseline commit in the lab repo. Look for a commit starting with
baseline: in git log. If not provided: run git log --oneline in the lab repo and show it.
Confirm before proceeding:
Lab repo: /path/to/test-repo
Master plugin: plugins/<plugin-name>
Baseline commit: <sha> ("baseline: initial evaluation snapshot")
ls <lab-repo>/LOG_PROGRESS.md
cat <lab-repo>/LOG_PROGRESS.md
ls <lab-repo>/temp/logs/
Read the progress table first to understand the iteration history at a glance. Then read the run log for specific technical decisions. Note:
cd <lab-repo>
git log --oneline <baseline-commit>..HEAD
git diff <baseline-commit> HEAD --name-only
git diff <baseline-commit> HEAD
For each changed file, note what changed, why (from the run log), and whether it generalizes to master or was eval-specific.
Produce an assessment table for the user before applying anything:
| File | Change summary | Verdict | Reason |
|:---|:---|:---|:---|
| link-checker/skills/link-checker-agent/SKILL.md | Added --dry-run clarification | ACCEPT | Factually correct, improves clarity |
| link-checker/skills/link-checker-agent/evals/evals.json | Added eval-8 (ambiguous match) | ACCEPT | Good coverage gap |
| .agents/skills/os-eval-runner/evaluate.py | Changed exit code logic | REVIEW | Needs testing against master version |
Verdicts:
Present this table and get explicit approval before applying any change.
For each ACCEPT or ADAPT that the user approves:
cd <APS_ROOT>
git status
git add plugins/<plugin>/...
git commit -m "backport(<plugin>): <summary of accepted changes>"
If the lab agent is still running or recently completed, ask it targeted questions to surface operational knowledge that won't appear in diffs or logs. This is how eval infrastructure improves — the agent that ran the loop has first-hand friction data the backport reviewer can't see.
Ask the user to relay these questions (or ask directly if in the same session):
Always ask:
copilot_proposer_prompt.md when you did second-order mutations? Paste the full evolved file."Ask if the loop stalled: 4. "When you used Step B.2 (web research or Copilot brainstorm), what did you search for and what was the result?" 5. "What bridge words did you discover? Add them to the Trap Warning section if not already there."
Ask if the environment was reset mid-run: 6. "What happened to the baseline state? Was the Cold Start protocol sufficient to recover?"
Incorporate any new operational findings into the relevant templates and skills before Phase 6.
Report to the user:
Every completed backport session produces knowledge worth preserving. Two destinations, two scopes:
Check whether the Agentic OS is initialized in the master repo:
ls context/kernel.py 2>/dev/null && echo "OS present" || echo "OS absent"
If OS is present — delegate to os-memory-manager to write the dated session log:
Invoke os-memory-manager to write a session log for the eval backport session just completed.
Include: skill optimized, baseline vs final score, files backported, changes rejected and why,
and any snags or non-obvious findings from the run log or self-assessment survey.
This writes to context/memory/YYYY-MM-DD.md — tracked in git, not gitignored like temp/.
If OS is absent — write the session log directly:
mkdir -p context/memory
File: context/memory/YYYY-MM-DD.md using this template:
# Session Log: YYYY-MM-DD — Eval Backport: <skill-name>
## What Was Done
- Optimized <skill> from score <baseline> → <final> over <N> iterations
- Backported: [list of accepted files and what changed]
- Rejected: [list with reasons]
## Snags Encountered
- [Any errors, workarounds, or unexpected behaviors from the run log]
## Key Decisions
- [Any ADAPT choices — what was changed from the lab version and why]
## Open Items
- [ ] [Follow-up rounds, coverage gaps, improvements to evals or skill]
Apply a non-obvious filter before writing anything. Ask:
"Would a future agent following the eval workflow get burned by not knowing this?"
Write a memory entry only if the session produced at least one of:
Skip memory promotion for:
If the filter passes, write to the agent's memory directory using the feedback type:
File: memory/feedback_eval_<skill-name>_<topic>.md
---
name: feedback_eval_<skill-name>_<topic>
description: <one-line hook for MEMORY.md index>
type: feedback
---
<rule/finding>
**Why:** <what happened that surfaced this>
**How to apply:** <when this matters in future eval runs>
Then add a pointer line to MEMORY.md.
If the OS is initialized and the non-obvious filter passed, also ask os-memory-manager to promote the finding as a long-term fact to context/memory.md with a deduplication ID.
| Lab file | Master source |
|:---|:---|
| <plugin>/skills/<skill>/SKILL.md | plugins/<plugin>/skills/<skill>/SKILL.md |
| <plugin>/skills/<skill>/evals/evals.json | plugins/<plugin>/skills/<skill>/evals/evals.json |
| <plugin>/skills/<skill>/references/*.md | plugins/<plugin>/skills/<skill>/references/*.md |
| <plugin>/scripts/*.py | plugins/<plugin>/scripts/*.py |
| .agents/skills/os-eval-runner/ (if patched) | <SKILL_PATH> |
The master uses hub-and-spoke symlinks. Only the canonical source files listed above need updating — deployed environments sync from master automatically.
data-ai
Task management agent. Auto-invoked for task creation, status tracking, and kanban board operations using Markdown files across lane directories. V2 enforces Kanban Sovereignty constraints preventing manual task file edits.
development
Create, audit, repair, and document cross-platform symlinks that work correctly on both Windows and macOS/Linux. Use this skill whenever the user mentions symlinks, symbolic links, junction points, .gitconfig symlinks, broken links after git pull, cross-platform path issues, or needs help with ln -s equivalents on Windows. Also trigger when the user reports that files are missing or wrong after switching between Mac and Windows machines using Git. This skill solves the common problem where symlinks committed on macOS show up as plain text files on Windows (and vice versa) because of Git's core.symlinks setting or missing Developer Mode / elevated permissions. **IMPORTANT FOR WINDOWS USERS:** Developer Mode must be enabled before creating symlinks. Without it, Git will check out symlinks as plain-text files or hardlinks, breaking cross-platform workflows.
development
Interactively prepares a targeted Red Team Review package. It conducts a brief discovery interview to determine the threat model, generates a strict security auditor prompt, compiles a manifest of relevant project files, and bundles them into a single Markdown artifact or ZIP archive ready for an external LLM (like Grok, ChatGPT, or Gemini) or a human reviewer.
tools
Reduces AI agent context bloat across three dimensions: (1) duplicate skill deduplication — clears stale agent directory copies since the IDE already reads from plugins/ directly; (2) instruction file optimization — rewrites CLAUDE.md, GEMINI.md, or .github/copilot-instructions.md to under ~80 lines, keeping only rules that directly change agent behaviour; (3) session token efficiency — guidance on cheap subagent delegation, context compounding across turns, and session hygiene. Trigger with "optimize context", "reduce context bloat", "deduplicate skills", "trim CLAUDE.md", "trim GEMINI.md", "fix my context usage", "why are my skills loading twice", "how do I reduce token usage", or "clean up agent directories".