Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

richfrem/os-eval-runner

Name: os-eval-runner
Author: richfrem

plugins/agent-agentic-os/skills/os-eval-runner/SKILL.md

npx skillsauth add richfrem/agent-plugins-skills os-eval-runner

Skill Improvement Evaluator

Stateless evaluation engine that scores and gates skill improvement iterations using headless Python evaluation scripts.

Ownership Boundary (Critical)

What os-eval-runner owns (permanent, version-controlled with this skill)

Scoring scripts: ./scripts/evaluate.py, ./scripts/eval_runner.py
Scaffold script: ./scripts/init_autoresearch.py
Templates: ./assets/templates/autoresearch/ (program, evals, results, proposer prompt)

What lives with the target (deployed per experiment)

All experiment state is deployed alongside the target (e.g. <experiment-dir>/references/program.md, <experiment-dir>/evals/evals.json, <experiment-dir>/evals/results.tsv). You MUST read the spec from <experiment-dir>/references/program.md and MUST NOT fall back to engine-local config templates.
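
The boundary above can be sketched as a small path resolver. This is a minimal illustration, not the skill's actual code: the function name resolve_experiment_files and the ExperimentFiles type are hypothetical, and only the three relative paths come from the text.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ExperimentFiles:
    """Per-experiment state that lives with the target (hypothetical type)."""
    program: Path
    evals: Path
    results: Path


def resolve_experiment_files(experiment_dir: str) -> ExperimentFiles:
    """Resolve experiment state under the experiment directory only.

    Raises instead of falling back to engine-local templates when the
    spec (references/program.md) is missing.
    """
    root = Path(experiment_dir)
    program = root / "references" / "program.md"
    if not program.is_file():
        raise FileNotFoundError(
            f"spec not found at {program}; scaffold the experiment first"
        )
    return ExperimentFiles(
        program=program,
        evals=root / "evals" / "evals.json",
        results=root / "evals" / "results.tsv",
    )
```

Failing loudly here is the point of the sketch: a silent fallback to a template would let an evaluation run against a spec that does not belong to the experiment.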

Phase 0: Intake Interview

Run this interview before starting any loop or evaluation. Skip any question that the initial prompt already answers.

Q1 — What target skill are you evaluating? (Provide path to skill folder)
Q2 — Where should the experiment files live? (Defaults to target skill directory)
Q2b — What metric are you optimizing? (quality_score, f1, precision, recall, or heuristic)
Q3 — What mode? (Loop mode for autonomous improvement vs QA mode for single diff validation)
Q4 — (Loop mode) How many iterations? (Default: NEVER STOP)
Q5 — Does evals.json exist? (If missing, scaffold from template)
Q6 — Does program.md exist? (If missing, scaffold from template)
Q7 — Does a baseline score exist? (If missing, run evaluate.py with --baseline)
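
The interview answers can be captured in a small record, with Q5 to Q7 turned into a list of pending setup steps. This is a sketch under stated assumptions: the Intake class, the mode labels "loop" and "qa", and the has_baseline flag are hypothetical. The metric names, the Q2 default, and the Q4 default come from the questions above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional

# Metric names listed in Q2b
ALLOWED_METRICS = {"quality_score", "f1", "precision", "recall", "heuristic"}


@dataclass
class Intake:
    """Answers to the Phase 0 questions (hypothetical container)."""
    skill_dir: str                        # Q1
    metric: str                           # Q2b
    mode: str                             # Q3: "loop" or "qa" (assumed labels)
    experiment_dir: Optional[str] = None  # Q2: defaults to the skill directory
    iterations: Optional[int] = None      # Q4: None means never stop
    has_baseline: bool = False            # Q7 (assumed to be supplied by caller)

    def __post_init__(self) -> None:
        if self.metric not in ALLOWED_METRICS:
            raise ValueError(f"unknown metric: {self.metric!r}")
        if self.mode not in ("loop", "qa"):
            raise ValueError(f"unknown mode: {self.mode!r}")
        if self.experiment_dir is None:
            self.experiment_dir = self.skill_dir


def pending_setup(intake: Intake) -> List[str]:
    """List the setup steps still needed, following Q5 to Q7."""
    root = Path(intake.experiment_dir)
    steps = []
    if not (root / "evals" / "evals.json").is_file():
        steps.append("scaffold evals.json from template")
    if not (root / "references" / "program.md").is_file():
        steps.append("scaffold program.md from template")
    if not intake.has_baseline:
        steps.append("run evaluate.py with --baseline")
    return steps
```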

Two Modes: Summarized

Mode 1: Autoresearch Loop: Autonomous iterative improvement. The agent identifies failure types, requests mutations via an external proposer CLI (Copilot/Gemini), and reruns the eval gate until the iteration budget is exhausted or the target score is reached.
Mode 2: Single-shot QA: Simple gate validation. Evaluates one specific proposed diff against the baseline and decides KEEP (exit 0) or DISCARD (revert, exit 1).
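
The KEEP/DISCARD decision in Mode 2 can be illustrated with a short gate function. This is a sketch, not the logic of evaluate.py: the function name gate and the revert callback are hypothetical, and the rule that a candidate must score strictly higher than the baseline is an assumption. Only the verdict names and exit codes come from the text.

```python
from typing import Callable, Tuple

KEEP, DISCARD = "KEEP", "DISCARD"


def gate(baseline: float, candidate: float,
         revert: Callable[[], None]) -> Tuple[str, int]:
    """Return (verdict, exit_code) for one proposed diff.

    Assumption: a candidate is kept only if it scores strictly higher
    than the baseline; the real threshold rule may differ.
    """
    if candidate > baseline:
        return KEEP, 0
    revert()  # DISCARD: undo the proposed change before reporting failure
    return DISCARD, 1
```

A caller would typically pass a callback that restores the mutated file and then hand the exit code to sys.exit, so that a calling agent or script can branch on it.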

Stage Pointers & Reference Protocols

Setup: Start a New Experiment — 4-step setup and re-baselining procedure.
Mode 1: Autoresearch Loop Protocol — Proposer cycles, prompt mutations, and evaluation loop.
Mode 2: Single-shot QA Protocol — Context acquisition, reverts, and reporting.
Phase 2b: Overfitting Gate — Holdout set overfitting checks and forced discard logic.
Phase 5: Self-Assessment Survey — Mandatory evaluator survey guidelines.

Smoke Test & Gotchas

Smoke Test

Scaffold an experiment: python3 ./scripts/init_autoresearch.py --experiment-dir temp/test-exp --mutation-target SKILL.md
Establish baseline: python3 ./scripts/evaluate.py --skill temp/test-exp --baseline --desc "smoke test"
Validate the result: Assert that results.tsv is created and that evaluate.py exits with code 0.
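
The three smoke-test steps can be wrapped in one script. This is a minimal sketch: the helper names are hypothetical, the commands and flags are copied from the steps above, and the location of results.tsv under evals/ is an assumption based on the experiment layout described earlier. It must be run from the skill root so that the relative script paths resolve.

```python
import subprocess
from pathlib import Path
from typing import List


def smoke_commands(experiment_dir: str) -> List[List[str]]:
    """Build the two smoke-test commands from the steps above."""
    return [
        ["python3", "./scripts/init_autoresearch.py",
         "--experiment-dir", experiment_dir,
         "--mutation-target", "SKILL.md"],
        ["python3", "./scripts/evaluate.py",
         "--skill", experiment_dir, "--baseline", "--desc", "smoke test"],
    ]


def run_smoke_test(experiment_dir: str = "temp/test-exp") -> None:
    """Run both commands and check the outcome."""
    for cmd in smoke_commands(experiment_dir):
        # check=True raises CalledProcessError on a non-zero exit code
        subprocess.run(cmd, check=True)
    # Assumption: results.tsv lands under evals/ in the experiment directory
    results = Path(experiment_dir) / "evals" / "results.tsv"
    assert results.is_file(), f"{results} was not created"
```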

Gotchas

Subjective Simulation: Avoid "mentally simulating" routing accuracy. Subjective audits are strictly banned; run the Python evaluation scripts instead.
Missing Holdout: Do not start a loop without holdout prompts. Doing so bypasses the overfitting gate and renders the results invalid.
Keywords Footgun: Avoid adding too many triggers to the frontmatter. Doing so dilutes semantic discrimination and degrades overall router precision.
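
To make the first and third gotchas concrete, routing accuracy can be measured from recorded outcomes rather than estimated by inspection. The sketch below uses the standard definitions of precision, recall, and F1; the function name routing_metrics and the (expected, triggered) input shape are hypothetical, and how evaluate.py actually computes its scores is not shown here.

```python
from typing import Dict, Iterable, Tuple


def routing_metrics(outcomes: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute precision, recall, and F1 from (expected, triggered) pairs."""
    tp = fp = fn = 0
    for expected, triggered in outcomes:
        if triggered and expected:
            tp += 1      # skill fired and should have
        elif triggered and not expected:
            fp += 1      # skill fired on a prompt it should have ignored
        elif expected and not triggered:
            fn += 1      # skill stayed silent on a prompt it should have handled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Over-broad triggers show up here as false positives: each prompt the skill wrongly fires on lowers precision even when recall stays high.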

Stateless evaluation engine that scores and gates skill improvement iterations using headless Python evaluation scripts. Use when the user says "evaluate this skill", "run autoresearch loop on", "optimize this skill", "run the eval loop", or when another agent proposes a change and needs validation.

4 stars

development

Updated Jul 11, 2026

npx skillsauth add richfrem/agent-plugins-skills os-eval-runner

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy, container and dependency vulnerability scanner: Clean, 95%
Semgrep, static code analysis for vulnerabilities: Clean, 95%
mcp-scan (Snyk), Model Context Protocol security validation: Clean, 95%
Snyk (dep), open source security scanning: Skipped, 50%
Socket.dev, supply chain security analysis: Skipped, 50%
VirusTotal, multi-engine malware detection: Skipped, 50%
CrowdStrike, advanced threat intelligence: Skipped, 50%
OSV-Scanner, Open Source Vulnerability database check: Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Jul 11, 2026, 6:19 AM (267.6s, 37 files scanned)

SKILL.md

name: os-eval-runner
plugin: agent-agentic-os
description: >
allowed-tools: Read, Write, Edit, Bash, Glob, Grep

Related Skills

richfrem/issue-worktree-agent

testing

Verified | Trusted | Community

Skill for creating and managing isolated git worktrees (`.worktrees/issue-NNN`) for issue execution branches. USE ONLY when setting up or cleaning up isolated git worktrees for specific issue execution. DO NOT USE for managing local task files (use `task-agent`) or escalating tasks to issues (use `github-issue-backlog-agent`).

4 | SKILL.md | Updated Jul 29, 2026

richfrem/issue-pr-lifecycle-agent

data-ai

Verified | Trusted | Community

Skill for orchestrating the end-to-end GitHub issue lifecycle flow: Issue -> Worktree -> Implementation -> PR Creation -> Resolution Closure. USE ONLY when running or dry-running full lifecycle orchestration for resolving an issue with a PR. DO NOT USE for isolated worktree management only (use `issue-worktree-agent`) or logging issues (use `github-issue-agent`).

4 | SKILL.md | Updated Jul 29, 2026

richfrem/github-issue-prioritizer

tools

Verified | Trusted | Community

Automatically ranks GitHub issues (P0-P3) based on friction tier, frequency, and blockages, synchronizing priority labels and GitHub Projects v2 custom fields.

4 | SKILL.md | Updated Jul 29, 2026

richfrem/github-issue-backlog-agent

testing

Verified | Trusted | Community

Bridge skill for escalating ephemeral local task scratchpad items (`tasks/*.md`) into durable, taxonomy-validated, evidence-rich GitHub Issues. USE ONLY when promoting a single-session local task into durable repository backlog. DO NOT USE for managing local kanban boards (use `task-agent` instead) or directly querying/commenting on issues (use `github-issue-agent` instead).

4 | SKILL.md | Updated Jul 29, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

# Clone the repo
git clone https://github.com/richfrem/agent-plugins-skills.git

# Make sure the global Claude Code skills folder exists
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r agent-plugins-skills/plugins/agent-agentic-os/skills/os-eval-runner ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

richfrem/agent-plugins-skills

4 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT