plugins/agent-scaffolders/skills/eval-autoresearch-fit/SKILL.md
Trigger with "evaluate autoresearch fit", "score this skill for karpathy loop", "is this a good autoresearch candidate", "assess autoresearch viability for", "which skills are best for autonomous loop optimization", "score skills for 3-file architecture", or when the user wants to determine if a skill is a good candidate for applying the Karpathy autoresearch autonomous optimization loop pattern.
Assess whether a skill is a viable candidate for the Karpathy 3-File Autoresearch autonomous
optimization loop. Scores each skill on four dimensions, proposes what the 3-file architecture
would look like, and updates the canonical summary-ranked-skills.json via the update script.
The Karpathy autoresearch pattern requires three conditions simultaneously:
Skills that lack these properties cannot run an effective autonomous loop.
The canonical ranked skills list lives at:
plugin-research/experiments/analyze-candidates-for-auto-reseaarch/skills/eval-autoresearch-fit/assets/resources/summary-ranked-skills.json
After every evaluation, update it with the update script (see Step 5).
Each dimension is scored 1-10. Max total = 40.
| Dimension | 10 (Best) | 1 (Worst) |
|---|---|---|
| Objectivity | Binary pass/fail or exact numeric output from a shell command | Purely subjective, requires human taste judgment |
| Execution Speed | Completes in seconds | Requires 30+ min or human input |
| Frequency of Use | Triggered multiple times per day | Rarely needed (monthly or less) |
| Potential Utility | Prevents systemic failures or saves hours per session | Nice-to-have improvement |
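As a rough illustration of the scoring arithmetic, the sketch below validates four 1-10 scores and sums them to a total out of 40. The class name `FitScores` is hypothetical; the field names simply mirror the `--objectivity`, `--speed`, `--frequency`, and `--utility` flags of the update script. It does not assign a verdict, since that depends on the viability thresholds.

```python
from dataclasses import dataclass

# Field names mirror the update script's score flags; the class itself is illustrative.
DIMENSIONS = ("objectivity", "speed", "frequency", "utility")


@dataclass(frozen=True)
class FitScores:
    objectivity: int
    speed: int
    frequency: int
    utility: int

    def __post_init__(self) -> None:
        # Each dimension is scored 1-10; reject anything outside that range.
        for name in DIMENSIONS:
            value = getattr(self, name)
            if not isinstance(value, int) or not 1 <= value <= 10:
                raise ValueError(f"{name} must be an integer from 1 to 10, got {value!r}")

    @property
    def total(self) -> int:
        # Four dimensions at 10 points each give a maximum of 40.
        return sum(getattr(self, name) for name in DIMENSIONS)


scores = FitScores(objectivity=9, speed=8, frequency=6, utility=7)
print(f"{scores.total}/40")  # prints 30/40
```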
Viability thresholds:
If $ARGUMENTS is a path to a directory containing SKILL.md, read it directly.
Otherwise find it by name from the repo root:
```bash
PROJECT_ROOT=$(git rev-parse --show-toplevel)
find "$PROJECT_ROOT/plugins" -name "SKILL.md" | grep "$ARGUMENTS" | head -5
```
Read the SKILL.md fully before scoring.
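The lookup above could also be done from Python, for example inside a helper script. This is a minimal sketch under two assumptions: the function name `find_skill_files` is hypothetical, and it matches the name as a plain substring of the path, whereas `grep` treats it as a regular expression. Results are sorted for repeatability, which `find` does not guarantee.

```python
from pathlib import Path


def find_skill_files(plugins_dir: Path, name: str, limit: int = 5) -> list[Path]:
    """Return up to `limit` SKILL.md paths whose full path contains `name`."""
    # rglob walks the tree recursively, like `find ... -name "SKILL.md"`.
    matches = sorted(p for p in plugins_dir.rglob("SKILL.md") if name in str(p))
    # The slice plays the role of `head -5`.
    return matches[:limit]
```

A call such as `find_skill_files(project_root / "plugins", "eval-autoresearch-fit")` would return the candidate paths to read before scoring, where `project_root` is the repo root resolved as in the shell snippet.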
Reason through each dimension explicitly before assigning a number.
Objectivity (1-10)
Execution Speed (1-10)
Frequency of Use (1-10)
Potential Utility (1-10)
Determine the loop type:
Important: if a skill has both a script component and a prompt component, propose splitting it into two separate loops. Label them Loop A (script) and Loop B (prompt). Score each loop and list its barriers separately.
The Spec (program.md):
What is the optimization goal? What constraints apply? What is the NEVER STOP directive?
The Mutation Target: Which single file does the agent modify per iteration? If the skill inherently requires multi-file changes, flag this as a barrier and propose how to isolate it.
The Evaluator (evaluate.py):
Note: this `evaluate.py` is a script you would write when implementing the autoresearch loop for the target skill; it is NOT part of this skill. This skill only describes what it would look like. When ready to build the loop, create `evaluate.py` inside the target skill's `autoresearch/` directory.
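To make the shape of such an evaluator concrete, here is a minimal sketch that reads the mutation target and prints a single number. The metric is purely hypothetical (the fraction of assumed marker strings found in the file), as are the names `REQUIRED_MARKERS`, `score`, and `main`; a real evaluator would measure whatever the target skill's spec defines.

```python
import sys
from pathlib import Path

# Hypothetical checks: substrings a passing mutation target is assumed to contain.
REQUIRED_MARKERS = ("name:", "description:", "## ")


def score(text: str, markers: tuple[str, ...] = REQUIRED_MARKERS) -> float:
    """Fraction of markers present in text; the same input always gives the same score."""
    if not markers:
        return 0.0
    hits = sum(1 for marker in markers if marker in text)
    return hits / len(markers)


def main(argv: list[str]) -> int:
    """Print one number to stdout so the loop can parse it; return an exit code."""
    if len(argv) != 2:
        print("usage: evaluate.py path/to/mutation-target", file=sys.stderr)
        return 2
    text = Path(argv[1]).read_text(encoding="utf-8")
    print(f"{score(text):.4f}")
    return 0
```

In a real `evaluate.py`, `main(sys.argv)` would be called under an `if __name__ == "__main__":` guard. Keeping the metric free of randomness and network calls is what lets the evaluator count as deterministic.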
Produce the assessment in this format:
## Autoresearch Fit Assessment: [Skill Name]
**Plugin:** [plugin-name]
**Skill path:** [relative path from repo root]
### Scores
| Dimension | Score | Rationale |
|---|---|---|
| Objectivity | X/10 | [one line] |
| Execution Speed | X/10 | [one line] |
| Frequency of Use | X/10 | [one line] |
| Potential Utility | X/10 | [one line] |
| **TOTAL** | **X/40** | |
**Verdict: [HIGH / MEDIUM / LOW / NOT_VIABLE]**
**Loop type: [DETERMINISTIC / LLM_IN_LOOP / HYBRID]**
### Proposed 3-File Architecture
**Spec (`program.md`):**
> [2-3 sentences: optimization goal + constraints + NEVER STOP directive]
**Mutation Target:** `[path/to/file]`
**Evaluator command:**
```bash
[shell command that outputs a single number]
```
Deterministic: [YES / NO + explanation]
[1-2 sentences. If MEDIUM: what to address first.]
Then update the JSON using the update script:
```bash
DATA_JSON=$(git rev-parse --show-toplevel)/plugin-research/experiments/analyze-candidates-for-auto-reseaarch/skills/eval-autoresearch-fit/assets/resources/summary-ranked-skills.json
SKILL_DIR=.
python "$SKILL_DIR/scripts/update_ranked_skills.py" \
--json-path "$DATA_JSON" \
--plugin <plugin> \
--skill <skill> \
--objectivity X --speed X --frequency X --utility X \
--verdict HIGH|MEDIUM|LOW|NOT_VIABLE \
--loop-type DETERMINISTIC|LLM_IN_LOOP|HYBRID \
--mutation-target "path/to/file" \
--evaluator-command "python evaluate.py ..." \
--barriers "Barrier 1" "Barrier 2" \
--eval-notes "Key insight from this evaluation" \
--status EVALUATED
```

```bash
DATA_JSON=$(git rev-parse --show-toplevel)/plugin-research/experiments/analyze-candidates-for-auto-reseaarch/skills/eval-autoresearch-fit/assets/resources/summary-ranked-skills.json

# List all entries with current status
python ./scripts/update_ranked_skills.py --json-path "$DATA_JSON" --list

# Show a specific entry
python ./scripts/update_ranked_skills.py --json-path "$DATA_JSON" \
  --plugin agent-agentic-os --skill os-eval-runner --show

# List only PENDING entries (next batch to evaluate)
python ./scripts/update_ranked_skills.py --json-path "$DATA_JSON" \
  --list --filter-status PENDING

# Generate morning report (full ranked table + recommendation)
python ./scripts/update_ranked_skills.py --json-path "$DATA_JSON" --morning-report
```
When the user says "evaluate next batch" or "continue the list":
- Run `--list --filter-status PENDING` to see remaining skills
- `total_autoresearch_viability`

When a skill scores HIGH or MEDIUM, scaffold the actual autoresearch loop inside the target skill
using the `autoresearch/` convention. This folder lives inside the target skill directory.
Directory convention (inside the target skill):
```
plugins/<plugin>/skills/<skill>/
  SKILL.md              ← mutation target (agent edits this each iteration)
  autoresearch/         ← the loop lives here
    program.md          ← the spec (goal + constraints + NEVER STOP)
    evaluate.py         ← LOCKED evaluator (agent must never modify this)
    results.tsv         ← experiment ledger (one row per iteration)
    tasks/              ← golden task fixtures (LLM_IN_LOOP skills only)
    test-fixtures/      ← deterministic inputs (DETERMINISTIC skills only)
```
Why evaluate.py runs every iteration:
The loop is: agent mutates SKILL.md → run autoresearch/evaluate.py → record metric in
results.tsv → KEEP (commit) or DISCARD (git reset). evaluate.py is locked — the agent only
touches the mutation target.
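The record-and-decide step of that loop can be sketched as follows. This assumes a higher metric is better and that a candidate is kept only when it strictly beats the best score so far; both are assumptions, as is the function name `record_iteration`. The column order follows the `results.tsv` header used in the scaffold steps.

```python
import csv
from pathlib import Path
from typing import Optional

LEDGER_HEADER = ["commit", "metric", "status", "description"]


def record_iteration(
    ledger: Path,
    commit: str,
    metric: float,
    best_so_far: Optional[float],
    description: str,
) -> str:
    """Append one row to the ledger and return "KEEP" or "DISCARD"."""
    # Assumption: higher is better, and ties are discarded.
    status = "KEEP" if best_so_far is None or metric > best_so_far else "DISCARD"
    write_header = not ledger.exists() or ledger.stat().st_size == 0
    with ledger.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        if write_header:
            writer.writerow(LEDGER_HEADER)
        writer.writerow([commit, f"{metric:.4f}", status, description])
    return status
```

The caller would then commit the mutation on `KEEP` or reset the working tree on `DISCARD`. Discarded attempts are still written to the ledger, so the file stays a complete record of every iteration.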
Cost by loop type:
Implement DETERMINISTIC candidates first where possible (fast, free, many trials).
Scaffold steps for HIGH/MEDIUM verdicts:
1. Create `autoresearch/` inside the target skill directory
2. Write `program.md` from the template (goal, metric, mutation target, NEVER STOP)
3. Write `evaluate.py` implementing the evaluator command from the assessment
4. Initialize `results.tsv` with header: `commit\tmetric\tstatus\tdescription`
5. For DETERMINISTIC skills, add `test-fixtures/` with at least one deterministic input
6. For LLM_IN_LOOP skills, add `tasks/` with at least one human-validated golden task

Search `plugins/` from repo root, report path before proceeding
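The scaffold steps could be automated along these lines. This is a sketch, not the project's tooling: the function name `scaffold_autoresearch` and the placeholder file contents are hypothetical, and giving HYBRID skills both fixture folders is an assumption. Existing files are never overwritten.

```python
from pathlib import Path

LEDGER_HEADER = "commit\tmetric\tstatus\tdescription\n"

# Placeholder contents only; the real program.md template is not reproduced here.
PLACEHOLDERS = {
    "program.md": "# Goal\n\n# Metric\n\n# Mutation target\n\n# NEVER STOP\n",
    "evaluate.py": "# TODO: implement the evaluator command from the assessment\n",
    "results.tsv": LEDGER_HEADER,
}

LOOP_TYPES = ("DETERMINISTIC", "LLM_IN_LOOP", "HYBRID")


def scaffold_autoresearch(skill_dir: Path, loop_type: str) -> Path:
    """Create the autoresearch/ layout inside skill_dir and return its path."""
    if loop_type not in LOOP_TYPES:
        raise ValueError(f"unknown loop type: {loop_type!r}")
    root = skill_dir / "autoresearch"
    root.mkdir(parents=True, exist_ok=True)
    for name, content in PLACEHOLDERS.items():
        path = root / name
        if not path.exists():  # keep any work that is already there
            path.write_text(content, encoding="utf-8")
    # Assumption: HYBRID loops get both fixture folders.
    if loop_type in ("DETERMINISTIC", "HYBRID"):
        (root / "test-fixtures").mkdir(exist_ok=True)
    if loop_type in ("LLM_IN_LOOP", "HYBRID"):
        (root / "tasks").mkdir(exist_ok=True)
    return root
```

The fixture folders start empty, so at least one deterministic input or human-validated golden task still has to be added by hand before the loop can run.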