skills/auto-improve/SKILL.md
Autonomously improve any of five targets: skills (SKILL.md prompt optimization via binary eval loops), memories (audit for staleness, gaps, redundancy, and inconsistencies then rewrite), AI agents (agents/*.md prompt optimization via eval loops), documentation (repo docs like `AGENTS.md`, `PLAN.md`, `SPEC.md`, `SOUL.md`, `PRINCIPLES.md`, `DESIGN.md`, `README.md`, `ARCHITECTURE.md`, `TESTS.md`, `SETUP.md`, `RUNBOOK.md`, `CHANGELOG.md`, `SECURITY.md`, `OVERVIEW.md`, `FAQ.md`, `DECISIONS.md`, `DEPENDENCIES.md`, `CONTRIBUTING.md`, `TESTING.md`, `runbooks/**/*.md`, and `docs/**/*.md` optimized via eval loops), and conversations (Hermes-pattern background review that harvests user persona, preferences, and reusable workflows from the current conversation and persists them as memory files or new skills). Uses Karpathy-style autoresearch methodology for eval-loop targets: run, score, mutate one thing at a time, keep improvements, discard regressions, never stop. Extends that loop with hyperagent-style metacognition: the system should improve not just the target, but the way it generates future improvements, using persistent memory, stepping-stone archives, causal hypotheses, and transferable lessons across runs. Trigger from evidence in recent work: the files changed, failures encountered, repeated user corrections, patterns in agent behavior, and gaps revealed by the latest task. Do not wait for the user to name a target. Choose the highest-leverage improvement target or targets automatically; improve one or many as justified by the evidence. Outputs: improved target file, results.tsv score log, changelog.md mutation log, persistent self-improvement memory, stepping-stone archive, and live dashboard.html for eval-loop targets; memory/skill files for conversation reviews.
`npx skillsauth add alvarovillalbaa/agent-suite auto-improve`
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Most skills, agents, and documentation work about 70% of the time. The other 30% produces inconsistent, shallow, or wrong output. The fix is not a full rewrite — it is letting an autonomous loop run the target repeatedly, score every output against binary criteria, tighten the prompt until that 30% disappears, and keep a complete research log of every mutation attempted.
Memories are different: they degrade silently. Facts go stale, gaps accumulate, entries duplicate. The fix is a structured audit followed by targeted rewrites.
This skill handles both patterns under one entry point.
It is not request-routed. The trigger is what the agents actually did: files changed, mistakes repeated, user directions clarified, workflows that felt awkward, docs that were missing, and gaps between expected behavior and actual behavior.
It should behave like a lightweight hyperagent, not a one-shot optimizer:
STOP. Do not touch any file until you have completed this discovery pass:
Inspect the latest evidence from the work that just happened:
Build a candidate improvement list across these target types:
- skill: a SKILL.md file whose instructions caused weak or awkward execution
- agent: an agent definition under agents/ whose routing, trigger text, or workflow was off
- documentation: repo docs like AGENTS.md, PLAN.md, SPEC.md, SOUL.md, PRINCIPLES.md, DESIGN.md, README.md, ARCHITECTURE.md, TESTS.md, SETUP.md, RUNBOOK.md, CHANGELOG.md, SECURITY.md, OVERVIEW.md, FAQ.md, DECISIONS.md, DEPENDENCIES.md, CONTRIBUTING.md, TESTING.md, runbooks/**/*.md, or docs/**/*.md
- memory: memory files under ~/.claude/projects/*/memory/ that are stale, contradictory, or missing key durable facts
- conversation: harvest durable memory or reusable workflows from the current conversation
Rank candidates by:
Choose the smallest set of targets that fixes the real problem:
AGENTS.md, PLAN.md, SPEC.md, DESIGN.md, or a runbook.
Route each selected target to the appropriate sub-flow below. Process multiple targets one at a time, highest leverage first.
If the evidence does not justify any durable improvement, stop and make no changes.
Find all memory files:
- ~/.claude/projects/*/memory/MEMORY.md
- ~/.claude/projects/*/memory/*.md (exclude MEMORY.md itself)
Read every file in full before forming any judgment.
For every memory file, check:
Staleness — Does the file contain relative dates ("last Thursday", "recently", "this quarter") that have no absolute anchor? Does it describe a state (a role, a project, a decision) that may no longer be true? Flag these for verification or removal.
Gaps — Is there a category of user knowledge (role, preferences, project context, key decisions) that the memory system clearly should have but doesn't? Note what is missing and why it matters.
Redundancy — Do two or more files encode the same fact? Is any file a strict subset of another? Mark duplicates for consolidation.
Inconsistencies — Do any two files contradict each other? Does a file's type field mismatch its content? Does the MEMORY.md index reference a file that no longer exists, or omit one that does? Flag every conflict.
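Two of the checks above (unanchored relative dates, and index entries that do not match the files on disk) are mechanical enough to sketch in code. The following is a minimal illustration, not part of the skill: the phrase list, the ISO-date anchor rule, and both function names are assumptions chosen for the example.

```python
import re

# Hypothetical phrase list; the text names "last Thursday", "recently",
# and "this quarter" as examples of relative dates.
RELATIVE_DATE = re.compile(
    r"\b(recently|yesterday|today|tomorrow|(?:last|this|next)\s+"
    r"(?:week|month|quarter|year|monday|tuesday|wednesday|thursday|"
    r"friday|saturday|sunday))\b",
    re.IGNORECASE,
)
# Assumption: an ISO date on the same line counts as an absolute anchor.
ABSOLUTE_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def find_unanchored_dates(text: str) -> list[str]:
    """Return relative date phrases found on lines with no ISO date."""
    hits = []
    for line in text.splitlines():
        if ABSOLUTE_DATE.search(line):
            continue  # the line carries its own anchor
        hits.extend(m.group(0) for m in RELATIVE_DATE.finditer(line))
    return hits


def index_mismatches(index_text: str, files_on_disk: set[str]) -> tuple[set[str], set[str]]:
    """Compare .md names referenced in the index with the files that exist.

    Returns (referenced but missing on disk, on disk but not indexed).
    """
    referenced = set(re.findall(r"[\w.-]+\.md", index_text)) - {"MEMORY.md"}
    return referenced - files_on_disk, files_on_disk - referenced
```

Gaps and redundancy still need judgment; a script like this only narrows down which files deserve a closer read.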
Create auto-improve-memory/audit-report.md with this structure:
# Memory Audit Report — YYYY-MM-DD
## Summary
- Files reviewed: N
- Issues found: N (staleness: N, gaps: N, redundancy: N, inconsistencies: N)
## Staleness
- [filename]: [what is stale and why]
## Gaps
- [what is missing]: [why it matters]
## Redundancy
- [filename A] duplicates [filename B]: [what overlaps]
## Inconsistencies
- [filename A] vs [filename B]: [what conflicts]
## Recommended actions
1. [action] — [file]
2. ...
For each issue found, take the recommended action:
Rewrite one file at a time. After each rewrite, update the MEMORY.md index if needed.
Create auto-improve-memory/changelog.md:
## [filename] — [action taken]
**Issue:** [what was wrong]
**Change:** [what was rewritten or added]
**Reason:** [why this improves the memory system]
Report:
This sub-flow implements the Hermes-style self-improving memory pattern inspired by the NousResearch agent. Instead of waiting for explicit memory commands, the system periodically intercepts a user turn, spawns a background job, reviews the conversation, and saves only the durable things worth keeping — so the agent grows with the user without distracting from the main task.
Trigger this sub-flow automatically every 10 conversation turns. The trigger point is the user's message on turn 10, 20, 30, etc. That user message is intercepted and a new background job is spawned.
The main response continues normally. The review must happen asynchronously so it does not delay, distract, or derail the agent handling the active task.
Also trigger this sub-flow whenever the latest work reveals durable preferences, workflow expectations, or project facts that should persist beyond the current task, even if the 10-turn cadence has not been hit yet.
If the discovery pass above identifies conversation-derived memory as a target, run it immediately on the full conversation so far.
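The trigger rules above reduce to a small predicate. This is a sketch under assumptions: `should_review` and the `durable_signal` flag are hypothetical names, and how a durable signal is detected is left to the agent.

```python
REVIEW_INTERVAL = 10  # from the text: user turns 10, 20, 30, ...


def should_review(turn: int, durable_signal: bool = False) -> bool:
    """Decide whether to spawn the background review on this user turn.

    durable_signal is a hypothetical flag set when the latest work revealed
    preferences or project facts worth persisting before the cadence is hit.
    """
    on_cadence = turn > 0 and turn % REVIEW_INTERVAL == 0
    return on_cadence or durable_signal
```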
Read the full conversation above. Apply this exact review prompt:
Review the conversation above and consider saving to memory if appropriate.
Focus on:
- Has the user revealed things about themselves — their persona, desires, preferences, or personal details worth remembering?
- Has the user expressed expectations about how you should behave, their work style, or ways they want you to operate?
If something stands out, save it using the memory tool. If nothing is worth saving, just say "Nothing to save." and stop.
Do not paraphrase or soften the prompt. Use it as written.
For each piece of information worth saving, decide which durable destination it belongs to:
| Finding type | Target |
|---|---|
| User persona, goals, background, personal preferences | Personal memory (USER.md-equivalent) |
| Work style expectations, behavior instructions, corrections | Personal/operating memory (USER.md-equivalent) |
| Project context, technical decisions, timelines, constraints, repo facts | Technical memory (MEMORY.md-equivalent) |
| Repeated workflow or technique the user keeps applying | New skill file |
In this repository's memory system, that means:
USER.md.
MEMORY.md.
Do not save what is already in memory. Before writing, check existing memory files for duplicates or superseded entries.
For memory targets — preserve the Hermes destination model:
USER.md-equivalent destination
MEMORY.md-equivalent destination
When this repo uses discrete memory files instead of literal USER.md / MEMORY.md, map the content into the equivalent structure without losing the distinction between personal memory and technical memory.
For structured memory files — write using the standard frontmatter format:
---
name: [descriptive name]
description: [one-line description for MEMORY.md index]
type: user | feedback | project | reference
---
[content — for feedback/project types: rule/fact, then **Why:** and **How to apply:** lines]
Then add or update the pointer in MEMORY.md.
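As a rough illustration of the frontmatter format above, the helper below renders a memory file as a string. The function name and the validation step are assumptions for this sketch; writing the file and updating the MEMORY.md pointer are left out.

```python
# The four type values come from the frontmatter format in the text.
VALID_TYPES = {"user", "feedback", "project", "reference"}


def render_memory_file(name: str, description: str, mem_type: str, content: str) -> str:
    """Return the full text of a structured memory file (hypothetical helper)."""
    if mem_type not in VALID_TYPES:
        raise ValueError(f"unknown memory type: {mem_type!r}")
    return (
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"
        f"type: {mem_type}\n"
        "---\n"
        f"{content}\n"
    )
```

For feedback and project types, `content` would hold the rule or fact followed by the **Why:** and **How to apply:** lines.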
For skill targets — only create a new skill if the pattern is genuinely reusable across sessions (not just a one-off technique). Use the standard SKILL.md frontmatter with name and description.
When running as a background job, do not interrupt the main conversation. Silently save the files. After the main response is delivered, append a one-line status:
(Background memory review: saved [N] items — [brief list of what was saved])
If nothing was saved: no status line needed.
USER.md-equivalent memory; technical/project facts belong in MEMORY.md-equivalent memory; reusable procedures belong in skills.
STOP. Do not run any experiments until all fields below are grounded in evidence from the latest work. Derive them; only ask the user if a critical ambiguity cannot be resolved from context.
.md, or documentation .md
The loop is not only trying to improve the current target. It is also trying to improve the improvement process itself.
That means:
Do not assume the evaluation task and the self-modification task are perfectly aligned. A good output on one task does not automatically imply a good mutation strategy. Explicitly reason about the meta-level process.
Before changing anything, read and understand the target completely:
references/ that the skill links to
Do NOT skip this. You need to understand what the target does before you can improve it.
Convert the intended behavior plus the evidence package into a structured test. Every check must be binary — pass or fail, no scales.
Format each eval as:
EVAL [N]: [Short name]
Question: [Yes/no question about the output]
Pass condition: [What "yes" looks like — be specific]
Fail condition: [What triggers a "no"]
Max score: [number of evals] × [test inputs] × [runs per input]
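The eval format and max-score formula above can be sketched with plain predicates. The two sample checks in the usage note are hypothetical; real evals would encode the pass and fail conditions written for the target.

```python
from typing import Callable

# A binary eval: takes one output and answers yes (True) or no (False).
Eval = Callable[[str], bool]


def max_score(n_evals: int, n_inputs: int, runs_per_input: int) -> int:
    """Max score = evals x test inputs x runs per input, as stated above."""
    return n_evals * n_inputs * runs_per_input


def score_outputs(outputs: list[str], evals: list[Eval]) -> int:
    """Count one point for every (output, eval) pair that passes."""
    return sum(1 for out in outputs for check in evals if check(out))
```

For example, with `evals = [lambda o: o.startswith("# "), lambda o: len(o.split()) <= 200]`, two test inputs, and five runs each, `max_score(2, 2, 5)` is 20, and `score_outputs` is called on all ten collected outputs.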
Create auto-improve-[name]/dashboard.html — a self-contained HTML file with inline CSS and JS. Open it immediately: open auto-improve-[name]/dashboard.html.
The dashboard must:
results.json
Update auto-improve-[name]/results.json after every experiment. Format:
{
"target": "[name]",
"status": "running",
"current_experiment": 3,
"baseline_score": 70.0,
"best_score": 90.0,
"memory_summary": [
"Mutation ordering mattered more than instruction count",
"Experiment 2 overcorrected and hurt routing precision"
],
"experiments": [
{
"id": 0,
"score": 14,
"max_score": 20,
"pass_rate": 70.0,
"status": "baseline",
"description": "original — no changes"
}
],
"eval_breakdown": [
{"name": "Eval name", "pass_count": 8, "total": 10}
]
}
When the run ends, set "status": "complete".
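One way to keep results.json consistent is to route every update through a single helper. This sketch mirrors the field names in the format above; the helper names, and the rule that a discarded experiment never raises `best_score`, are assumptions.

```python
import json


def record_experiment(results, exp_id, score, max_score, status, description):
    """Append one experiment and refresh the summary fields."""
    pass_rate = round(100.0 * score / max_score, 1)
    results.setdefault("experiments", []).append({
        "id": exp_id,
        "score": score,
        "max_score": max_score,
        "pass_rate": pass_rate,
        "status": status,
        "description": description,
    })
    results["current_experiment"] = exp_id
    if status == "baseline":
        results["baseline_score"] = pass_rate
    if status != "discard":  # assumption: discarded runs never count as best
        results["best_score"] = max(results.get("best_score", 0.0), pass_rate)
    return results


def save_results(results, path):
    """Write the dict to disk so the dashboard can poll it."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2)
```

Calling `save_results(results, "auto-improve-[name]/results.json")` after each `record_experiment` call keeps the file current for the dashboard.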
auto-improve-[name]/results.tsv with header row
auto-improve-[name]/[filename].baseline
self_improvement_memory.md and archive/
self_improvement_memory.md with any relevant lessons from prior runs on similar targets
results.tsv format (tab-separated):
experiment score max_score pass_rate status description
0 14 20 70.0% baseline original — no changes
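A small formatter keeps the TSV columns in the order shown above. This is an illustrative sketch; `TSV_HEADER` and `tsv_row` are hypothetical names.

```python
# Column order taken from the results.tsv header above.
TSV_HEADER = "\t".join(
    ["experiment", "score", "max_score", "pass_rate", "status", "description"]
)


def tsv_row(experiment: int, score: int, max_score: int, status: str, description: str) -> str:
    """Format one tab-separated results row, computing the pass rate."""
    pass_rate = f"{100.0 * score / max_score:.1f}%"
    return "\t".join(
        [str(experiment), str(score), str(max_score), pass_rate, status, description]
    )
```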
IMPORTANT: After baseline, do not pause for approval. Continue automatically if the target still shows meaningful failures or if the issue is high leverage. If baseline is already 90%+ and the remaining gap is minor, skip this target and move to the next candidate instead of optimizing for noise.
LOOP AUTONOMOUSLY. Do not pause between experiments unless you hit a real blocker or need unavailable information.
Each iteration:
1. Analyze failures. Which evals fail most? Read the actual outputs that failed. Identify the pattern: formatting issue, missing instruction, ambiguous directive?
2. Consult memory. Read self_improvement_memory.md and the best archive entries before proposing the next change. Check whether the current failure resembles a prior one, whether a previous mutation overcorrected, and whether two stepping stones should be combined.
3. Form ONE hypothesis. Pick one thing to change. Never change multiple things at once; otherwise you will not know what helped.
4. Make the mutation. Edit the target file with one targeted change. See target-specific mutation guide below.
5. Run all test inputs. Score every output against every eval.
6. Decide:
7. Update persistent memory. After every experiment, write down:
8. Archive stepping stones. Save every kept improvement and any especially informative near-miss into archive/ with a short note explaining why it matters. A stepping stone is any variant that teaches something reusable, not just the current winner.
9. Log the result in results.tsv and results.json.
10. Go back to step 1.
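The core of the iteration is a one-change-at-a-time hill climb. The sketch below shows only that skeleton, under stated assumptions: mutations and the evaluator are passed in as plain functions, a candidate is kept on a strictly higher score or on an equal score with shorter text, and memory consultation, archiving, and file logging are omitted.

```python
from typing import Callable


def improve(
    target: str,
    mutations: list[Callable[[str], str]],
    evaluate: Callable[[str], int],
) -> tuple[str, list[dict]]:
    """Apply one mutation per experiment; keep improvements, discard the rest."""
    best, best_score = target, evaluate(target)
    log = [{"id": 0, "score": best_score, "status": "baseline"}]
    for exp_id, mutate in enumerate(mutations, start=1):
        candidate = mutate(best)  # exactly one change, applied to the current best
        score = evaluate(candidate)
        # Assumed decision rule: higher score wins; a tie wins only if simpler.
        keep = score > best_score or (
            score == best_score and len(candidate) < len(best)
        )
        if keep:
            best, best_score = candidate, score
        log.append({"id": exp_id, "score": score, "status": "keep" if keep else "discard"})
    return best, log
```

Because each experiment mutates the current best rather than the previous candidate, a discarded change leaves no trace in the target, which is what makes the score history interpretable.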
Stop only when:
If stuck: Re-read failing outputs. Try combining two previous near-miss mutations. Try removing things instead of adding. Simplification that maintains the score is a win.
After every experiment (kept or discarded), append to auto-improve-[name]/changelog.md:
## Experiment [N] — [keep/discard]
**Score:** [X]/[max] ([percent]%)
**Change:** [One sentence describing what was changed]
**Reasoning:** [Why this change was expected to help]
**Result:** [Which evals improved or declined]
**Remaining failures:** [What still fails, if anything]
Also update auto-improve-[name]/self_improvement_memory.md with synthesized memory entries, not raw logs. Each entry should capture:
## Memory [N] — [short title]
**Context:** [target type + failure pattern]
**Hypothesis:** [causal belief about what would help]
**Outcome:** [what happened]
**Interpretation:** [why this likely happened]
**Transferability:** [where else this lesson should apply]
**Next move:** [forward-looking plan]
This file is the persistent memory of the improvement process itself. It must be actively consulted in later experiments and in later runs on related targets.
When the loop ends, present:
If more selected targets remain, continue to the next one instead of treating the first target as the whole job.
Before moving to the next target, review whether any lessons from the finished target should transfer. If yes, write them into the next target's seeded memory and cite the source archive entry.
Good mutations:
Bad mutations:
Mutation scope: the body text of SKILL.md — instructions, anti-patterns, examples, ordering.
Dynamic content injection: If the skill depends on context that changes per repo or session, you can embed !`command` placeholders in SKILL.md to inject live shell output at invocation time; the model only ever sees the result, not the raw placeholder. This requires the skill's frontmatter to declare allowed-tools for every tool the command needs (e.g. allowed-tools: Bash(git branch --show-current)). Treat injected commands with the same scrutiny as postinstall scripts: they run with full shell permissions. Prefer read-only introspection (git, cat, jq) over commands with side effects.
Good mutations:
description trigger phrases so the agent is invoked at the right moments
When to use section to reduce false positives and false negatives
Bad mutations:
Mutation scope: the frontmatter description, When to use, commands/skills tables, workflow steps.
AGENTS.md, PLAN.md, SPEC.md, SOUL.md, PRINCIPLES.md, DESIGN.md, README.md, ARCHITECTURE.md, TESTS.md, SETUP.md, RUNBOOK.md, CHANGELOG.md, SECURITY.md, OVERVIEW.md, FAQ.md, DECISIONS.md, DEPENDENCIES.md, CONTRIBUTING.md, TESTING.md, runbooks/**/*.md, docs/**/*.md)
Good mutations:
code-documentation contract consistently: Core docs (README.md, ARCHITECTURE.md, TESTS.md), Conditional docs (SETUP.md, RUNBOOK.md, CHANGELOG.md, SECURITY.md), Rare docs (OVERVIEW.md, FAQ.md, DECISIONS.md, DEPENDENCIES.md), root instruction docs, and runbooks/
runbooks/ or RUNBOOK.md when they are currently scattered across long docs
Bad mutations:
AGENTS.md, PLAN.md, SPEC.md, SOUL.md, PRINCIPLES.md, or DESIGN.md stale after the repo's operating model changes
Mutation scope: headings, ordering, wording, examples, checklists, cross-links, and stale-content removal inside the target documentation file.
AGENTS.md, PLAN.md, SPEC.md, SOUL.md, PRINCIPLES.md, or DESIGN.md when the missing rule is repo-wide rather than skill-specific.
runbooks/ or RUNBOOK.md when the missing content is an operational workflow rather than a policy or concept.
Every eval must be a yes/no question. Not a scale. Binary.
Why: Scales compound variability. Binary evals give a reliable signal across runs.
Good evals:
Bad evals:
Sweet spot: 3–6 evals. More than 6 and the target starts gaming the criteria instead of actually improving.
Max evals per target type:
All eval-loop targets produce files in auto-improve-[name]/:
auto-improve-[name]/
├── archive/ # stepping-stone variants and notes
├── dashboard.html # live browser dashboard (auto-refreshes every 10s)
├── self_improvement_memory.md # synthesized insights, causal hypotheses, transfer notes, next moves
├── results.json # data file powering the dashboard
├── results.tsv # score log (tab-separated)
├── changelog.md # mutation-by-mutation research log
└── [original-filename].baseline # original file before any changes
Memory audit produces:
auto-improve-memory/
├── audit-report.md # findings across all four audit dimensions
└── changelog.md # what was rewritten, created, or deleted
The improved target file is always saved back to its original location.
A good auto-improve run: