Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

ooiyeefei/self-improving-systems

Name: self-improving-systems
Author: ooiyeefei

skills/self-improving-systems/SKILL.md

npx skillsauth add ooiyeefei/ccc self-improving-systems

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Self-Improving Systems

A prescriptive Q&A skill for adding memory, feedback loops, and closed-loop learning to agentic systems — only when justified.

Headline message: most agents shouldn't have persistent memory.

Memory is a liability surface (drift, poisoning, debugging difficulty, GDPR/HIPAA exposure). Persistent memory is the second move, not the first. The skill's job is to filter ruthlessly so the user doesn't ship a mem0/Letta build for a problem that a 200-line conversation summary would solve.

The first 2 stages of the Q&A flow exist to stop most users from over-engineering. By the end of stage 2, ~60% of users will discover they want a state cache (or stateless RAG), not memory + learning. That's the win.

Quick Start

User just asks:

"Add memory to my agent"
"My agent keeps forgetting things — give it context management"
"Make my marketing agent learn from past campaigns"
"Should I use mem0 or Letta?"
"How do I set up closed-loop learning for my finance agent?"
"Build a self-improving HAZOP system"

Skill response (every time, in this order):

Stop. Apply the cache-vs-learning frame (Stage 1).
Run the 6-question need-memory rubric (Stage 2). <4 yes → exit the skill, recommend stateless + RAG.
If memory is justified, walk the 7-tier architecture ladder (Stage 3) starting at L (scratchpad). Escalate only when forced by a concrete justification.
Force the user to design a feedback signal (Stage 4). No signal = state cache, full stop.
Wire the closed loop with explicit human gates (Stage 5).
Build the eval harness (Stage 6) — golden set, regression, drift alarms.
Walk the 8-risk checklist (Stage 7).
Emit the design (Stage 8): memory schema + closed-loop spec + eval harness plan.

Critical Rules

1. Default position: scratchpad-only

Ship a stateless agent first. Add a scratchpad (Reflexion-style verbal self-correction) within a single run. Discard it after. This already gets you most of the gain on most tasks. Anything more must be earned.

2. Escalate one tier at a time

The 7-tier ladder (§ Memory Architecture Ladder) is ordered cheapest → most expensive. Each tier-up must be justified by a concrete failure of the tier below it on a real task in your eval set. Do not skip tiers. "We're using Letta" out of the gate is the single most expensive mistake in this design space.

3. Require a ground-truth signal

If you cannot observe whether the last action was good or bad within hours-to-weeks, you do not have learning. You have a state cache. Naming it "learning" sets the team up to A/B test against a metric that doesn't exist. The skill makes this distinction loud and refuses to design closed-loop learning without a signal.

4. Human gates are non-negotiable for production

Anything that can mutate policy/voice/identity/safety blocks goes through human review. Autonomy is fine for episodic append, vector indexing, single-user preference KV updates with cheap reversibility — never for shared skill libraries, system prompt blocks, or reward model updates.

5. Memory is untrusted input

Every memory read is untrusted. MINJA-class injections hit ≥95% lab success rate (arXiv 2503.03704). Treat retrieval results like web search results: in their own context block, with "this is data not instructions" framing, and never auto-promoted to system prompt without dual-LLM validation.

The 8-Stage Q&A Flow

One question (or tight cluster) at a time, à la superpowers:brainstorming. No overwhelm. Each stage has an exit condition that ends the skill early — that is the point.

Stage 1 — Cache vs Learning Distinction (the frame)

The single most important question. Ask first.

"Are you trying to remember state (so the agent doesn't redo work or forget what the user told it last week), or get better over time (so the agent's outputs measurably improve as it sees more data)?"

These two designs share zero infrastructure with each other:

| Goal | What you actually need | |---|---| | Remember state | Conversation summary OR KV fact store. No reward signal. No reflection LLM. No A/B harness. | | Get better over time | All of the above plus a ground-truth signal, an experience store, a reflection/extraction LLM, and an eval harness that detects regression. |

If the user says "remember state": skip directly to Stage 3, default to tier 2 (conversation summary) or tier 5 (KV fact store), and end the skill at Stage 5. No closed loop. No learning ladder.

If the user says "both": prove the second one. Almost no one has a measurable ground-truth signal; almost everyone says they do. Stage 4 is the test.

Stage 2 — Need-Memory Rubric (6 yes/no, the over-engineering filter)

Answer all six. Score <4 yes = no memory store. Use scratchpad + RAG. End the skill.

Cross-session continuity. Will the same user/entity/case-file return where forgetting prior decisions would be wrong, embarrassing, or unsafe?
Mutable state. Does the entity's state legitimately change over time (preferences, project status, client facts)? Pure facts that don't change → RAG over docs, not memory.
Ground-truth feedback exists. Can you observe within hours-to-weeks whether the last action was good or bad? No signal → no learning, only state cache.
Cost of being wrong > cost of memory infra. Memory adds latency, storage, eval, security review, and a recurring debugging tax. Pencil out both sides.
Volume justifies it. Same user returns ≥5 times. <5 returns → in-context summary is cheaper than vector store.
You can audit and redact. GDPR/HIPAA: can you delete on request, export, explain a memory? If no, do not store one.

If you got "yes" only on (1) and (2): you need a state cache, not memory + learning. Say it out loud. Skill recommends tier 2 or 5 and exits.

Stage 3 — Architecture Selection (start at L tier)

Walk the 7-tier memory architecture ladder (next section). Default recommendation: tier 1 (scratchpad-only). Escalate exactly one tier per concrete justification. Justification = "tier N fails on this specific task in our eval set, here's the trace."

Most "we need memory" requests resolve at tier 2 (conversation summary) or tier 5 (KV fact store). Tier 6 (graph) and tier 7 (hierarchical OS-style / Letta) require >3 entities × >50 relationships and a real long-horizon agent, not a chatbot.

Deep dive: references/architectures.md

Stage 4 — Feedback Signal Design

If Stage 1 ended with "remember state only", skip this stage.

For learning, the signal determines everything. Walk the per-domain table:

| Domain | Signal | Latency | Risk | |---|---|---|---| | Marketing / content | Engagement deltas (CTR, dwell, conversion, save/share) + variant A/B win-rate + brand-safety review | hours-days | Vanity metrics → reward hacking; mitigate with composite reward + brand-fidelity LLM-judge | | Finance / compliance | Audit findings, reconciliation breaks, regulator outcomes | weeks | Sparse signal → use intermediate proxies + sparse human signoff (hybrid RLAIF) | | HAZOP / safety | Incident-DB recall (held-out incident set), expert reviewer agreement | continuous | Never let agent's own write-back update incident DB | | Tutorials / education | Completion rate, comprehension quiz scores, time-to-first-success | minutes-days | Cleanest closed loop — verifier is cheap and online | | Code-emitting agents | Unit tests, type-check, runtime | minutes | The gold standard — verifier is free and deterministic | | General LLM-as-judge | Held-out judge with calibrated rubric | continuous | Sample-audit 5–10% against humans to catch drift |

Rule, repeat once per Q&A session: No signal = state cache, not learning. If the user can't name a signal, do not design a learning loop. Recommend they ship the state cache first, instrument the signal in production, and revisit the skill in a quarter.

Deep dive: references/feedback-signals.md

Stage 5 — Closed-Loop Wiring with Human Gates

If Stage 4 produced no signal, skip this stage and the next two.

The reference closed loop:

[run event: input + agent trace + outputs]
       │
       ▼
[signal collector] ──── engagement / verifier / human review (async)
       │
       ▼
[experience store] (append-only, immutable, signed)
       │   ├── episodic events (raw)
       │   ├── extracted facts (KV)        ← extraction LLM, validated
       │   └── learned skills/playbooks    ← reflection LLM, human-gated
       │
       ▼
[retrieval layer] (hybrid: vector + BM25 + entity link)
       │
       ▼
[state mutator]
       │   ├── AUTONOMOUS: low-risk fields (recency, prefs)
       │   └── HUMAN-GATED: anything that changes policy/voice/identity
       │
       ▼
[next run] ─── core memory in prompt + retrieved episodic + skill lookup

Where humans gate (non-negotiable for production):

Promotion of any item to "core memory" / system-prompt block
Schema changes in graph memory
Skill-library additions used by >1 user (Voyager-style accumulation needs review when shared)
Reward model updates / fine-tunes from agent feedback

Where it can be autonomous: episodic append, vector indexing, retrieval scoring tweaks, single-user preference KV updates with cheap reversibility, Reflexion-style within-task verbal self-correction (lives in scratchpad, not persistent memory).

Stage 6 — Eval Harness

Six patterns, ship at least the first three before going live:

Golden set — 50–500 hand-curated (input, expected behavior, expected memory side-effect) tuples; include adversarial / poisoning attempts.
Regression on memory side-effects — assert get(user, "allergies") == ["peanut"] after run X.
Drift alarms via OpenTelemetry GenAI semconv — judge-score rolling mean, memory-store size growth rate, retrieval hit-rate distribution, % of runs that mutate core memory.
A/B between agent versions — slice traffic, compare composite reward over fixed window.
LLM-as-judge with human calibration — 5–10% audit; recompute judge–human Cohen's κ weekly.
Held-out human-written tasks — never trained on; detects distribution collapse from self-play.

Deep dive: references/eval-harness.md

Stage 7 — Risks Checklist

Walk all 8 once. Each must have a concrete mitigation in the design doc.

Memory poisoning (MINJA, ≥95% lab injection success)
Prompt injection via memory
Reward hacking
Drift / staleness
Context rot / window blowup (200K models often unreliable past ~130K)
Runaway self-modification
Distribution collapse in self-play
Multi-agent context explosion

Deep dive: references/risks.md

Stage 8 — Output

Produce the design document:

Memory schema — chosen tier(s), data model, retention/TTL, redaction hooks
Closed-loop spec — signal source, collector, experience store, retrieval, mutator, human-gate list
Eval harness plan — golden set sketch, regression assertions, OTel metric list, A/B split, judge-calibration cadence
Risk register — 8 risks × 1 mitigation each
Build order — what ships in week 1 (state cache only), week 4 (signal collection on production traffic), week 12 (closed loop activated behind feature flag)

Memory Architecture Ladder (escalate only when justified)

L → L → M → M → M → H → XH
1    2    3    4    5    6    7

| # | Architecture | Use case | Cost | Pitfall | Citation | |---|---|---|---|---|---| | 1 | Scratchpad-only (in-run, discarded) | Multi-step reasoning within one task; ReAct loops; debate transcripts | L | Don't fake durability — make it obvious to LLM and ops nothing persists | Reflexion | | 2 | Conversation summary (rolling LLM compaction into system prompt) | Single-session chat, support tickets, ≤1 day horizon | L | Summaries lossy-compress unpredictably; pin facts verbatim, summarize narrative | Anthropic context engineering | | 3 | Episodic stream (append-only event log, recency × importance × relevance retrieval) | Long-running personas, simulations, journal-style apps where order matters | M | Bespoke scoring; without reflection, bloats fast | Generative Agents (Park et al., 2023) | | 4 | Vector RAG over interactions | Knowledge retrieval, FAQ, doc Q&A, low-personalization | M | Reactive only — won't surface "favorite color" on "birthday" query | Letta — RAG vs Agent Memory | | 5 | Key-value fact store (mem0 single-pass ADD) | Personalization (name, prefs, history), CRM-like agents | M | Bad extractors poison store; need write-time validators | mem0 paper | | 6 | Graph memory (mem0g, AriGraph) | Multi-hop reasoning over relationships | H | Schema drift kills you; LLM-extended schemas degrade into vector store with extra steps | mem0g | | 7 | Hierarchical OS-style (Letta / MemGPT, agent self-edits via tools) | High-stakes long-horizon agents | XH | Self-editing memory is prompt-injection bomb on untrusted input | MemGPT, Letta |

Default recommendation in the skill: start at #1, escalate one tier at a time. Many "we need memory" requests are actually #2.

Deep dive: references/architectures.md

Anti-Patterns (load-bearing — call out before user picks the wrong path)

| Anti-pattern | Test | Fix | |---|---|---| | Memory because it's cool | Adding mem0/Letta to a one-shot pipeline | Skip memory. Stateless + RAG. | | Cache labeled "memory" | No feedback signal exists in the user's domain | Honest naming: call it a "state cache" not "learning". Design accordingly. | | Vector RAG for personalization | "What's my favorite color?" returns nothing because the user never asked it; embeddings can't surface unprompted facts | KV fact store, not vector RAG | | Self-editing memory on untrusted input | Letta with user-pasted content writing into core memory | Quarantined-LLM pattern; never untrusted source → core memory | | Reward hacking via vanity metrics | Engagement-only signal → clickbait drift; finance "% reviewed" → rubber-stamping | Composite rewards: engagement + brand-fidelity judge + sample audit; finance: composite includes materiality threshold + reviewer agreement | | Memory as the first move | Building memory store before the stateless agent has shipped | Ship stateless first. Instrument the signal. Decide a quarter later. | | Graph memory by default | Modeling 1 brand's 5 competitors as a graph | Stay in KV+vector until >3 entities × >50 relationships. Graph schemas drift; LLM-extended schemas degrade into vector stores with extra steps. | | Self-play with no external verifier | Agent training on its own outputs, no held-out signal | Pin a verifier external to the model. V-STaR / Quiet-STaR loops without external verification narrow capability. | | Forgetting context-rot | Stuffing 130K of memory into context "because the model supports 200K" | Compaction + retrieval + sub-agent isolation; 200K models often unreliable past ~130K (Anthropic) |

Self-Improvement Playbook Ladder (cheapest first)

Reflexion → Generative Agents → Voyager → mem0 → Letta
   1            2                  3        4      5

| Tier | Pattern | When | Citation | |---|---|---|---| | 1 | In-loop verbal correction, no persistence | Cheapest learning; the first move before ANY memory store. ~91% pass@1 HumanEval at the time of publication. Lives in the scratchpad. | Reflexion (Shinn et al., 2023) | | 2 | Long-horizon persona / social sims | Memory stream + reflection + planning loop. For agents that need to act in character over days/weeks. | Generative Agents (Park et al., 2023) | | 3 | Skill-library accumulation | Tool-using agents solving novel-but-related tasks; "what worked for Brand X in vertical Y" patterns. | Voyager (Wang et al., 2023) | | 4 | Production fact memory | Chat-like personalization at scale. 91.6 LoCoMo, ~90% token savings vs full-context. | mem0 (arXiv 2504.19413, ECAI 2025) | | 5 | Self-editing hierarchical memory | Highest power, highest attack surface. Use only when long-horizon autonomy is the product, not a nice-to-have. | MemGPT → Letta |

The skill walks the user up this ladder only when justified by a concrete failure of the tier below. Most production systems sit at tier 1 + tier 4. Tier 5 is appropriate for <5% of agentic projects.

Deep dive: references/playbook-ladder.md

Reference Files

| File | Contents | |---|---| | references/architectures.md | Deep-dive on the 7 memory architectures with cost ratings L→XH | | references/feedback-signals.md | Per-domain feedback signal design + the no-signal-no-learning rule | | references/eval-harness.md | The 6 eval patterns: golden set, regression, drift alarms, A/B, judge calibration, held-out tasks | | references/risks.md | The 8 risks with citations and mitigations (MINJA, prompt injection, reward hacking, drift, context rot, runaway self-mod, distribution collapse, multi-agent explosion) | | references/playbook-ladder.md | Reflexion → Generative Agents → Voyager → mem0 → Letta progression | | references/case-studies.md | Brandling Mutation Engine "state cache, not learning" lesson + marketing/finance/HAZOP/tutorial-gen worked examples through the memory/feedback lens |

Examples

The examples/ directory will hold:

reflexion-loop.md — cheapest first move, scratchpad-only
kv-store-mem0.md — production personalization with extraction validation
eval-harness.md — golden set runner with regression assertions

Output Contract

A skill run is complete when the user has:

A documented answer to "cache or learning?" (Stage 1).
A scored need-memory rubric (Stage 2).
A chosen architecture tier with justification for not stopping at the previous tier (Stage 3).
(If learning) a named feedback signal with latency, source, and mitigations (Stage 4).
(If learning) a closed-loop spec with human gates marked explicitly (Stage 5).
An eval harness plan with at least patterns 1–3 from §3.8 (Stage 6).
A risk register: 8 rows × 1 mitigation each (Stage 7).
A build order showing what ships when (Stage 8).

If the user wants to skip steps, the skill refuses. The whole point is the filter.

Design Philosophy

Memory is a liability surface. The cheapest memory is the one you didn't add.

Every memory tier you add carries a recurring debugging tax (why did it remember that? why did it forget this?), a security tax (every read is untrusted input), a privacy tax (GDPR/HIPAA delete-on-request), and an eval tax (regression on memory side-effects). Stateless agents fail in ways you can reproduce by re-running the input. Memoryful agents fail in ways you can't.

The skill's stance: earn each tier with a real failure on a real eval set. When in doubt, ship the lower tier and instrument the signal. Decide next quarter.

ooiyeefei/self-improving-systems

skills/self-improving-systems/SKILL.md

Decide whether your agent actually needs persistent memory, feedback loops, or closed-loop learning, then design the smallest thing that pays for itself. Use when the user says "add memory", "give my agent context management", "make my agent learn", "self-improving / closed-loop", "Reflexion / mem0 / Letta / MemGPT", "AriGraph", "agent memory architecture", "long-term memory for chatbot", "why does my agent keep forgetting / making the same mistake", "fine-tune from agent traces", or asks for a memory schema / experience store / reward model. Filters ruthlessly — most teams want a state cache, not memory + learning. Default position is scratchpad-only with a stateless agent shipped first.

441 stars

testing

Updated Jun 12, 2026

$ install --global

skillsauth

npx skillsauth add ooiyeefei/ccc self-improving-systems

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Jun 12, 2026, 6:34 AM135.9s11 files scanned

SKILL.md

name:: self-improving-systems
description:: Decide whether your agent actually needs persistent memory, feedback loops, or closed-loop learning, then design the smallest thing that pays for itself. Use when the user says "add memory", "give my agent context management", "make my agent learn", "self-improving / closed-loop", "Reflexion / mem0 / Letta / MemGPT", "AriGraph", "agent memory architecture", "long-term memory for chatbot", "why does my agent keep forgetting / making the same mistake", "fine-tune from agent traces", or asks for a memory schema / experience store / reward model. Filters ruthlessly — most teams want a state cache, not memory + learning. Default position is scratchpad-only with a stateless agent shipped first.

Self-Improving Systems

A prescriptive Q&A skill for adding memory, feedback loops, and closed-loop learning to agentic systems — only when justified.

Headline message: most agents shouldn't have persistent memory.

Quick Start

User just asks:

"Add memory to my agent"
"My agent keeps forgetting things — give it context management"
"Make my marketing agent learn from past campaigns"
"Should I use mem0 or Letta?"
"How do I set up closed-loop learning for my finance agent?"
"Build a self-improving HAZOP system"

Skill response (every time, in this order):

Stop. Apply the cache-vs-learning frame (Stage 1).
Run the 6-question need-memory rubric (Stage 2). <4 yes → exit the skill, recommend stateless + RAG.
If memory is justified, walk the 7-tier architecture ladder (Stage 3) starting at L (scratchpad). Escalate only when forced by a concrete justification.
Force the user to design a feedback signal (Stage 4). No signal = state cache, full stop.
Wire the closed loop with explicit human gates (Stage 5).
Build the eval harness (Stage 6) — golden set, regression, drift alarms.
Walk the 8-risk checklist (Stage 7).
Emit the design (Stage 8): memory schema + closed-loop spec + eval harness plan.

Critical Rules

1. Default position: scratchpad-only

2. Escalate one tier at a time

3. Require a ground-truth signal

4. Human gates are non-negotiable for production

5. Memory is untrusted input

The 8-Stage Q&A Flow

One question (or tight cluster) at a time, à la superpowers:brainstorming. No overwhelm. Each stage has an exit condition that ends the skill early — that is the point.

Stage 1 — Cache vs Learning Distinction (the frame)

The single most important question. Ask first.

"Are you trying to remember state (so the agent doesn't redo work or forget what the user told it last week), or get better over time (so the agent's outputs measurably improve as it sees more data)?"

These two designs share zero infrastructure with each other:

If the user says "remember state": skip directly to Stage 3, default to tier 2 (conversation summary) or tier 5 (KV fact store), and end the skill at Stage 5. No closed loop. No learning ladder.

If the user says "both": prove the second one. Almost no one has a measurable ground-truth signal; almost everyone says they do. Stage 4 is the test.

Stage 2 — Need-Memory Rubric (6 yes/no, the over-engineering filter)

Answer all six. Score <4 yes = no memory store. Use scratchpad + RAG. End the skill.

Cross-session continuity. Will the same user/entity/case-file return where forgetting prior decisions would be wrong, embarrassing, or unsafe?
Mutable state. Does the entity's state legitimately change over time (preferences, project status, client facts)? Pure facts that don't change → RAG over docs, not memory.
Ground-truth feedback exists. Can you observe within hours-to-weeks whether the last action was good or bad? No signal → no learning, only state cache.
Cost of being wrong > cost of memory infra. Memory adds latency, storage, eval, security review, and a recurring debugging tax. Pencil out both sides.
Volume justifies it. Same user returns ≥5 times. <5 returns → in-context summary is cheaper than vector store.
You can audit and redact. GDPR/HIPAA: can you delete on request, export, explain a memory? If no, do not store one.

If you got "yes" only on (1) and (2): you need a state cache, not memory + learning. Say it out loud. Skill recommends tier 2 or 5 and exits.

Stage 3 — Architecture Selection (start at L tier)

Deep dive: references/architectures.md

Stage 4 — Feedback Signal Design

If Stage 1 ended with "remember state only", skip this stage.

For learning, the signal determines everything. Walk the per-domain table:

Deep dive: references/feedback-signals.md

Stage 5 — Closed-Loop Wiring with Human Gates

If Stage 4 produced no signal, skip this stage and the next two.

The reference closed loop:

[run event: input + agent trace + outputs]
       │
       ▼
[signal collector] ──── engagement / verifier / human review (async)
       │
       ▼
[experience store] (append-only, immutable, signed)
       │   ├── episodic events (raw)
       │   ├── extracted facts (KV)        ← extraction LLM, validated
       │   └── learned skills/playbooks    ← reflection LLM, human-gated
       │
       ▼
[retrieval layer] (hybrid: vector + BM25 + entity link)
       │
       ▼
[state mutator]
       │   ├── AUTONOMOUS: low-risk fields (recency, prefs)
       │   └── HUMAN-GATED: anything that changes policy/voice/identity
       │
       ▼
[next run] ─── core memory in prompt + retrieved episodic + skill lookup

Where humans gate (non-negotiable for production):

Promotion of any item to "core memory" / system-prompt block
Schema changes in graph memory
Skill-library additions used by >1 user (Voyager-style accumulation needs review when shared)
Reward model updates / fine-tunes from agent feedback

Stage 6 — Eval Harness

Six patterns, ship at least the first three before going live:

Golden set — 50–500 hand-curated (input, expected behavior, expected memory side-effect) tuples; include adversarial / poisoning attempts.
Regression on memory side-effects — assert get(user, "allergies") == ["peanut"] after run X.
Drift alarms via OpenTelemetry GenAI semconv — judge-score rolling mean, memory-store size growth rate, retrieval hit-rate distribution, % of runs that mutate core memory.
A/B between agent versions — slice traffic, compare composite reward over fixed window.
LLM-as-judge with human calibration — 5–10% audit; recompute judge–human Cohen's κ weekly.
Held-out human-written tasks — never trained on; detects distribution collapse from self-play.

Deep dive: references/eval-harness.md

Stage 7 — Risks Checklist

Walk all 8 once. Each must have a concrete mitigation in the design doc.

Memory poisoning (MINJA, ≥95% lab injection success)
Prompt injection via memory
Reward hacking
Drift / staleness
Context rot / window blowup (200K models often unreliable past ~130K)
Runaway self-modification
Distribution collapse in self-play
Multi-agent context explosion

Deep dive: references/risks.md

Stage 8 — Output

Produce the design document:

Memory schema — chosen tier(s), data model, retention/TTL, redaction hooks
Closed-loop spec — signal source, collector, experience store, retrieval, mutator, human-gate list
Eval harness plan — golden set sketch, regression assertions, OTel metric list, A/B split, judge-calibration cadence
Risk register — 8 risks × 1 mitigation each
Build order — what ships in week 1 (state cache only), week 4 (signal collection on production traffic), week 12 (closed loop activated behind feature flag)

Memory Architecture Ladder (escalate only when justified)

L → L → M → M → M → H → XH
1    2    3    4    5    6    7

Default recommendation in the skill: start at #1, escalate one tier at a time. Many "we need memory" requests are actually #2.

Deep dive: references/architectures.md

Anti-Patterns (load-bearing — call out before user picks the wrong path)

Self-Improvement Playbook Ladder (cheapest first)

Reflexion → Generative Agents → Voyager → mem0 → Letta
   1            2                  3        4      5

The skill walks the user up this ladder only when justified by a concrete failure of the tier below. Most production systems sit at tier 1 + tier 4. Tier 5 is appropriate for <5% of agentic projects.

Deep dive: references/playbook-ladder.md

Reference Files

Examples

The examples/ directory will hold:

reflexion-loop.md — cheapest first move, scratchpad-only
kv-store-mem0.md — production personalization with extraction validation
eval-harness.md — golden set runner with regression assertions

Output Contract

A skill run is complete when the user has:

A documented answer to "cache or learning?" (Stage 1).
A scored need-memory rubric (Stage 2).
A chosen architecture tier with justification for not stopping at the previous tier (Stage 3).
(If learning) a named feedback signal with latency, source, and mitigations (Stage 4).
(If learning) a closed-loop spec with human gates marked explicitly (Stage 5).
An eval harness plan with at least patterns 1–3 from §3.8 (Stage 6).
A risk register: 8 rows × 1 mitigation each (Stage 7).
A build order showing what ships when (Stage 8).

If the user wants to skip steps, the skill refuses. The whole point is the filter.

Design Philosophy

Memory is a liability surface. The cheapest memory is the one you didn't add.

The skill's stance: earn each tier with a real failure on a real eval set. When in doubt, ship the lower tier and instrument the signal. Decide next quarter.

Related Skills

ooiyeefei/agentic-system-design

tools

VerifiedTrustedCommunity

Prescriptive Q&A workflow for designing agentic pipelines, multi-model councils, sub-agent hierarchies, and tool-loop hardening for any domain. Use when the user asks to "design an agent", "design a multi-agent system", "should I use a council/debate", "build a [domain] review agent" (HAZOP, finance, tutorial, marketing, compliance, accounting), "real agency vs workflow", "how to add sub-agents", "AI for [domain] review", or names patterns like "orchestrator-worker", "evaluator-optimizer", "Magentic", "ReAct", "plan-and-execute", "handoffs". Walks the user through 12 stages one question at a time and emits a buildable design doc with citations. Do NOT use for general coding questions, single-shot prompt tuning, or bare "use Claude to do X" requests with no agency requirement.

441SKILL.mdUpdated May 9, 2026

ooiyeefei/agentic-system-design

ooiyeefei/landing-page-gtm

development

VerifiedTrustedCommunity

Build and update high-converting SaaS landing pages with GTM-aware marketing copy, competitive positioning, and sales psychology. Use when creating new landing pages, rewriting feature cards, updating marketing copy, launching product pages, or transforming technical features into customer-facing sales language. Triggers on "build landing page", "update feature cards", "rewrite marketing copy", "create product page", "launch page", "GTM", "sales copy", "competitive positioning", or when converting product features into conversion-focused web pages.

441SKILL.mdUpdated May 6, 2026

ooiyeefei/landing-page-gtm

ooiyeefei/htmldrop

development

VerifiedTrustedCommunity

This skill should be used when the user asks to "share this HTML", "publish HTML", "get a link for this file", "share this report", "make this shareable", "upload this HTML", or wants to publish any HTML artifact for others to view. ALSO use it for collaborative review on an HTML doc/spec/report — triggers include "get feedback on this", "let reviewers comment", "collect feedback", "share for review", "let people annotate this", "synthesize the feedback", "converge the feedback", "what did reviewers say", "incorporate the comments", or "improve this from the feedback". Wraps Surge.sh for zero-cost hosting with guided privacy options, plus an embedded annotation + AI converge workflow.

428SKILL.mdUpdated May 15, 2026

ooiyeefei/uat-testing

tools

VerifiedTrustedCommunity

End-to-end User Acceptance Testing for web applications. Analyzes branch changes and specs to generate exhaustive test cases, sets up the local environment, executes tests via Playwright browser automation, and produces a pass/fail results report with screenshots and fix documentation. Use when the user says "run UAT", "test this feature", "UAT testing", "acceptance test", "test my branch", "generate test cases", or wants to verify a feature branch against its spec before merge.

416SKILL.mdUpdated May 6, 2026

ooiyeefei/uat-testing

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/ooiyeefei/ccc.git

# Copy into Claude Code skills folder (global)
cp -r ccc/skills/self-improving-systems ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

ooiyeefei/ccc

441 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT