skills/windags-curator/SKILL.md
Post-execution skill crystallization and learning engine updates for WinDAGs. Runs after successful execution to update Thompson sampling parameters, track method quality, detect monster-barring, log near-miss events, and signal Kuhnian crises. Activate on "curator", "learning update", "skill crystallization", "Thompson sampling", "monster-barring", "near-miss", "Kuhnian crisis", "post-execution learning". NOT for pre-execution risk scanning (use windags-premortem), retrospective analysis (use windags-looking-back), or DAG construction (use windags-architect).
npx skillsauth add curiositech/windags-skills windags-curatorInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Update the learning engine after every execution. Track skill and method quality with Thompson sampling. Detect monster-barring and Kuhnian crises. Crystallize new skills from execution traces. Produce a LearningResult that feeds back into the Knowledge Library.
Model Tier: Tier 1 (Haiku-class) Behavioral Contracts: BC-LEARN-001 through BC-LEARN-006
Use this skill when:
Do NOT use for:
windags-premortem)windags-looking-back)windags-mutator)Update after every node using the EVALUATOR score, not self-assessment.
For each completed node:
skill = node.assigned_skill
evaluator_score = node.evaluator_result.quality_score # 0.0 to 1.0
skill.thompson.alpha += evaluator_score
skill.thompson.beta += (1.0 - evaluator_score)
Self-assessment is excluded. Principle 8 (Self-Eval Is Unreliable) forbids using a node's own quality estimate. Only the EVALUATOR's score drives Thompson updates.
Track method quality independently of skill quality. A skill wraps a method; they can diverge.
For each completed node:
method = node.assigned_skill.method
evaluator_score = node.evaluator_result.quality_score
method.thompson.alpha += evaluator_score
method.thompson.beta += (1.0 - evaluator_score)
Scenarios where this distinction matters:
Log a NearMissEvent when a node passes within 10% of its quality threshold.
threshold = node.quality_threshold # e.g., 0.70
score = node.evaluator_result.quality_score
margin = score - threshold
if 0 < margin <= (threshold * 0.10):
log NearMissEvent:
node_id: node.id
skill_id: node.assigned_skill.id
score: score
threshold: threshold
margin: margin
context: node.execution_context
Near-misses are more valuable than clean passes or clear failures. They reveal boundary conditions where the skill is fragile.
On every skill revision, track the growth rate of NOT_FOR clauses relative to WHEN_TO_USE clauses.
flowchart TD
REV[Skill revision triggered] --> TRACK[Count NOT_FOR vs WHEN_TO_USE changes]
TRACK --> RATIO{NOT_FOR growth > WHEN_TO_USE growth?}
RATIO -->|Yes, over last 3 revisions| ALERT[Monster-Barring ALERT]
RATIO -->|No| OK[Healthy skill evolution]
ALERT --> LOG[Log to LearningResult.monster_barring_alerts]
This is a Lakatosian degenerating research programme signal. A skill that responds to failure by narrowing its scope rather than improving its capability is degenerating. It is "barring monsters" -- excluding counterexamples instead of accommodating them.
Track these counters per skill:
not_for_additions: Count of NOT_FOR items added across revisionswhen_to_use_additions: Count of WHEN_TO_USE items added across revisionsrevision_count: Total revisionsAlert when: not_for_additions > when_to_use_additions over the last 3 consecutive revisions.
All learning state must be compatible with CRDTs for future distribution.
alpha, beta) are additive -- they form a G-Counter naturally.Never subtract from Thompson parameters. Never delete quality history entries. Never reset counters.
When execution traces reveal a reusable pattern, crystallize it into a new skill draft.
All four conditions must be met:
3+ verified successes: The pattern has produced EVALUATOR-verified quality scores >= threshold in at least 3 distinct executions.
Average quality >= 0.75: Mean EVALUATOR score across all uses is at least 0.75.
Pattern is generalizable: The pattern applies beyond the specific problem instance. Check:
Not a duplicate: No existing skill covers the same signature + method combination with overlapping context conditions.
flowchart TD
TRACES[Execution traces] --> FILTER[Filter: quality >= 0.75, count >= 3]
FILTER --> GEN{Generalizable?}
GEN -->|No| SKIP[Skip — too specific]
GEN -->|Yes| DUP{Duplicate of existing skill?}
DUP -->|Yes| MERGE[Merge evidence into existing skill]
DUP -->|No| DRAFT[Draft new skill]
DRAFT --> INIT[Initialize Thompson: alpha=sum(scores), beta=sum(1-scores)]
INIT --> CANDIDATE[Add to crystallization_candidates]
A crystallized skill draft contains:
name: Derived from the method and domaindescription: What it does, generated from execution contextmethod: The method that was usedsignature: Input/output types extracted from tracesprompt_template: Generalized from the specific prompts usedinitial_thompson: Pre-seeded from execution history (not cold-start)source_executions: References to the traces that produced this skillCrystallized skills are candidates, not final. They require human review or 3 more successful uses before promotion to the Knowledge Library.
Detect when a skill's actual quality distribution has diverged significantly from its expected distribution. This signals a paradigm shift -- the skill's model of the problem domain is no longer accurate.
Compute PSI as the Hellinger distance between the expected and actual quality distributions.
Given:
expected = Beta(skill.thompson.alpha, skill.thompson.beta)
actual = empirical distribution of last N evaluator scores (N = min(20, available))
PSI = Hellinger_distance(expected, actual)
The Hellinger distance ranges from 0 (identical distributions) to 1 (completely different).
| PSI Value | Signal | Action |
|-----------|--------|--------|
| < 0.15 | Normal | No action |
| 0.15 - 0.24 | Drift | Log for monitoring |
| 0.25 - 0.39 | Pre-crisis | Flag in crisis_signals, increase exploration rate for this skill |
| >= 0.40 | Crisis | Search for replacement skill, consider skill retirement |
When PSI >= 0.25:
crisis_signals in the LearningResultRun this pipeline after every DAG execution.
flowchart TD
EX[Execution complete] --> NODES[Iterate over completed nodes]
NODES --> TS[1. Thompson Sampling Updates — BC-LEARN-001]
TS --> MQ[2. Method Quality Updates — BC-LEARN-002]
MQ --> NM[3. Near-Miss Detection — BC-LEARN-004]
NM --> MB[4. Monster-Barring Check — BC-LEARN-005]
MB --> KC[5. Kuhnian Crisis Detection]
KC --> SC[6. Skill Crystallization Check]
SC --> RESULT[Produce LearningResult]
Steps 1-4 run per node. Steps 5-6 run per unique skill used in the DAG.
Produce a LearningResult with these fields:
LearningResult:
thompson_updates:
- skill_id: string
alpha_delta: number # Amount added to alpha
beta_delta: number # Amount added to beta
new_alpha: number # Updated alpha
new_beta: number # Updated beta
node_count: number # How many nodes used this skill
method_updates:
- method_id: string
alpha_delta: number
beta_delta: number
new_alpha: number
new_beta: number
crystallization_candidates:
- name: string
method: string
signature: string # Input/output type description
source_execution_count: number
average_quality: number
initial_alpha: number
initial_beta: number
status: "candidate" | "promoted"
monster_barring_alerts:
- skill_id: string
not_for_growth_rate: number # Additions per revision
when_to_use_growth_rate: number
revision_window: number # How many revisions analyzed
severity: "warning" | "critical"
recommendation: string
near_miss_events:
- node_id: string
skill_id: string
score: number
threshold: number
margin: number
crisis_signals:
- skill_id: string
psi: number # Hellinger distance
level: "drift" | "pre-crisis" | "crisis"
expected_mean: number
actual_mean: number
replacement_candidates: [string] # Skill IDs, if any
The Curator sits after the Evaluator in the meta-DAG pipeline:
flowchart LR
EV[Evaluator] -->|success| CU[Curator]
EV -->|failure| MU[Mutator]
CU --> LB[Looking Back]
CU -.->|updates| KL[Knowledge Library]
CU -.->|updates| FPL[Failure Pattern Library]
The Curator writes to two persistent stores:
The Curator does not block the pipeline. The Looking Back agent can start immediately; the Curator's writes to the Knowledge Library are asynchronous.
| Operation | Target | |-----------|--------| | Thompson update per node | < 50ms | | Near-miss check per node | < 10ms | | Monster-barring check per skill | < 100ms | | PSI computation per skill | < 200ms | | Crystallization check | < 500ms | | Total Curator overhead | < 2% of total execution cost |
The Curator must never become expensive enough to discourage frequent execution. Learning is the moat (Principle 10); making learning costly defeats the purpose.
tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.