skills/golem-powers/weave/SKILL.md
Orchestrator-only convergence workflow: mine recent Claude/Codex JSONLs, weave cited findings into an action ledger, route every finding to a disposition, then red-team facts against raw logs. Triggers: weave, weave now, run weave, session weave, convergence weave. Use only when fleet is quiet; NOT for single-session mining or web research.
npx skillsauth add etanhey/golems weaveInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
/weaveAt convergence, fan out deep cross-session mining, then prove each finding turned into a real change — or kill it.
/weave is an orchestrator skill. The orchestrator wields it and relays the
mining sub-tasks to workers; workers do not self-invoke it. It operates
across sessions and repos at the moment the fleet goes quiet, catching the
cross-cutting issues no single worker ever sees — recurring frustrations, skill
gaps, anti-patterns, what-worked-vs-what-hurt — and routes each into a ledger
that makes waste visible.
The defining feature is not the fan-out. It is the action-ledger that makes token-waste impossible to hide. Anyone can fan out N agents and produce nice docs; the ledger is what forces every finding to a disposition and reports conversion-to-change.
⚠️ This skill, and
batch-session-minersbefore it, were lost once because they lived as untracked WIP in gitignoreddocs.local/. A weave run's harness, ledger, and retro are committed artifacts. Never leave them untracked. That discipline is the whole reason this skill exists as a skill.
| Spoken | Effect |
|---|---|
| "weave" | Arms the skill: loads it, runs the convergence gate (scripts/convergence-gate.sh), and waits until all conditions pass. Never auto-fires. |
| "weave now" | Fires immediately, skipping the gate. Operator override — use only when the operator knows the fleet is quiet. |
A multi-way deep-mining fan-out next to the live Opus fleet = OOM + token contention. The weave fires ONLY at convergence. All four must be true:
Plus a RAM gate (same lesson as content-demo's ram-gate.sh — quiesce first).
bash scripts/convergence-gate.sh # checks 1–3 + RAM; #4 needs operator ack
bash scripts/convergence-gate.sh --ack-demo # operator asserts demo approval → can PASS
Condition #4 is not script-checkable — the gate stays BLOCKED until the
operator passes --ack-demo (or WEAVE_DEMO_ACK=1). Human/orchestrator in the
loop; never auto-fire. "weave now" is the only path that skips the gate.
The weave mines the recent Claude + Codex session JSONLs (~/.claude/projects/**
~/.codex/sessions/**). "Web" = weaving a web across sessions, NOT web-search.
For Claude sessions it leans on the /skill-creator mining engine (the
deterministic parser session-miner.py — Claude-only — + the session-miner
sub-agent). Codex sessions have a different log shape and are NOT parsed by
session-miner.py; this skill's own prepare-mine-context.py handles both
formats (digest-if-present + keyword-grep excerpts), so a miner reads a uniform
context file regardless of source. The whole thing is wrapped by this skill's
reproducible harness:# WD = the SCRATCH run dir. digests/ + findings/ + mine-context/ are BULKY and
# fully REGENERABLE (re-run discover/prepare), so they may live in gitignored
# docs.local/. They are NOT the durable artifact — do not rely on them surviving.
WD=~/Gits/orchestrator/docs.local/weave-$(date +%F)
python3 scripts/weave-run.py discover --hours 24 --workdir "$WD" # → corpus-manifest.json (centerpieces ★ first)
python3 scripts/weave-run.py prepare --workdir "$WD" # → digests/ + mine-context/ (compact per-session)
python3 scripts/weave-run.py batches --workdir "$WD" --size 5 # → batch-manifest.json (one miner per session)
# ... dispatch one miner agent per session per batch (see §3) ...
python3 scripts/weave-run.py aggregate --workdir "$WD" --tokens <N> # → ACTION-LEDGER.md + ledger.json + conversion-to-change
# ... synthesize the forward-plan from the ledger ...
# ... THEN the MANDATORY red-team fact-check closing stage (§4b) before anything is trusted ...
⚠️ The durable artifact is NOT the scratch WD. The bulky digests/findings are regenerable; what must be committed (so it can't be lost like the original weave) is the conclusions: the conversion-to-change metrics + the high-importance routed findings + the retro. Commit those into this skill's tracked
retros/<date>.md(andbrain_storethem). Never leave the conclusions only indocs.local/.
discover tags them ★ and orders them first;
mine them deepest (references/topology.md).mine-context/<label>.md (digest + grep excerpts),
then grep the raw JSONL only to quote verbatim. They never read a whole MB-scale
JSONL into context.jsonl_line=N NOT digest §Nprepare-mine-context.py emits jsonl_line=N for grep excerpts — these are
raw JSONL line numbers (1-based enumerate over the file). Session-miner digest
§N / event indices are NOT jsonl lines — misciting them produces false evidence
(singles#2). Miners grep the raw JSONL by jsonl_line= only.
Run scripts/validate-dispatch-brief.py on every rendered §EDITS worker brief before
dispatch. It checks required sections (Scope, Sources, Mechanics) and that the
findings JSON schema fence is intact — one brief shipped truncated (singles#3).
Worker prompts MUST scope evidence verification to the files and line-ranges the
ledger already cites — never directory-wide rg over ~/.claude/projects or the
whole weave run dir. Unscoped greps burned one worker to compaction before edits
started (09-59-00#6).
discover --exclude-self <stem> records the exclusion in
corpus-manifest.json → deliberate_exclusions[] with reason (not silent skip).Each miner emits a findings JSONL — one object per line — at
<WD>/findings/<label>.jsonl:
{"id":"<label>#N","title":"...","detail":"...","evidence":"verbatim — source ref [line N]","type":"correction|frustration|anti-pattern|skill-gap|skill-candidate|decision|residual-bug|what-worked","track":"cmuxLayer|BrainLayer|VoiceLayer|MCL|MCP-layer|skill-creator|dashboard|plans|collab|cross-cutting","disposition":"MERGED-PR|PR-FILED|PR-FIX|SKILL-NEW|SKILL-EDIT|DEEP-RESEARCH|FOLLOW-UP|REJECTED|PARKED|KEEP|DUPLICATE","importance":1-10,"recurring":true|false}
Disposition → conversion class (what weave-ledger.py enforces): converged =
MERGED-PR/PR-FILED/PR-FIX/SKILL-NEW/SKILL-EDIT; open =
DEEP-RESEARCH/FOLLOW-UP; dropped (reason REQUIRED) =
REJECTED/PARKED/DUPLICATE; confirmation (excluded from the refined
denominator) = KEEP. (PR-FILED = PR opened but not yet merged;
FOLLOW-UP-FILED is accepted as an alias of FOLLOW-UP.)
Rules (inherit /never-fabricate + the session-miner discipline): verbatim
evidence with a [line N] or digest §N cite; no brain_store (files only);
dedup; suppress loop/cron noise; an empty session → an empty findings file, never
an invented one.
Dispatch (the bridge between batches and aggregate): spawn ONE miner per
session per batch (≤5 concurrent), centerpieces first. Two equivalent mechanisms:
pipeline/parallel of agent()
calls, each given the miner prompt below + a findings JSON schema; the workflow
writes each result to <WD>/findings/<label>.jsonl.session-miner sub-agent / Task calls from a skillCreator session (the
sub-agent is skill-creator-scoped): dispatch N Agent(subagent_type="session-miner", …)
in one message.Miner prompt skeleton:
Mine ONE session for the weave. Read your context file <WD>/mine-context/<label>.md
(digest + grep excerpts). For verbatim quotes, grep the raw JSONL by **jsonl_line=N**
from the Grep excerpts section — never read the whole file. Digest §N is NOT a jsonl
line. Emit findings (the JSON schema above), one per line, to
<WD>/findings/<label>.jsonl. Files only — no brain_store. Empty session → empty file.
Return: WEAVE_MINE_DONE <label> <count>.
VERIFY ON DISK — do not trust the agent's word (the weave's own thesis, applied to itself).
A miner may report success in its return value yet never have written the file
(observed live: a forced structured-output return competed with the Write step,
so ~70% of a 102-agent run self-reported wrote_file:true with no file on
disk). The findings FILE is authoritative, not the agent's claim. After every
batch run weave-run.py status --workdir "$WD" (it checks the actual files), and
re-mine any session with no file — with the file-Write as the terminal
deliverable and NO competing return schema. Loop batches until dry. (If you mine
via a Workflow with schema:, the agent often stops at the schema call and skips
the Write — prefer "Write the file, then reply WEAVE_MINE_DONE <label> <count>"
with no schema, and verify on disk.)
scripts/weave-ledger.py aggregates all findings into ACTION-LEDGER.md +
ledger.json and computes the metric that decides if the weave was worth it.
Two denominators are reported — both, always:
KEEP confirmations in the denominator so
the ratio can't be flattered by reclassifying findings as "not actionable."DEEP-RESEARCH,FOLLOW-UP) + dropped (REJECTED,PARKED,
DUPLICATE); KEEP is excluded because you can't "convert" a validated
what-worked. The refined number is informative, but the strict ÷-total number
is the one that governs the SHIP/RETIRE decision (see EVAL.md).
(converged = MERGED-PR+PR-FILED+PR-FIX+SKILL-NEW+SKILL-EDIT.)--tokens N).--strict exits non-zero if any finding has an
unknown disposition or is DROPPED without a reason. Route EVERY finding.A weave with 0% conversion is token-burn — the ledger surfaces that instead of letting "we produced N nice docs" pass for progress.
After synthesis, before ANY finding is trusted or acted on, a red-team
workflow verifies every load-bearing fact in the ledger + synthesis against
the raw JSONLs. This is non-negotiable — it is the anti-hallucination guard
for the L0 memory problem: a wrong fact that reaches the plan or brain_store
poisons every downstream decision. (Proven valuable: the 2026-05-29 weave's
red-team caught a wrong WhatsApp number for Etan plus several other wrong
facts that would otherwise have been acted on.)
Anchor on the highest-trust ground truth — what the OPERATOR said and did:
type:user turns only. Do not cite relays, queue-operation,
last-prompt, task-notification, worker summaries, or assistant
brain_store paraphrases as Etan evidence. If the quote appears in both a
relay and a raw user turn, cite the raw user turn. If no raw user turn exists,
mark the claim as relay-only and do not treat it as verified operator speech.[line N] must
resolve to the quoted text; every attribution (who did what, which repo, which
PR) must hold.Mechanism (a fan-out workflow): one verifier per batch of claims, each
re-greps the raw JSONL and returns {claim, verbatim_match, correct_attribution, corrected_value, verdict}. Any claim that fails is corrected in place or
dropped with a reason in the ledger before the plan/retro/brain_store are
trusted. Default to skeptical: if a fact can't be confirmed against the JSONL,
treat it as unverified and flag it. (Same discipline as /never-fabricate + the
session-miner GAP-REPORT.)
A weave's findings are only as trustworthy as this stage makes them. No weave output is "done" until the red-team fact-check has run and its corrections are folded back into the ledger.
A weave that only captures is a diary, not convergence.
After the red-team pass, the run is NOT complete until every SKILL-EDIT or SKILL-NEW item in the weave doc's skill-candidates/edits section (the §3-style "SKILL CANDIDATES / EDITS" list) is in exactly one of three discharge states:
| State | Required proof |
|---|---|
| APPLIED | PR link recorded next to the item (merged or open via /pr-loop) |
| DISPATCHED | Named owner + collab/inbox link where the owner acknowledged the item |
| TRACKED | Ledger row with confidence + evidence_count, plus the exact evidence references that future weaves can re-check |
No fourth state. "Captured for later" is the failure mode this phase exists to kill: the gen-10 weave captured 9 behavioral fixes that were never applied — gen-11 then re-violated them and Etan received the SAME lead-topology correction again (2026-06-05). The weave doc + retro MUST include the discharge table (item → state → proof). Do not count TRACKED as APPLIED in conversion-to-change; the point is to prevent loss without pretending every suggestion became a code or skill change.
TRACKED satisfies §4c when the doctrine says to observe before editing. Etan's
raw type:user doctrine, typo preserved: "we can use pheonix to track it or the
ledger... suggestions are possible, not always taken as necessary"
(orchestrator ce4072bf:[4960]). Use TRACKED for skill-behavior suggestions
that need evidence across sessions or divergent agents; promote TRACKED -> APPLIED
only when the evidence matures into a multi-instance or unambiguous failure.
The orchestrator running the weave owns this phase. If an item's owner is another LEAD, DISPATCHED requires the dispatch to have actually landed (collab ack or monitor loop armed) — not a parking-lot note (orc C7: dispatch now, not later).
Etan (2026-06-05, verbatim): "Everything should be moved from the weaving from the previous session to the new session, not just the top things. Everything should be relayed so we get a very good next orc."
The successor orc's boot MUST require:
brain_search tag sweep of ALL stored corrections from the closing session
(orc-correction + frustration-capture stores), read in full.Boot docs may summarize for orientation, but the summary MUST link the full weave doc and MANDATE the full read + ACK. A boot doc that relays only "the top things" is a relay failure — the dropped tail is exactly where re-violations come from.
/weave → next-gen large-plan → orchestrator runs parallel tracks → ship → re-weave.
The ledger is the compounding instrument across loops.
Deliverable 2 is the highest-priority emit (Etan emphatic 3×), even though Deliverable 1 is the terminal artifact. If you can only land one, land the gate.
/large-plan covering up
to 5 parallel tracks, each track populated by mined findings, not
invented. Big initial collab = modularize/componentize first, then 4–5
LEAD orchestrators (one per track). The spec's named tracks:
/never-fabricate + the /qa-video
method-attribution rule.)
LSUIElement menu-bar app — that's
a grant/focus state, not an impossibility. Fallbacks when it blocks direct
CU: screencapture CLI + coordinate clicks, or an in-app PNG-export
affordance (qa-video hotspot #13).After each run, write retros/<date>.md: what we learned, what to improve next
time (better miner prompts, better disposition routing, what got missed), what
the red-team fact-check (§4b) caught and corrected (the wrong facts that would
otherwise have shipped), the §4c application table (every edit item APPLIED or
DISPATCHED with links), and a delta vs the prior weave. brain_store the
conclusions. The next weave starts
from the last retro — that's the compounding. (The reason this never snowballed
before: the weave was never built or committed. Fixed here, permanently.)
A planned weave dimension (don't block the first run on it): whenever a new model
drops, the weave should auto-surface skill-relevance actions. Research agents
track what changed between the new model and the prior one (thinking/capability
deltas, what it now does natively), and for each skill the weave proposes:
still needed? · model now smart enough → RETIRE · or make it LEANER? ·
what changed in the model's reasoning that affects it? This ties straight into
/skill-creator's capability-uplift vs encoded-preference classification:
capability-uplift skills obsolesce as models improve; encoded-preference skills
endure. Routing these proposals through the ledger turns each model bump into more
conversion-to-change. (Fold into a future iteration / retro — not the first build.)
| Path | Role |
|---|---|
| scripts/convergence-gate.sh | The 4-condition gate + RAM check; arms "weave", bypassed by "weave now" |
| scripts/weave-run.py | Reproducible orchestrator: discover → prepare → batches → aggregate |
| scripts/prepare-mine-context.py | Compact per-session context (digest + grep excerpts with jsonl_line=N) for one miner; Claude + Codex formats |
| scripts/validate-dispatch-brief.py | Pre-send validation for §EDITS worker briefs (schema + required sections) |
| scripts/weave-ledger.py | Action-ledger + conversion-to-change + routing-contract enforcement |
| references/topology.md | flat-N vs staged, batch size, centerpieces-first, the round structure |
| EVAL.md | Backtest baseline, flat-vs-staged eval, the conversion metric, smoke checks |
| evals/fixtures/findings-{clean,violations}/ | Committed smoke fixtures: clean → --strict exit 0; violations → exit 2 |
/orc convergence step)./skill-creator (session-miner + session-miner.py);
/weave is the orchestrator wrapper that arms it, gates it on convergence, and
routes its findings through the action-ledger. (batch-session-miners was
folded into /skill-creator — single source of truth for mining; /weave is
the only batch-mining orchestrator skill.)/skill-creator — the mining engine (mine-session, session-miner sub-agent, the parser)./large-plan — the weave's top output is a large-plan; tracks come from findings./never-fabricate — every finding cites verbatim evidence; no invented ledger rows./pr-loop — converting a SKILL-EDIT/PR-FIX finding to MERGED-PR goes through it./orc — convergence detection + dispatch of miners; surfaces the ledger to Etan.tools
The human-eval UX contract for Phoenix views: turn-by-turn scrollable replay (not a scorecard), hide-but-copyable IDs, collapsed thinking, identity chips, tool filters, tiny frozen starter datasets, mark-wrong-in-thread, mobile-first. Use when: building or reviewing ANY Phoenix/eval view, annotation UI, session replay, or human-grading surface. Triggers: phoenix view, eval UI, annotation view, session replay, human eval UX, grading interface. NOT for: Phoenix data pipelines/ingest (capture scripts have their own specs).
tools
macOS systems specialist — AppKit NSPanel architecture, launchd services, socket activation, MCP bridge resilience, syspolicyd, and high-frequency SwiftUI dashboards. Use when building menu-bar apps, LaunchAgents, debugging syspolicyd/Gatekeeper/TCC, resilient UDS/MCP bridges, or SwiftUI dashboards at 10Hz+.
development
Bulk LLM-judging protocol for fleet-dispatched verdict runs (KG cluster, eval harness). Use when: dispatching or running judge workers (J1/J2/RT), planning bulk-apply from verdict JSONL, or triaging evidence_degraded outputs. Triggers: judge fleet, bulk judge, R3 verdicts, kg-judge, RT gate, evidence_degraded. NOT for: single-item code review, Phoenix view UX (use phoenix-human-view), or non-judge eval pipelines.
development
Quiet-down protocol for sprint close: when the fleet wraps, delete ALL polling crons and monitors, send ONE final dashboard + ONE message, then go SILENT. Use when: fleet wraps, all workers done, overnight queue exhausted, sprint close, Etan asleep/away with nothing approved left. Triggers: fleet wrap, wrap the fleet, stand down, going quiet, sprint close. NOT for: mid-sprint monitoring (keep your loops), spawning a successor (use /session-handoff first).