skills/golem-powers/ecosystem-health/SKILL.md
Run ecosystem health checks — MCP connections, BrainLayer stats, skill evals, friction scans. Use this skill when asked about ecosystem health, maintenance checks, skill monitoring, 'is everything working', 'run a health check', 'what's broken', or when proactively auditing the system. Also triggers for 'maintenance Claude', 'ecosystem audit', 'skill eval', 'MCP status', or 'BrainLayer health'. Run this before and after major changes to catch regressions.
npx skillsauth add etanhey/golems ecosystem-healthInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Audit the golems ecosystem: MCP servers, BrainLayer, VoiceLayer daemon, JSONL watcher, enrichment, git status, Axiom telemetry, open PRs. Run at session start AND end.
The ecosystem has 22 repos, 45+ skills, 5 MCP servers (BrainLayer, VoiceLayer, cmux, exa, supabase), a 7GB semantic memory store, a JSONL watcher, enrichment pipeline, and multiple long-lived Claude sessions. Things break silently — MCP disconnects, daemons crash, watchers die, enrichment stalls, PRs rot. This skill catches those problems before the user notices.
Run ALL checks. Report results in table format. Stop and flag if any check is RED.
Check each MCP server is connected and responsive:
# BrainLayer — must respond with data
brain_recall(mode="stats")
# GREEN if: returns chunk count + entity count
# RED if: timeout, error, "unavailable"
# VoiceLayer — silent ping
voice_speak(message="Health check", mode="think")
# GREEN if: no error (think mode = silent log only)
# RED if: timeout or MCP unavailable
# cmux — check surface listing
mcp__cmux__list_surfaces()
# GREEN if: returns list (even empty)
# RED if: timeout or MCP unavailable
# exa — verify connection (no query needed, just check tool exists)
# GREEN if: exa tools appear in available tools
# RED if: not listed
# supabase — verify connection
# GREEN if: supabase tools appear in available tools
# RED if: not listed
If any MCP is RED: Report it immediately. MCP failures cascade — BrainLayer down means no memory, VoiceLayer down means no voice, cmux down means no agent coordination.
Go deeper than connectivity — verify BrainLayer actually returns meaningful data:
brain_search("ecosystem health check")
# GREEN if: returns results (any)
# YELLOW if: returns empty (index may need rebuild)
# RED if: timeout or error
Also check DB vitals:
python3 -c "
import sqlite3, os
candidates = [
'~/.local/share/brainlayer/brainlayer.db',
'~/.local/share/zikaron/zikaron.db',
]
db = None
for c in candidates:
p = os.path.expanduser(c)
if os.path.exists(p) or os.path.exists(p + '-shm'):
db = p
break
if not db:
print('RED: No BrainLayer DB found')
exit(1)
db_exists = os.path.exists(db)
wal_path = db + '-wal'
wal_size = os.path.getsize(wal_path) if os.path.exists(wal_path) else 0
shm_exists = os.path.exists(db + '-shm')
if not db_exists and shm_exists:
print(f'WAL-only mode (MCP holds DB in memory)')
print(f'WAL: {wal_size / 1e6:.0f} MB')
else:
db_size = os.path.getsize(db)
conn = sqlite3.connect(f'file:{db}?mode=ro', uri=True)
chunks = conn.execute('SELECT COUNT(*) FROM chunks').fetchone()[0]
conn.close()
print(f'Chunks: {chunks}')
print(f'DB: {db_size / 1e9:.1f} GB | WAL: {wal_size / 1e6:.0f} MB')
status = 'RED' if wal_size > 500e6 else 'YELLOW' if wal_size > 100e6 else 'GREEN'
print(f'WAL status: {status}')
"
Thresholds:
Check the Voice Bar app + MCP daemon are running:
# Voice Bar app — persistent macOS server on /tmp/voicelayer.sock
pgrep -x "Voice Bar" || pgrep -x "VoiceBar" || echo "RED: Voice Bar not running"
# VoiceLayer MCP daemon — singleton on /tmp/voicelayer-mcp.sock
pgrep -fl "mcp-server-daemon" || echo "YELLOW: VoiceLayer daemon not running"
# Socket file exists and is a socket
test -S /tmp/voicelayer.sock && echo "Voice Bar socket: OK" || echo "RED: No Voice Bar socket"
test -S /tmp/voicelayer-mcp.sock && echo "MCP daemon socket: OK" || echo "YELLOW: No MCP daemon socket"
# Check daemon log for recent errors
tail -5 /tmp/voicelayer-mcp-daemon.stderr.log 2>/dev/null | grep -i "error\|fatal\|crash" && echo "YELLOW: Recent daemon errors" || echo "Daemon log: clean"
Expected state:
/tmp/voicelayer.sock and /tmp/voicelayer-mcp.sockThe BrainLayer JSONL watcher tails ~/.claude/projects/*.jsonl and (arbitrated mode) enqueues chunks to ~/.brainlayer/queue/ for the drain/BrainBar consumers.
Check launchd LOADED state first, not just the process. The 2026-06-06 incident: the watcher was silently dead for 20 hours because the service was unloaded during a load-shed (launchctl bootout) and never restored — KeepAlive cannot revive an unloaded service, and a pgrep-only check tells you it's dead but not why or how to fix it.
# 1. Service LOADED? (unloaded = load-shed casualty; KeepAlive can't help)
launchctl list | grep -q "com.brainlayer.watch" \
|| echo "RED: com.brainlayer.watch UNLOADED — fix: launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.brainlayer.watch.plist"
# 2. Process alive? (message is neutral — if step 1 said UNLOADED, that's the cause; fix there first)
pgrep -fl "brainlayer watch" || echo "RED: no watcher process"
# 3. Offsets freshness — the watcher writes ~/.local/share/brainlayer/offsets.json on progress.
# Stale offsets while Claude sessions are active = watcher wedged or dead.
offsets_age_min=$(( ( $(date +%s) - $(stat -f %m ~/.local/share/brainlayer/offsets.json 2>/dev/null || echo 0) ) / 60 ))
echo "offsets.json age: ${offsets_age_min}min"
# GREEN <15min (with active sessions) | YELLOW 15-60min | RED >60min
# 4. Queue backlog (arbitrated mode) — events piling up = consumers (drain/BrainBar) dark
queue_count=$(find ~/.brainlayer/queue -type f 2>/dev/null | wc -l | tr -d ' ')
echo "queue events: $queue_count"
# GREEN <100 | YELLOW 100-1000 | RED >1000 (also check the sibling services below)
# 5. Sibling services that die in the same load-sheds (same bootstrap fix as #1):
for svc in com.brainlayer.drain com.brainlayer.enrichment; do
launchctl list | grep -q "$svc" || echo "YELLOW: $svc UNLOADED"
done
Why this matters: date-windowed write counters (e.g. BrainBar's "Recent writes" gauge) only see stamped chunks — a dead watcher reads as "3 writes/hour", not as an alarm. Silence looks like quiet, not failure. After ANY load-shed or reboot, re-run this section and re-bootstrap every unloaded com.brainlayer.* service.
The enrichment pipeline processes raw chunks into entities, relations, and embeddings:
# Check if enrichment is running
pgrep -fl "brainlayer.*enrich" || pgrep -fl "enrichment" || echo "INFO: No enrichment running (may be idle)"
# Check enrichment stats via BrainLayer
# brain_recall with stats mode includes enrichment info
Also check via MCP:
brain_recall(mode="stats")
# Look for: enrichment_pending count, last_enrichment timestamp
# GREEN if: pending < 100 and last enrichment < 1 hour ago
# YELLOW if: pending 100-1000 or last enrichment > 6 hours
# RED if: pending > 1000 or last enrichment > 24 hours
Uncommitted changes = risk of lost work. Check all key repos:
for repo in golems brainlayer voicelayer orchestrator cmuxlayer; do
if [ -d ~/Gits/$repo ]; then
status=$(git -C ~/Gits/$repo status --porcelain 2>/dev/null | head -5)
branch=$(git -C ~/Gits/$repo branch --show-current 2>/dev/null)
if [ -z "$status" ]; then
echo "✓ $repo ($branch): clean"
else
count=$(git -C ~/Gits/$repo status --porcelain 2>/dev/null | wc -l | tr -d ' ')
echo "⚠ $repo ($branch): $count uncommitted files"
fi
fi
done
Thresholds:
Check if the watcher is sending heartbeats to Axiom:
# Check watcher log for recent Axiom sends
grep -c "axiom\|telemetry\|heartbeat" /tmp/brainlayer-watcher.log 2>/dev/null || echo "0"
# Check last heartbeat timestamp
grep -i "heartbeat\|axiom" /tmp/brainlayer-watcher.log 2>/dev/null | tail -1
# If no watcher log, check if Axiom env vars are configured
[ -n "$AXIOM_TOKEN" ] && echo "Axiom token: configured" || echo "YELLOW: AXIOM_TOKEN not set"
[ -n "$AXIOM_DATASET" ] && echo "Axiom dataset: $AXIOM_DATASET" || echo "YELLOW: AXIOM_DATASET not set"
Expected state:
Stale PRs rot. Check what's open:
for repo in EtanHey/golems EtanHey/brainlayer EtanHey/voicelayer EtanHey/orchestrator EtanHey/cmuxlayer; do
prs=$(gh pr list --repo $repo --state open --json number,title,createdAt --jq 'length' 2>/dev/null)
if [ "$prs" = "0" ] || [ -z "$prs" ]; then
echo "✓ $repo: no open PRs"
else
echo "⚠ $repo: $prs open PR(s)"
gh pr list --repo $repo --state open --json number,title,createdAt --jq '.[] | " #\(.number) \(.title) (\(.createdAt | split("T")[0]))"' 2>/dev/null
fi
done
Thresholds:
After running all checks, produce a structured report:
# Ecosystem Health Report — YYYY-MM-DD HH:MM
## Overall: GREEN / YELLOW / RED
### Infrastructure
| Check | Status | Value | Notes |
|-------|--------|-------|-------|
| BrainLayer MCP | ✅ | 297K chunks | <1s response |
| VoiceLayer MCP | ✅ | connected | daemon mode |
| cmux MCP | ✅ | connected | N surfaces |
| exa MCP | ✅ | available | — |
| supabase MCP | ✅ | available | — |
| Voice Bar | ✅ | running | socket OK |
| VoiceLayer daemon | ✅ | running | socket OK |
| JSONL watcher | ✅ | running | 0 inbox files |
| Enrichment | ✅ | idle | 0 pending |
### Data
| Check | Status | Value | Notes |
|-------|--------|-------|-------|
| BrainLayer search | ✅ | responsive | results returned |
| WAL size | ✅ | 0 MB | clean |
| Axiom telemetry | ✅ | flowing | last heartbeat 2m ago |
### Development
| Check | Status | Value | Notes |
|-------|--------|-------|-------|
| Git repos clean | ⚠ | 2/5 dirty | golems, orchestrator |
| Open PRs | ✅ | 0 total | — |
### Actions Needed
1. [specific action if any]
After generating, store in BrainLayer:
brain_store(
content: "Ecosystem health YYYY-MM-DD: [overall status]. [summary of findings]. [actions needed]",
tags: ["health-check", "ecosystem", "maintenance"],
importance: 6
)
Everything in Quick Check, plus:
python3 ~/Gits/orchestrator/scripts/friction-scan.py --threshold 5
Compare against previous scan. Look for new friction categories, recurring patterns, trending up/down.
Pick 3-5 skills and verify their evals exist and pass:
for skill in coach pr-loop commit research cmux-agents; do
echo "=== $skill ==="
ls ~/Gits/orchestrator/skill-evals/$skill/ 2>/dev/null || echo "NO EVALS"
done
Run 3 known-good queries and verify they return expected results:
brain_search("component reasoning brainlayer")
brain_search("friction patterns coachClaude")
brain_search("orchestrator architecture golems")
If any return empty or irrelevant results, search quality has degraded.
for repo in golems brainlayer voicelayer orchestrator cmuxlayer; do
last=$(git -C ~/Gits/$repo log -1 --format="%ar" 2>/dev/null || echo "not found")
echo "$repo: $last"
done
Repos with no activity > 2 weeks during active development = investigate.
ls -la ~/.claude/hooks/brainlayer-*.py 2>/dev/null || echo "No BrainLayer hooks"
cat ~/.claude/settings.json | python3 -m json.tool 2>/dev/null | grep -A5 "hooks" || echo "No hooks in settings"
Verify: SessionStart and UserPromptSubmit hooks are wired. No PostToolUse hooks (those cause hangs).
tools
The human-eval UX contract for Phoenix views: turn-by-turn scrollable replay (not a scorecard), hide-but-copyable IDs, collapsed thinking, identity chips, tool filters, tiny frozen starter datasets, mark-wrong-in-thread, mobile-first. Use when: building or reviewing ANY Phoenix/eval view, annotation UI, session replay, or human-grading surface. Triggers: phoenix view, eval UI, annotation view, session replay, human eval UX, grading interface. NOT for: Phoenix data pipelines/ingest (capture scripts have their own specs).
tools
macOS systems specialist — AppKit NSPanel architecture, launchd services, socket activation, MCP bridge resilience, syspolicyd, and high-frequency SwiftUI dashboards. Use when building menu-bar apps, LaunchAgents, debugging syspolicyd/Gatekeeper/TCC, resilient UDS/MCP bridges, or SwiftUI dashboards at 10Hz+.
development
Bulk LLM-judging protocol for fleet-dispatched verdict runs (KG cluster, eval harness). Use when: dispatching or running judge workers (J1/J2/RT), planning bulk-apply from verdict JSONL, or triaging evidence_degraded outputs. Triggers: judge fleet, bulk judge, R3 verdicts, kg-judge, RT gate, evidence_degraded. NOT for: single-item code review, Phoenix view UX (use phoenix-human-view), or non-judge eval pipelines.
development
Quiet-down protocol for sprint close: when the fleet wraps, delete ALL polling crons and monitors, send ONE final dashboard + ONE message, then go SILENT. Use when: fleet wraps, all workers done, overnight queue exhausted, sprint close, Etan asleep/away with nothing approved left. Triggers: fleet wrap, wrap the fleet, stand down, going quiet, sprint close. NOT for: mid-sprint monitoring (keep your loops), spawning a successor (use /session-handoff first).