skills/golem-powers/freeze-detect/SKILL.md
Use when supervising cmux or similar agent surfaces that look unchanged, quiet, or token-frozen. Distinguishes stale parsed telemetry from genuinely idle workers by rotating one full read onto the worst offender, requiring prompt proof before calling a surface idle, and parking monitor loops around known long-running operations. Triggers on: parsed_only, frozen screen, idle codex, no token movement, stuck worker, long-running build, long-running test.
npx skillsauth add etanhey/golems freeze-detect
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Repeated partial telemetry is a suspicion, not a verdict. Escalate one surface to a full read, then reason from what you actually saw.
parsed_only output keeps repeating

`parsed_only=True` is compressed telemetry. It can repeat the same wrapper while the underlying surface is still active. A loop that treats repeated parsed snippets as truth will:
- break `/monitor-loop` counter discipline by resetting or parking for the wrong reason

This skill forces a narrow verification path: escalate one suspect surface per tick, read the full screen, and prove idleness before you say it.
When monitoring N surfaces, keep lightweight telemetry on all of them, but escalate only one surface to a full read per tick.
Required fields:
- `parsed_only_signature`
- `consecutive_matching_parsed_ticks`
- `last_full_read_time`
- `last_full_read_summary`
- `last_known_long_running_op`
- `idle_candidate_since`

Tick shape:
1. Gather parsed-only snapshots for all monitored surfaces.
2. Identify which surfaces have matching parsed signatures across repeated ticks.
3. Select the single worst offender:
- highest `consecutive_matching_parsed_ticks`
- oldest `last_full_read_time`
- highest operational risk if misclassified
4. Run exactly one full read on that surface this tick.
5. Classify it as:
- active
- idle-candidate
- long-running
- unknown-needs-recheck
6. Carry the result back into `/monitor-loop` without declaring success from telemetry alone.
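The required fields and the worst-offender selection in step 3 could be sketched as follows. This is a minimal illustration, not the skill's implementation: `SurfaceState`, `record_parsed_snapshot`, `pick_worst_offender`, and the numeric `risk` score are hypothetical names, and resolving ties in the order the criteria are listed is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SurfaceState:
    """Hypothetical per-surface record holding the required fields."""
    name: str
    parsed_only_signature: str = ""
    consecutive_matching_parsed_ticks: int = 0
    last_full_read_time: float = 0.0  # epoch seconds; 0.0 means never fully read
    last_full_read_summary: str = ""
    last_known_long_running_op: Optional[str] = None
    idle_candidate_since: Optional[float] = None
    risk: int = 0  # assumed operator-assigned score; higher is worse to misclassify


def record_parsed_snapshot(state: SurfaceState, signature: str) -> None:
    """Update the match counter from one parsed-only snapshot."""
    if signature == state.parsed_only_signature:
        state.consecutive_matching_parsed_ticks += 1
    else:
        state.parsed_only_signature = signature
        state.consecutive_matching_parsed_ticks = 0


def pick_worst_offender(states: list[SurfaceState]) -> Optional[SurfaceState]:
    """Return the single surface to escalate this tick, or None if nothing repeats."""
    suspects = [s for s in states if s.consecutive_matching_parsed_ticks > 0]
    if not suspects:
        return None
    return max(
        suspects,
        key=lambda s: (
            s.consecutive_matching_parsed_ticks,  # most repeated telemetry first
            -s.last_full_read_time,               # then the oldest full read
            s.risk,                               # then the highest assumed risk
        ),
    )
```

Returning at most one surface keeps the "exactly one full read per tick" rule in the selection function itself, so the caller cannot accidentally escalate several surfaces in the same cycle.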
If parsed_only output matches across repeated ticks, do not call the surface idle.
Action:
This is the anti-spam rule. Five suspicious surfaces do not justify five full reads in one cycle.
If the full read shows any of these:
Then the surface is active, not idle.
Action:
- clear `idle_candidate_since`
- keep `/monitor-loop` focused on the queue, not on poking the worker

If the full read shows the worker is in a known multi-minute operation such as:
Then classify it as long-running.
Action:
- set a `15m` re-check interval before the next full read unless another stronger signal arrives

The point is to reduce monitor churn while preserving the worker's runway.
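The parking rule for long-running surfaces could be expressed as a small predicate. This is a sketch under assumptions: `next_full_read_due` and its parameters are hypothetical, and what counts as a "stronger signal" is left to the caller because the skill does not define it here.

```python
RECHECK_INTERVAL_SECONDS = 15 * 60  # the 15m re-check interval


def next_full_read_due(parked_at: float, now: float, stronger_signal: bool = False) -> bool:
    """Return True when a parked long-running surface should get another full read.

    parked_at and now are timestamps in seconds. stronger_signal is an assumed
    flag the caller sets when some other evidence justifies an early re-check.
    """
    if stronger_signal:
        return True
    return now - parked_at >= RECHECK_INTERVAL_SECONDS
```

A surface parked at `t=0` is left alone until `t=900` unless the caller passes `stronger_signal=True`.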
A quiet full read is still not enough to declare idleness. Only declare idle when both conditions hold:
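- a prompt indicator (`›`, `>`, or `$`) sits at the bottom of the screen
- full reads have stayed identical for at least `60s`

If either condition is missing, the surface is only an idle-candidate.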
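The two-part idle check (a prompt indicator at the bottom of the screen, plus 60s of identical full reads) could be sketched as below. The function names and the `(timestamp, screen_text)` read format are hypothetical, and treating "the last non-empty line starts or ends with an indicator" as bottom-of-screen proof is a simplifying assumption that real terminal output may need refined.

```python
PROMPT_INDICATORS = ("›", ">", "$")
REQUIRED_STABLE_SECONDS = 60.0


def has_bottom_prompt(screen: str) -> bool:
    """True if the last non-empty line starts or ends with a prompt indicator."""
    lines = [line.strip() for line in screen.splitlines() if line.strip()]
    if not lines:
        return False
    last = lines[-1]
    return last.startswith(PROMPT_INDICATORS) or last.endswith(PROMPT_INDICATORS)


def can_declare_idle(reads: list[tuple[float, str]]) -> bool:
    """Decide idleness from full reads given as (timestamp_seconds, screen_text), oldest first."""
    if not reads:
        return False
    latest_time, latest_screen = reads[-1]
    if not has_bottom_prompt(latest_screen):
        return False  # no prompt at the bottom means no idle verdict
    # Walk backwards over the unbroken run of reads identical to the latest one.
    stable_since = latest_time
    for timestamp, screen in reversed(reads):
        if screen != latest_screen:
            break
        stable_since = timestamp
    return latest_time - stable_since >= REQUIRED_STABLE_SECONDS
```

When `can_declare_idle` returns `False`, the surface stays an idle-candidate at best; the sketch never upgrades a quiet screen to idle on its own.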
Action:
When several surfaces look frozen, escalate only the single worst offender to a full read that tick.
Do not:
No prompt at the bottom of the screen means no idle verdict.
Acceptable proof:
- `›` at the bottom after an agent turn finishes
- `>` at the shell prompt
- `$` at the shell prompt

Unacceptable proof:
Token counts can stall while tools keep working. Tool calls, subprocesses, and long-running commands often do not produce billable token movement.
Therefore:
Do not interrupt or re-dispatch a worker just because the screen is stable during a build or test.
Instead:
- mark the surface long-running, park the monitor branch, and re-check in `15m`

If a full read cannot prove idle or active state, say unknown-needs-recheck.
/never-fabricate applies here: repeated telemetry, wrapper text, and token counters are not evidence strong enough to claim a worker is idle.
When multiple surfaces match parsed-only telemetry, rank them by:
1. highest `consecutive_matching_parsed_ticks`
2. oldest `last_full_read_time`
3. highest operational risk if misclassified

Examples of high-risk surfaces:
| Skill | How it composes |
|---|---|
| /monitor-loop | Supplies the tick state machine and ensures freeze checks still end in dispatch, verify-and-decrement, or park |
| /never-fabricate | Prevents treating parsed-only wrappers, token counts, or silence as proof of idleness |
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| 15 parsed reads for 5 surfaces across 3 ticks | Repeats low-signal telemetry and still learns nothing | Rotate one full read per tick onto the worst offender |
| "Token count hasn't moved, so the codex is idle" | Tool work may continue without token movement | Use token freeze as a hint, then read the full surface |
| Prompt appears once in the middle of the screen | Mid-screen prompt fragments are not bottom-of-screen idle proof | Require prompt indicator at the bottom plus 60s identical full reads |
| Build output unchanged for 2 minutes, so worker is frozen | Stable build/test output is often normal | Mark long-running, park monitor branch, re-check in 15m |
| Parsed-only wrapper says idle | Wrapper text is telemetry, not truth | Full read and classify from actual screen evidence |
When a loop or monitor policy violates this skill:
- `15m` re-check.