skills/investigating-kilroy-runs/SKILL.md
To diagnose active, stuck, or failed Kilroy Attractor runs, inspect run artifacts (`manifest.json`, `live.json`, `checkpoint.json`, `final.json`, `progress.ndjson`), resolve run IDs/log roots, identify model/provider routing, and isolate failure causes. Includes CXDB operations for launching/probing CXDB, opening the CXDB UI, and querying run context turns. This skill is useful when investigating run status, debugging retries/failures, explaining model usage, or inspecting CXDB-backed event history.
To inspect a run quickly and produce a precise diagnosis, follow this workflow.
- Use the `--logs-root` path when provided.
- Otherwise, look under `~/.local/state/kilroy/attractor/runs`.
- Store the resolved run directory in `RUN_ROOT`.
RUNS="$HOME/.local/state/kilroy/attractor/runs"
RUN_ID="$(find "$RUNS" -mindepth 1 -maxdepth 1 -type d -printf '%T@ %f\n' | sort -nr | awk 'NR==1 {id=$2} END {print id}')"
RUN_ROOT="$RUNS/$RUN_ID"
echo "$RUN_ID"
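The resolution order above can be wrapped in a small helper. This is a minimal sketch: `resolve_run_root` is a hypothetical function name, not part of the Kilroy CLI, and it assumes the newest run is the directory with the most recent modification time.

```shell
# Hypothetical helper: print the run root to inspect.
# $1 (optional): an explicit logs root, such as the value given to --logs-root.
# RUNS (optional): overrides the default runs directory.
resolve_run_root() (
  if [ -n "${1:-}" ]; then
    printf '%s\n' "$1"
    exit 0
  fi
  runs="${RUNS:-$HOME/.local/state/kilroy/attractor/runs}"
  # ls -t sorts newest first; -d lists the directories themselves.
  latest="$(ls -1td "$runs"/*/ 2>/dev/null | head -n 1)"
  [ -n "$latest" ] || { echo "no runs under $runs" >&2; exit 1; }
  # Strip the trailing slash left by the glob.
  printf '%s\n' "${latest%/}"
)
```

With no argument it falls back to the newest directory under `RUNS`; with an argument it returns that path unchanged, for example `RUN_ROOT="$(resolve_run_root /path/to/logs-root)"`.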
For a quick check of the newest run without manually resolving RUN_ROOT, use:
./kilroy attractor status --latest --json
To get CXDB connection details, start with manifest.json, then fall back to run_config.json and live artifacts when fields are missing:
CXDB_URL="$(jq -r '.cxdb.http_base_url // empty' "$RUN_ROOT/manifest.json")"
CONTEXT_ID="$(jq -r '.cxdb.context_id // empty' "$RUN_ROOT/manifest.json")"
if [ -z "$CXDB_URL" ] && [ -f "$RUN_ROOT/run_config.json" ]; then
CXDB_URL="$(jq -r '.cxdb.http_base_url // empty' "$RUN_ROOT/run_config.json")"
fi
if [ -z "$CONTEXT_ID" ] && [ -f "$RUN_ROOT/live.json" ]; then
CONTEXT_ID="$(jq -r '.context_id // empty' "$RUN_ROOT/live.json")"
fi
if [ -z "$CONTEXT_ID" ] && [ -f "$RUN_ROOT/checkpoint.json" ]; then
CONTEXT_ID="$(jq -r '.context_id // empty' "$RUN_ROOT/checkpoint.json")"
fi
echo "cxdb_url=$CXDB_URL context_id=$CONTEXT_ID"
To make sure CXDB is available and to print the UI endpoint, run:
./scripts/start-cxdb.sh
UI_LINE="$(./scripts/start-cxdb-ui.sh)"
echo "$UI_LINE" # prints: cxdb_ui=http://...
CXDB_UI="${UI_LINE#cxdb_ui=}"
Use the endpoint printed by start-cxdb-ui.sh (cxdb_ui=...) as the source of truth. To open the UI in a browser when needed, run:
KILROY_CXDB_OPEN_UI=1 ./scripts/start-cxdb-ui.sh
To follow run events directly from CXDB:
./kilroy attractor status --logs-root "$RUN_ROOT" --follow --cxdb
./kilroy attractor status --logs-root "$RUN_ROOT" --follow --cxdb --raw
To run direct HTTP queries for ad-hoc debugging, use:
# Health endpoint may be /healthz even when /health returns 404.
curl -fsS "$CXDB_URL/health" || curl -fsS "$CXDB_URL/healthz"
curl -fsS "$CXDB_URL/v1/contexts"
curl -fsS "$CXDB_URL/v1/contexts/$CONTEXT_ID"
curl -fsS "$CXDB_URL/v1/contexts/$CONTEXT_ID/turns?limit=20"
curl -fsS "$CXDB_URL/v1/contexts/$CONTEXT_ID/turns?view=typed&limit=20"
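The health fallback above can be expressed as a reusable probe. This is a sketch only: `cxdb_health` is a hypothetical name, the 5-second timeout is an assumed value, and it relies on `curl` being installed.

```shell
# Hypothetical helper: succeed if either health endpoint answers.
# $1: CXDB base URL (for example "$CXDB_URL").
cxdb_health() (
  base="${1%/}"
  for path in /health /healthz; do
    # -f turns HTTP errors such as 404 into a non-zero exit status.
    if curl -fsS --max-time 5 "$base$path" >/dev/null 2>&1; then
      echo "$path"
      exit 0
    fi
  done
  exit 1
)
```

It prints the path that responded, so `cxdb_health "$CXDB_URL" || echo "CXDB unreachable"` can gate the context queries.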
To build a reliable picture of run state:
Always inspect graph.dot first so status is interpreted in graph context.
Preflight-only runs (--preflight / --test-run) are expected to write preflight_report.json and skip execution artifacts (manifest.json, checkpoint.json, final.json, worktree/).
- `manifest.json`: run identity, graph name, repo, worktree, `started_at`.
- `live.json`: most recent event.
- `checkpoint.json`: last completed node and failure context.
- `final.json`: if present, the run is finished (success or fail).
- `progress.ndjson`: full event timeline.
sed -n '1,200p' "$RUN_ROOT/graph.dot"
sed -n '1,200p' "$RUN_ROOT/manifest.json"
sed -n '1,200p' "$RUN_ROOT/live.json"
[ -f "$RUN_ROOT/checkpoint.json" ] && sed -n '1,200p' "$RUN_ROOT/checkpoint.json"
[ -f "$RUN_ROOT/final.json" ] && sed -n '1,200p' "$RUN_ROOT/final.json"
tail -n 80 "$RUN_ROOT/progress.ndjson"
To classify run state:
- Active: `final.json` missing and `live.json`/`progress.ndjson` still changing.
- Finished: `final.json` present.
- Stuck: no `progress.ndjson` updates for longer than the configured stall timeout.
- `attractor status` can show a terminal fail while `progress.ndjson` still advances in overlap conditions; comparing timestamps resolves this ambiguity.
To quickly validate terminal state vs liveness, use:
ls -la "$RUN_ROOT/final.json"
tail -n 1 "$RUN_ROOT/progress.ndjson"
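The classification rules can be sketched as a function. `classify_run_state` is a hypothetical name, the 600-second default is an assumed placeholder for the run's configured stall timeout, and `stat -c %Y` assumes GNU coreutils.

```shell
# Hypothetical helper: classify a run as finished, active, or stuck.
# $1: run root; $2: stall timeout in seconds (assumed default: 600).
classify_run_state() (
  root="$1"
  stall="${2:-600}"
  if [ -f "$root/final.json" ]; then
    echo finished
    exit 0
  fi
  [ -f "$root/progress.ndjson" ] || { echo unknown; exit 0; }
  now="$(date +%s)"
  # GNU stat: modification time in seconds since the epoch.
  last="$(stat -c %Y "$root/progress.ndjson")"
  if [ $((now - last)) -gt "$stall" ]; then
    echo stuck
  else
    echo active
  fi
)
```

Here `finished` only means `final.json` is present; when `progress.ndjson` keeps advancing afterwards, a timestamp comparison is still needed.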
To avoid filtering for non-existent event keys, discover the active event schema before building event-specific queries:
jq -r '.event? // empty' "$RUN_ROOT/progress.ndjson" | sort | uniq -c | sort -nr
To inspect fields for a specific event type, run:
jq -c 'select(.event=="stage_attempt_end")' "$RUN_ROOT/progress.ndjson" | head -n 5
These patterns come from real investigation mistakes where the first query produced noisy output and the replacement query produced useful signal.
# noisy
tail -n 80 "$RUN_ROOT/progress.ndjson"
# higher-signal
jq -rc 'select(.event!="branch_heartbeat") | {ts,event,node_id,status,branch_key,branch_event,branch_status,branch_failure_reason}' \
"$RUN_ROOT/progress.ndjson" | tail -n 80
# broad but low immediate diagnostic value
jq -r '.event? // empty' "$RUN_ROOT/progress.ndjson" | sort | uniq -c | sort -nr
# better for current state
jq -rc 'select(.event!="branch_heartbeat") | {ts,event,node_id,status,branch_key,branch_event,branch_status,branch_failure_reason}' \
"$RUN_ROOT/progress.ndjson" | tail -n 40
# noisy across repo history
git rev-list HEAD | head -n 200
# scoped to this run's commits
git log --oneline --grep "$RUN_ID" -n 80
Exclude `target/` and artifact churn.
# noisy if commit touched build outputs
git show --stat <commit>
# source-focused
git show --stat <commit> -- demo/rogue/rogue-wasm/src
# brittle in strict pipefail shells
set -euo pipefail
jq -rc 'select(.event!="branch_heartbeat")' "$RUN_ROOT/progress.ndjson" | head -n 20
# safer wrapper for quick probes
set -euo pipefail
jq -rc 'select(.event!="branch_heartbeat")' "$RUN_ROOT/progress.ndjson" | head -n 20 || true
# activity-only view
git log --oneline --grep "$RUN_ID" -n 30
# source progress since last meaningful implementation commit
git diff --stat <last_meaningful_commit>..HEAD -- demo/rogue/rogue-wasm/src
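The strict-`pipefail` pattern above is brittle because `head` exits after reading its lines, and a producer that is still writing then fails on the closed pipe. The sketch below reproduces that with `yes` standing in for a long `jq` stream; `early_close_status` is a hypothetical name and the example assumes bash.

```shell
# Print the pipeline status seen under pipefail when the reader closes early.
early_close_status() (
  set -o pipefail
  # `yes` writes forever; `head` exits after one line and closes the pipe.
  yes | head -n 1 >/dev/null
  echo "$?"
)
```

The printed status is non-zero (typically 141, which is 128 plus SIGPIPE), which is why `set -e` aborts the script unless the probe ends in `|| true`.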
To handle cases where final.json exists but progress.ndjson still changes:
- If `final.json` exists and `progress.ndjson` is newer, this is a possible overlapping-resume state rather than immediate data corruption.
- Check whether a `run`/`resume` process is currently active.
stat -c '%n %y' "$RUN_ROOT/final.json" "$RUN_ROOT/live.json" "$RUN_ROOT/progress.ndjson" 2>/dev/null
tail -n 5 "$RUN_ROOT/progress.ndjson"
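The timestamp comparison can also be done without reading `stat` output by eye. This is a minimal sketch; `overlap_state` and its output labels are hypothetical.

```shell
# Hypothetical helper: report whether progress.ndjson changed after final.json.
# $1: run root.
overlap_state() (
  root="$1"
  [ -f "$root/final.json" ] || { echo no-final; exit 0; }
  # -nt is true when the left file has a newer modification time.
  if [ "$root/progress.ndjson" -nt "$root/final.json" ]; then
    echo possible-overlap
  else
    echo terminal
  fi
)
```

A `possible-overlap` result is a prompt to check for active run/resume processes, not proof that one is running.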
To determine whether the run is truly active at the OS level:
pgrep -af 'kilroy attractor (run|resume)'
[ -f "$RUN_ROOT/run.pid" ] && cat "$RUN_ROOT/run.pid"
[ -f "$RUN_ROOT/run.pid" ] && ps -fp "$(cat "$RUN_ROOT/run.pid")"
ps -ef | rg -i 'kilroy attractor (run|resume)' | rg -v rg
If a resume process is already active for the same --logs-root, launching another resume is a possible source of mixed terminal/live state.
A live PID with unchanged tail events across repeated checks is a possible stale/hung process, not active run progress.
E1="$(tail -n 1 "$RUN_ROOT/progress.ndjson")"
sleep 3
E2="$(tail -n 1 "$RUN_ROOT/progress.ndjson")"
[ "$E1" = "$E2" ] && echo "no new events" || echo "events advancing"
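The PID check and the tail comparison can be combined into one verdict. This is a sketch under assumptions: `run_liveness` and its labels are hypothetical, and `run.pid` is assumed to hold a single bare PID.

```shell
# Hypothetical helper: combine PID liveness with event movement.
# $1: run root; $2: seconds to wait between samples (assumed default: 3).
run_liveness() (
  root="$1"
  wait_s="${2:-3}"
  pid="$(cat "$root/run.pid" 2>/dev/null)"
  # kill -0 sends no signal; it only tests that the PID can be signalled.
  if [ -z "$pid" ] || ! kill -0 "$pid" 2>/dev/null; then
    echo not-running
    exit 0
  fi
  before="$(tail -n 1 "$root/progress.ndjson" 2>/dev/null)"
  sleep "$wait_s"
  after="$(tail -n 1 "$root/progress.ndjson" 2>/dev/null)"
  if [ "$before" = "$after" ]; then
    echo alive-no-new-events
  else
    echo advancing
  fi
)
```

A single `alive-no-new-events` sample is weak evidence; repeating the call with a longer interval gives a firmer signal, and a reused PID can still produce a false "alive".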
When relaunching, first quiesce duplicate resume processes, then launch a single detached resume; this reduces the chance of `stopped by signal terminated` outcomes.
ps -ef | rg -i "kilroy attractor resume --logs-root $RUN_ROOT" | rg -v rg
# If duplicates exist, stop extras before launching a single detached resume.
setsid -f bash -lc "cd /path/to/repo && ./kilroy attractor resume --logs-root '$RUN_ROOT' >> '$RUN_ROOT/resume.out' 2>&1"
To diagnose fan-in waits, inspect branch-local progress under parallel/<join-node>/:
find "$RUN_ROOT/parallel" -maxdepth 3 -type f -name 'progress.ndjson' | sort
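To see at a glance where each branch stopped, the `find` above can feed a loop that prints the last event per branch. This is a sketch; `branch_tails` is a hypothetical name.

```shell
# Hypothetical helper: print "<relative path><TAB><last event line>" per branch.
# $1: run root.
branch_tails() (
  root="$1"
  find "$root/parallel" -maxdepth 3 -type f -name 'progress.ndjson' 2>/dev/null |
    sort |
    while IFS= read -r f; do
      # Show the path relative to the run root, then the newest event.
      printf '%s\t%s\n' "${f#"$root"/}" "$(tail -n 1 "$f")"
    done
)
```

Running `branch_tails "$RUN_ROOT"` gives one line per branch, which makes a branch that ends on a heartbeat easy to spot next to branches that ended on a stage event.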
To inspect branch outcomes and status-contract behavior:
rg -n 'status_contract|stage_attempt_end|stage_retry_blocked|deterministic_failure_cycle_check|subgraph_deterministic_failure_cycle_check' \
"$RUN_ROOT"/parallel/*/*/progress.ndjson
To interpret heartbeat-only behavior:
- If only `branch_heartbeat` events appear and `branch_idle_ms` rises monotonically, the branch is likely waiting/stalled rather than converging.
- If `branch_progress` continues with new `stage_attempt_*` events, the branch is still making progress.
tail -n 200 "$RUN_ROOT/progress.ndjson" | rg 'branch_heartbeat|branch_progress|branch_idle_ms'
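The two heartbeat rules can be approximated mechanically. This is a heuristic sketch: `heartbeat_trend` and its labels are hypothetical, it matches event names as plain text, and it assumes compact JSON where the field appears as `"branch_idle_ms":<integer>`.

```shell
# Hypothetical helper: summarize recent branch activity.
# $1: progress.ndjson path; $2: lines to inspect (assumed default: 200).
heartbeat_trend() (
  file="$1"
  window="${2:-200}"
  tail -n "$window" "$file" |
    awk '
      BEGIN { rising = 1 }
      /branch_progress|stage_attempt_/ { progress = 1 }
      match($0, /"branch_idle_ms":[0-9]+/) {
        # Skip the 17-character key prefix to reach the number.
        v = substr($0, RSTART + 17, RLENGTH - 17) + 0
        if (seen && v <= prev) rising = 0
        prev = v
        seen++
      }
      END {
        if (progress) print "progressing"
        else if (seen > 1 && rising) print "idle-rising"
        else print "inconclusive"
      }'
)
```

An `idle-rising` result matches the stalled pattern described above; `inconclusive` means the window should be widened or read manually.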
To distinguish policy stops from compute hangs, inspect guardrail events directly:
rg -n 'stuck_cycle_breaker|deterministic_failure_cycle_breaker|stage_retry_blocked|deterministic_failure_cycle_check' \
"$RUN_ROOT/progress.ndjson"
Interpretation guidance:
- `stuck_cycle_breaker` with `visit_count`/`visit_limit` indicates a configured loop-visit stop.
- `*_cycle_breaker` indicates repeated deterministic failures reached the configured signature limit.
To identify models/providers accurately, combine static and runtime evidence:
- Static: `graph.dot` `model_stylesheet` and node classes.
- Runtime: `progress.ndjson` (`llm_retry`, `llm_call_*`, provider/model fields).
- Run configuration: `run_config.json`.
rg -n 'model_stylesheet|llm_model|llm_provider|class=' "$RUN_ROOT/graph.dot"
rg -n '"event":"llm_|"provider":"|"model":"' "$RUN_ROOT/progress.ndjson"
sed -n '1,220p' "$RUN_ROOT/run_config.json"
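For the runtime side, a frequency count of the provider and model values is often easier to read than raw matches. This is a sketch: `runtime_models` is a hypothetical name, and it assumes compact JSON with `"provider":"..."` and `"model":"..."` string fields, as the `rg` pattern above suggests.

```shell
# Hypothetical helper: count distinct provider/model values seen at runtime.
# $1: progress.ndjson path.
runtime_models() (
  # -o prints only the matched key/value pair, one per line.
  grep -oE '"(provider|model)":"[^"]*"' "$1" | sort | uniq -c | sort -nr
)
```

Values that appear here but not in `graph.dot` or `run_config.json` are worth a closer look, since they indicate routing that differs from the static configuration.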
- For `missing status.json`, check whether the codergen node emitted the required status signal.
- For llm retry with 429/rate-limit, check for provider quota or backoff pressure.
- For `deterministic_failure_cycle_check`, check for repeated deterministic failure at the same node.
- For setup failures, check `setup_command_*` events and stage `stderr.log`.
rg -n 'missing status.json|llm_retry|deterministic_failure_cycle_check|setup_command_|failure_reason' "$RUN_ROOT/progress.ndjson"
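The signature-to-check mapping can be sketched as a first-pass triage. `triage_failures` is a hypothetical name, and the matching is plain-text and heuristic: a `429` anywhere on an `llm_retry` line counts as a rate-limit hit, so results should be confirmed against the events themselves.

```shell
# Hypothetical helper: print a suggested next check per signature found.
# $1: progress.ndjson path.
triage_failures() (
  file="$1"
  if grep -q 'missing status.json' "$file"; then
    echo "missing status.json: check whether the codergen node emitted its status signal"
  fi
  if grep 'llm_retry' "$file" | grep -Eiq '429|rate[-_ ]?limit'; then
    echo "llm_retry with 429/rate-limit: check provider quota or backoff pressure"
  fi
  if grep -q 'deterministic_failure_cycle_check' "$file"; then
    echo "deterministic_failure_cycle_check: check for repeated failure at the same node"
  fi
  if grep -q 'setup_command_' "$file"; then
    echo "setup_command_*: check setup events and the stage stderr.log"
  fi
)
```

No output means none of these signatures appeared, in which case `failure_reason` fields are the next place to look.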
Capture final.json timestamp, latest progress.ndjson timestamp, active resume PIDs/PPIDs, and termination events (stopped by signal terminated, subgraph_canceled_exit, stage_attempt_end).
stat -c '%y %n' "$RUN_ROOT/final.json" "$RUN_ROOT/live.json" "$RUN_ROOT/progress.ndjson" 2>/dev/null
ps -ef | rg -i "kilroy attractor resume --logs-root $RUN_ROOT" | rg -v rg
rg -n 'stopped by signal terminated|subgraph_canceled_exit|stage_attempt_end' "$RUN_ROOT/progress.ndjson" | tail -n 80
To present findings clearly, report in this order:
run_id, run_root, started time.