toolkit/packages/skills/swarm-agent-troubleshooting/SKILL.md
Diagnose and fix swarm agent spawn failures when agents don't start processing tasks
npx skillsauth add stevengonsalvez/agents-in-a-box swarm-agent-troubleshootingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Diagnose and fix swarm agent spawn failures when agents don't start processing tasks.
wait_for_claude_ready() times out# 1. List swarm sessions
tmux list-sessions | grep swarm
# 2. Attach to stuck agent
tmux attach -t swarm-XXXX-agent-1
# 3. Check what's showing:
# - Shell prompt (zsh/bash) = Claude didn't start
# - "dotenv: found '.env'" = Blocked by prompt
# - Claude banner = Ready (check if task was sent)
| Symptom | Cause | Fix |
|---------|-------|-----|
| dotenv: found '.env' file. Source it? | dotenv plugin prompt | Send a (always) then restart claude |
| Shell prompt showing | Claude command never ran | Manually run claude --dangerously-skip-permissions |
| Claude banner but idle | Task not sent after spawn | Resend the task prompt |
| "command not found: claude" | Claude not in PATH | Check which claude or use full path |
# Step 1: Attach to the stuck session
tmux attach -t swarm-XXXX-agent-1
# Step 2: If at dotenv prompt, answer it
# (type 'a' for always, then Enter)
# Step 3: If at shell, start Claude manually
claude --dangerously-skip-permissions
# Step 4: Once Claude is ready, send the task
# (paste the agent's task prompt)
# Step 5: Detach (Ctrl+B, D) and continue monitoring
Add to ~/.zshrc or ~/.bashrc:
# Auto-accept dotenv in non-interactive contexts
export DOTENV_ASSUME_YES=1
In spawn-agent-lib.sh, increase the timeout:
wait_for_claude_ready "$SESSION" 60 # Increase from 30 to 60 seconds
The wait_for_claude_ready() function looks for these patterns:
If your terminal has different output, add patterns to the grep regex.
Failed spawn attempts save debug output:
cat /tmp/spawn-agent-swarm-XXXX-agent-1-failure.log
/swarm-status - Check overall swarm health/swarm-shutdown - Clean up stuck swarmspawn-agent-lib.sh - Core spawning librarydocumentation
Report reflect drain spend over a time window — tokens split by cached (cache_read), uncached writes (cache_creation), and io (input+output), with a $ estimate, grouped by day / outcome / model / transcript. Reads the drainer's cost log and surfaces outlier runs and cache-reuse health (the 41.5M-token failure mode = low cache reuse + high cache writes). Use to answer "what is reflection costing me" for the last day / week.
development
Show fleet status — every claude session running on the host, merged across ainb + claude-peers broker + background jobs. Use when you need to enumerate sessions before composing an action, see which sessions have a peer registered (broker-routable) vs tmux-only, check the `summary` of each session, or pipe the list into jq for filtering. Default output: text table. Pass --format json for LLM consumption.
testing
Ordered multi-step prompts to fleet targets, ack-gated between steps via JSONL assistant-turn-end detection. Use for cycles like disconnect→reconnect→verify, or any flow where step N+1 requires step N to have completed first. The skill BLOCKS until each target's transcript shows the next assistant turn finishing OR per-step timeout fires (default 300s).
development
Center control panel — enumerate every claude session that is blocked waiting on something: a user answer (AskUserQuestion fired), an API error retry, an idle assistant turn-end with no follow-up, or an explicit WAITING: marker. Returns rich JSON with signal kind + context per session. Use this when you've stepped away from the fleet and want one place to see everything that wants your attention and answer it.