tools/hermes/fleet_update/SKILL.md
Walk the Crane fleet, assess host health, apply safe updates, and file issues for anything needing human judgment. Runs weekly on mini via a systemd timer (see tools/hermes/systemd/). Posts a machine-source snapshot to crane-context's fleet_health_findings table (the same ingest pipeline the weekly GitHub-state audit uses).
You are the fleet update orchestrator. Your job is to SSH into every crane dev machine, classify what's pending, apply the safe fixes, and file GitHub issues for anything that needs human judgment.
Canonical plan: ~/.claude/plans/cuddly-riding-sifakis.md (#657 in
venturecrane/crane-console).
Systemd timer fleet-update.timer fires this skill weekly (Sunday 07:20
local, ±15min jitter) on the mini box only. You can also trigger
manually with systemctl start fleet-update.service.
Read from /etc/fleet-update/fleet-update.env:
- FLEET_UPDATE_APPLY (bool, default false): apply-gate. False = classify only, don't execute. Two-week canary period after initial rollout. Captain flips to true after validating classifications.
- CRANE_ADMIN_KEY: X-Admin-Key for POST /admin/fleet-health/ingest.
- CRANE_CONTEXT_BASE: e.g. https://crane-context.automation-ab6.workers.dev.
- GH_TOKEN: scoped PAT for gh issue create|edit in venturecrane/crane-console.

Repo state: the systemd unit ExecStartPre= has already done
git fetch && git reset --hard origin/main on /srv/crane-console, so
the orchestrator tools, suppression list, and machine-health.sh on
disk are exactly origin/main. Record the SHA via
git -C /srv/crane-console rev-parse HEAD and put it in every finding's
extra.source_sha so stale-code drift is visible in the ingested data.
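
A minimal sketch of how the environment file and source SHA could be read, assuming the file holds plain KEY=VALUE lines. The helper names `parse_env`, `apply_enabled`, and `source_sha` are hypothetical and do not come from the repo.

```python
import subprocess


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments (assumed file format)."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def apply_enabled(env: dict) -> bool:
    """FLEET_UPDATE_APPLY defaults to false; only an explicit 'true' enables apply."""
    return env.get("FLEET_UPDATE_APPLY", "false").lower() == "true"


def source_sha(repo: str = "/srv/crane-console") -> str:
    """Return the checked-out commit via `git -C <repo> rev-parse HEAD`."""
    out = subprocess.run(
        ["git", "-C", repo, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()
```

Treating anything other than an explicit `true` as false keeps the canary default intact if the variable is missing or misspelled.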
The machine list and SSH users are in scripts/setup-ssh-mesh.sh
(lines 75–82) or tools/hermes/fleet_update/machines.yaml if present.
Fields you need per machine: alias, tailscale_ip, ssh_user, role.
mini itself runs commands locally (no SSH). All others use Tailscale
SSH. The canonical cross-user pairing is smdurgan@mini (executor) →
scottdurgan@<alias> (targets).
Read tools/hermes/fleet_update/suppressions.yaml. Format:
- machine: mac23
  types: ['*'] # never auto-apply anything on mac23
  reason: 'captain workstation'
- machine: mbp27
  types: ['brew-outdated']
  reason: 'manually managed tooling'
Wildcard * applies to every finding type. Classification still runs
and findings still ingest — only apply is gated.
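
The matching rule can be sketched as follows, assuming the YAML has already been parsed into a list of dicts (for example with PyYAML's `yaml.safe_load`, which is an assumption about tooling). The function name `suppression_reason` is hypothetical.

```python
def suppression_reason(suppressions, machine, finding_type):
    """Return the reason of the first matching entry, or None if apply is not suppressed."""
    for entry in suppressions:
        if entry.get("machine") != machine:
            continue
        types = entry.get("types", [])
        # '*' suppresses apply for every finding type on that machine.
        if "*" in types or finding_type in types:
            return entry.get("reason", "")
    return None


# Same content as the example entries above, in parsed form.
SUPPRESSIONS = [
    {"machine": "mac23", "types": ["*"], "reason": "captain workstation"},
    {"machine": "mbp27", "types": ["brew-outdated"], "reason": "manually managed tooling"},
]
```

A match only gates the apply step; the finding is still classified and ingested.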
Run machine-health.sh --quick --json (locally for mini, over SSH for
others). Wrap SSH with bash -lc so macOS targets load .zprofile for
brew PATH:
ssh -o BatchMode=yes -o ConnectTimeout=10 scottdurgan@<alias> \
'bash -lc "~/dev/crane-console/scripts/machine-health.sh --quick --json"'
Parse the JSON. Fields of interest: os_security, os_updates,
brew_outdated, reboot_required, uptime_days, xcode_clt_outdated,
disk (string like "87%").
If a machine is unreachable, emit a single preflight-fail finding and
move on. Do not retry within a run.
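
One way the collection step might look, as a sketch only. The local script path on mini, the 120-second timeout, and the helper names `health_command`, `collect`, and `disk_percent` are assumptions for illustration.

```python
import json
import subprocess


def health_command(alias: str, ssh_user: str = "scottdurgan") -> list:
    """Build argv: run locally on mini, otherwise SSH with a login shell."""
    if alias == "mini":
        # Assumed local path, based on the /srv/crane-console checkout.
        return ["/srv/crane-console/scripts/machine-health.sh", "--quick", "--json"]
    remote = 'bash -lc "~/dev/crane-console/scripts/machine-health.sh --quick --json"'
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
            f"{ssh_user}@{alias}", remote]


def collect(alias: str, ssh_user: str = "scottdurgan") -> dict:
    """Run the health script once; any failure becomes a single preflight-fail record."""
    try:
        out = subprocess.run(health_command(alias, ssh_user), check=True,
                             capture_output=True, text=True,
                             timeout=120)  # timeout value is an assumption
        return {"alias": alias, "health": json.loads(out.stdout)}
    except (subprocess.SubprocessError, OSError, ValueError) as exc:
        # No retry within a run: report and let the caller move on.
        return {"alias": alias, "preflight_fail": str(exc)}


def disk_percent(disk: str) -> int:
    """Convert a string like "87%" to the integer 87."""
    return int(disk.strip().rstrip("%"))
```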
Every non-zero signal becomes a candidate finding. Classify as safe-auto vs needs-human:
| Finding type | Default classification |
| ------------------------------------------- | ---------------------- |
| os-security-patches (Linux security-only) | safe-auto |
| brew-outdated (≤ 20 formulae, no casks) | safe-auto |
| os-feature-updates (macOS feature/major) | needs-human |
| reboot-required | needs-human |
| xcode-clt-outdated | needs-human |
| uptime-high (> 30 days) | needs-human |
| disk-pressure (> 90%) | needs-human |
| preflight-fail / unreachable | needs-human |
Classification can use judgment. Prefer needs-human when ambiguous. Never auto-apply anything that could require a reboot.
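
The defaults in the table could be encoded roughly like this. It is a sketch of the default rules only, not a replacement for judgment; the parameters `formulae`, `casks`, `is_linux`, and `may_reboot` are hypothetical inputs derived from the health JSON.

```python
SAFE_AUTO = "safe-auto"
NEEDS_HUMAN = "needs-human"
MAX_BREW_FORMULAE = 20  # threshold from the table


def classify(finding_type: str, *, formulae: int = 0, casks: int = 0,
             is_linux: bool = False, may_reboot: bool = False) -> str:
    """Default classification; anything ambiguous or unknown falls to needs-human."""
    if may_reboot:
        # Never auto-apply anything that could require a reboot.
        return NEEDS_HUMAN
    if finding_type == "os-security-patches" and is_linux:
        return SAFE_AUTO
    if (finding_type == "brew-outdated" and casks == 0
            and 0 < formulae <= MAX_BREW_FORMULAE):
        return SAFE_AUTO
    return NEEDS_HUMAN
```

Defaulting to `needs-human` means a new or unrecognized finding type is never applied automatically.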
For each candidate safe-auto finding:

- FLEET_UPDATE_APPLY=false: skip apply, keep classification.
- suppressions.yaml matches this (machine, type) or (machine, *): skip apply, note auto_applied: false, apply_skipped: "suppressed:<reason>".
- os-security-patches: ssh <user>@<alias> 'sudo unattended-upgrade -d' (quiet success expected; the unattended-upgrades package is the floor and this just nudges it).
- brew-outdated: ssh <user>@<alias> 'bash -lc "brew upgrade --quiet"' (no casks, no --greedy).

Record per finding: extra.auto_applied (bool), extra.apply_exit_code,
extra.apply_output_tail (last ~20 lines).
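
A sketch of the gate order and the recorded fields, under stated assumptions: `plan_apply`, `output_tail`, and `record_result` are hypothetical names, and treating exit code 0 as `auto_applied: true` is an assumption about how success is judged.

```python
from typing import Optional, Tuple


def plan_apply(apply_enabled: bool, suppressed_reason: Optional[str]) -> Tuple[bool, dict]:
    """Decide whether to run the apply command; return (should_apply, extra fields)."""
    if not apply_enabled:
        # Apply-gate closed: classification is kept, nothing executes.
        return False, {"auto_applied": False}
    if suppressed_reason is not None:
        return False, {"auto_applied": False,
                       "apply_skipped": f"suppressed:{suppressed_reason}"}
    return True, {}


def output_tail(output: str, lines: int = 20) -> str:
    """Keep only the last `lines` lines of command output."""
    return "\n".join(output.splitlines()[-lines:])


def record_result(exit_code: int, output: str) -> dict:
    """Build the extra.* fields recorded after an apply command has run."""
    return {
        "auto_applied": exit_code == 0,  # assumption: zero exit means applied
        "apply_exit_code": exit_code,
        "apply_output_tail": output_tail(output),
    }
```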
One POST per run, not per machine:
{
"org": "venturecrane",
"timestamp": "<ISO8601 now>",
"status": "pass|fail",
"source": "machine",
"findings": [
{
"repo": "machine/mini",
"rule": "os-security-patches",
"severity": "warning",
"message": "3 security updates pending (applied)",
"extra": {
"classification": "safe-auto",
"auto_applied": true,
"apply_exit_code": 0,
"apply_output_tail": "...",
"source_sha": "<git SHA>",
"apply_mode": "apply"
}
}
]
}
POST to ${CRANE_CONTEXT_BASE}/admin/fleet-health/ingest with header
X-Admin-Key: ${CRANE_ADMIN_KEY}. Expect HTTP 200. Non-200 is a hard
failure — log and exit non-zero; do not retry within the run (next
week's run reconciles).
The source: "machine" discriminator is load-bearing. Without it,
the ingest endpoint would auto-resolve open GitHub findings using this
snapshot. See migration 0037 and ingestFleetHealth
(workers/crane-context/src/fleet-health.ts) for the scoped-resolve contract.
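
A hedged sketch of building the single per-run snapshot and posting it with the standard library. The function names are hypothetical, the `Content-Type` header and 30-second timeout are assumptions, and how `status` is derived is left to the caller because the text does not define it.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_snapshot(findings: list, source_sha: str, status: str) -> dict:
    """Assemble one snapshot for the whole run, stamping source_sha on every finding."""
    stamped = [
        {**f, "extra": {**f.get("extra", {}), "source_sha": source_sha}}
        for f in findings
    ]
    return {
        "org": "venturecrane",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "status": status,      # "pass" or "fail", chosen by the caller
        "source": "machine",   # load-bearing discriminator
        "findings": stamped,
    }


def post_snapshot(base_url: str, admin_key: str, snapshot: dict) -> None:
    """POST once; any non-200 outcome raises so the run can exit non-zero."""
    req = urllib.request.Request(
        f"{base_url}/admin/fleet-health/ingest",
        data=json.dumps(snapshot).encode("utf-8"),
        headers={"Content-Type": "application/json",  # assumed header
                 "X-Admin-Key": admin_key},
        method="POST",
    )
    # urlopen raises HTTPError for 4xx/5xx; other non-200 codes are caught below.
    with urllib.request.urlopen(req, timeout=30) as resp:
        if resp.status != 200:
            raise RuntimeError(f"ingest returned HTTP {resp.status}")
```

There is deliberately no retry loop: a failed POST surfaces as an exception and the next weekly run reconciles.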
For each needs-human finding, upsert a GitHub issue in
venturecrane/crane-console:

- Title: [fleet] <alias>: <finding_type> (e.g. [fleet] mac23: reboot-required).
- Labels: fleet:<alias>, type:patch.
- If a [fleet] <alias>: <type> issue is no longer in the current snapshot, close it with a comment ("resolved by next snapshot at <timestamp>"). Match by title to avoid touching issues the Captain filed manually.

POST to ${CRANE_CONTEXT_BASE}/schedule/fleet-machine-check/complete
(X-Relay-Key auth) with a one-line summary. This makes SOS's Cadence block
show a fresh last_completed_at.
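
The title-based reconciliation of issues could be sketched as a pure function that decides what to create, update, and close. The names `issue_title` and `reconcile` and the exact title pattern are assumptions; the actual issue operations (for example via `gh issue create|edit`) are left to the caller.

```python
import re

# Assumed shape of orchestrator-owned titles: "[fleet] <alias>: <type>".
FLEET_TITLE = re.compile(r"\[fleet\] [^:\s]+: \S+")


def issue_title(alias: str, finding_type: str) -> str:
    return f"[fleet] {alias}: {finding_type}"


def reconcile(open_titles: set, needs_human: list) -> tuple:
    """Return (to_create, to_update, to_close) lists of issue titles.

    open_titles: titles of currently open issues in the repo.
    needs_human: (alias, finding_type) pairs from the current snapshot.
    """
    wanted = {issue_title(alias, ftype) for alias, ftype in needs_human}
    # Only titles matching the fleet pattern are managed; manual issues are ignored.
    managed = {t for t in open_titles if FLEET_TITLE.fullmatch(t)}
    to_create = sorted(wanted - managed)
    to_update = sorted(wanted & managed)
    to_close = sorted(managed - wanted)
    return to_create, to_update, to_close
```

Filtering on the title pattern before computing the close set is what keeps manually filed issues out of reach.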
Format: fleet-update: N machines, K applied, M issues, P failures.
systemd captures this in /var/log/fleet-update/run.log.
- preflight-fail, continue.
- extra.apply_failed=true and filing the issue.
- fleet_health.

- crane_doc('global', 'fleet-ops.md'): fleet architecture + safety rules.
- workers/crane-context/src/fleet-health.ts: ingest DAL.
- scripts/machine-health.sh: per-machine data collection (JSON mode).
- scripts/bootstrap-unattended-upgrades.sh: Linux security floor (Phase A).