skills/fleet-management/SKILL.md
Rolling deployment strategies, multi-device coordination, and rollback triggers for edge device fleets. Use when managing fleet-wide deployments, configuring rollout strategies, building device registries, or implementing rollback automation.
"In a fleet of a thousand devices, you do not fear the one that fails -- you fear the nine hundred and ninety-nine that fail silently." -- Kelsey Hightower, Principal Engineer, Google
This skill provides the operational knowledge for managing deployments across fleets of heterogeneous edge devices. It covers rolling deployment strategies, device registry management, health-gated rollouts, and automatic rollback triggers. Every pattern assumes edge devices are remote, resource-constrained, and potentially unreliable.
Non-Negotiable Constraints:
| Principle | Description | Priority |
|-----------|-------------|----------|
| Canary First | Every deployment begins with a canary subset; never skip canary even for hotfixes | Critical |
| Health-Gated Waves | Each rollout wave must pass health checks before the next wave begins | Critical |
| Rollback Independence | Rollback mechanism must work even if the new version is completely non-functional | Critical |
| Device Registry Accuracy | Maintain up-to-date inventory of device capabilities, versions, and health status | High |
| Offline Tolerance | Gracefully handle devices offline during deployment; catch them up later | High |
| Percentage-Based Rollout | Define rollout stages as fleet percentages, not absolute device counts | High |
| Automatic Rollback Triggers | Define measurable failure thresholds that trigger rollback without human intervention | High |
| Deployment Atomicity | A deployment to a single device either fully succeeds or fully rolls back; no partial states | Medium |
| Heterogeneous Fleet Support | Support mixed device types (Jetson, RPi, gateways) in a single coordinated deployment | Medium |
| Audit Trail | Every deployment action must be logged with timestamp, device ID, actor, and outcome | Medium |
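The Percentage-Based Rollout principle can be illustrated with a small sizing helper. This is a minimal sketch, not part of the skill itself: the function name `wave_sizes` and the cumulative targets in the usage lines (5, 25, 50, 100) are assumptions chosen for illustration.

```python
import math
from typing import List


def wave_sizes(fleet_total: int, cumulative_percentages: List[float]) -> List[int]:
    """Return how many devices each stage adds, given cumulative fleet percentages.

    The percentages are hypothetical inputs (for example [5, 25, 50, 100]).
    Rounding up means a small fleet still gets at least one canary device.
    """
    sizes = []
    deployed = 0
    for pct in cumulative_percentages:
        target = min(fleet_total, math.ceil(fleet_total * pct / 100))
        target = max(target, deployed)  # a later stage never shrinks the deployed set
        sizes.append(target - deployed)
        deployed = target
    return sizes


print(wave_sizes(1000, [5, 25, 50, 100]))  # [50, 200, 250, 500]
print(wave_sizes(7, [5, 25, 50, 100]))     # [1, 1, 2, 3]
```

Expressing stages as percentages keeps the same rollout plan usable as the fleet grows or shrinks; only `fleet_total` changes.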
| Query | When to Call |
|-------|--------------|
| search_knowledge("rolling deployment canary staged rollout") | During PREPARE/CANARY — selecting and sizing deployment waves |
| search_knowledge("health check liveness readiness probe") | During VALIDATE — designing application-aware health checks |
| search_knowledge("blue-green deployment rollback strategy") | During CANARY/ROLLOUT — choosing and configuring rollback mechanisms |
| search_knowledge("edge device fleet OTA update") | During PREPARE — understanding OTA update constraints for embedded devices |
| search_knowledge("device registry inventory management") | During PREPARE — structuring the device registry schema |
| search_code_examples("Docker container rollback Python") | Before writing rollback automation |
| search_code_examples("health endpoint Flask FastAPI") | Before implementing health endpoints |
Search automation and architecture collections for fleet coordination patterns; edge_ai for Jetson-specific deployment notes.
The deployment lifecycle flows: PREPARE → VALIDATE → CANARY → (human approval) → WAVE 1 → WAVE 2 → WAVE 3 → CONFIRM. Health gates between every wave. Rollback at any phase returns to PREPARE.
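One way to picture the lifecycle is as a small state machine. The sketch below is an assumption about how the transitions could be encoded, not the skill's own implementation: `Phase`, `next_phase`, and the `health_ok` / `human_approved` flags are hypothetical names.

```python
from enum import Enum


class Phase(Enum):
    PREPARE = 1
    VALIDATE = 2
    CANARY = 3
    WAVE_1 = 4
    WAVE_2 = 5
    WAVE_3 = 6
    CONFIRM = 7


ORDER = list(Phase)  # Enum iteration follows definition order


def next_phase(current: Phase, health_ok: bool, human_approved: bool = False) -> Phase:
    """Advance one step, or fall back to PREPARE when the health gate fails."""
    if not health_ok:
        return Phase.PREPARE  # rollback at any phase returns to PREPARE
    if current is Phase.CANARY and not human_approved:
        return Phase.CANARY  # hold after canary until a human approves
    if current is Phase.CONFIRM:
        return Phase.CONFIRM  # terminal phase
    return ORDER[ORDER.index(current) + 1]
```

For example, `next_phase(Phase.CANARY, health_ok=True)` stays at `CANARY` until it is called again with `human_approved=True`, and `next_phase(Phase.WAVE_2, health_ok=False)` returns `PREPARE`.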
| Strategy | Best For | Tradeoff | Risk Level |
|----------|----------|----------|------------|
| Canary + Rolling | Most edge fleets | Balanced speed and safety | Low |
| Blue-Green | Fleets with hot-standby capacity | Fast rollback, double resources | Low |
| Rolling Update | Homogeneous fleets with stateless apps | Simple, no extra resources | Medium |
| A/B Deploy | Feature testing across device subsets | Complex routing, useful metrics | Medium |
| Big Bang | Never for edge fleets | — | Unacceptable |
If ANY item is unchecked — STOP. Resolve before deploying.
Canary (1–5% of fleet):
Staged Waves:
Automatic Rollback Triggers:
Requires two deployment slots per device (BLUE = active, GREEN = standby). Deploy new artifact to GREEN on canary devices → verify → switch canary traffic BLUE→GREEN → deploy GREEN to remaining fleet in waves → switch traffic after each wave verification. Rollback: switch traffic back from GREEN to BLUE — seconds, no file transfer needed. Use when: devices have sufficient resources for two slots, zero-downtime deployment is required, instant rollback is a hard requirement.
Layer 1 Connectivity — ICMP ping, SSH port open, deployment agent heartbeat
Layer 2 System Health — CPU/memory/disk below thresholds, temperature below thermal limit
Layer 3 Application — Health endpoint 200, version matches expected, no crash loops
Layer 4 Functional — Correct inference output on test input, E2E latency within bounds
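The four layers above can be evaluated in order, stopping at the first one that fails so the report names the lowest broken layer. This is a rough sketch under stated assumptions: the layer keys and `first_failing_layer` are hypothetical, and each check is a caller-supplied callable, since the real probes (ping, SSH, HTTP, inference test) depend on the device.

```python
from typing import Callable, Dict, Optional

# Hypothetical layer keys, ordered from Layer 1 to Layer 4.
LAYERS = ["connectivity", "system", "application", "functional"]


def first_failing_layer(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run checks from Layer 1 upward; return the first failing layer, or None."""
    for layer in LAYERS:
        check = checks.get(layer)
        try:
            # A missing check counts as a failure rather than a silent pass.
            passed = check is not None and bool(check())
        except Exception:
            passed = False  # a probe that raises is treated as a failed probe
        if not passed:
            return layer
    return None
```

Stopping early avoids running an expensive functional test against a device that is already known to be unreachable or overloaded.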
Health endpoint response: {"status": "healthy|degraded|unhealthy", "version": "...", "checks": {...}}
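A deployment gate could consume that response along the following lines. This is a sketch, not the skill's defined behavior: `health_gate_passes` is a hypothetical name, and treating "degraded" as a gate failure is an assumption that a real rollout policy might relax.

```python
import json


def health_gate_passes(http_status: int, body: str, expected_version: str) -> bool:
    """Pass only on HTTP 200, status "healthy", and the expected version."""
    if http_status != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # an unparseable body is not evidence of health
    if not isinstance(payload, dict):
        return False
    # Assumption: "degraded" and "unhealthy" both block the next wave.
    return payload.get("status") == "healthy" and payload.get("version") == expected_version
```

Checking the version as well as the status catches devices that report healthy while still running the previous artifact.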
Snapshot-based: Snapshot filesystem/image before deploy → store locally with checksum → on failure: stop new app, restore snapshot, verify health.
Dual-slot: /opt/app/active symlink → slot-a (previous known-good) or slot-b (new version). Rollback = update symlink to slot-a, restart. Seconds to complete, no transfer needed.
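The dual-slot switch can be sketched as an atomic symlink replacement. The function name `switch_active_slot` is hypothetical, the example assumes a POSIX filesystem, and the service restart is deliberately left out because it depends on the device's init system.

```python
import os


def switch_active_slot(app_root: str, slot: str) -> None:
    """Point <app_root>/active at <slot> by renaming a temporary symlink into place."""
    target = os.path.join(app_root, slot)
    if not os.path.isdir(target):
        raise FileNotFoundError(f"slot directory does not exist: {target}")
    link = os.path.join(app_root, "active")
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)  # clear a leftover from an interrupted switch
    os.symlink(slot, tmp)  # relative link, so the tree can be relocated
    os.replace(tmp, link)  # rename is atomic on POSIX; 'active' never goes missing
```

A rollback would then be `switch_active_slot("/opt/app", "slot-a")` followed by a restart. Building the new link under a temporary name and renaming it avoids a window in which `active` does not exist.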
Container-based: docker tag app-current app-rollback before deploy. Rollback: docker stop app-current && docker rm app-current && docker run app-rollback.
<fleet-deploy-state>
phase: [PREPARE | CANARY | VERIFY | ROLLOUT | CONFIRM]
strategy: [canary-rolling | blue-green | rolling-update]
artifact: [name and version]
fleet_total: N
deployed_count: N
healthy_count: N
quarantined_count: N
skipped_count: N
rollback_available: [true | false]
current_wave: [N/M]
last_action: [description]
next_action: [description]
blockers: [any issues]
</fleet-deploy-state>
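A state block in this shape is simple to parse for tooling or tests. The parser below is a sketch under assumptions: `parse_fleet_state` and `COUNT_FIELDS` are hypothetical names, and it only converts the count fields to integers when they contain plain digits.

```python
import re
from typing import Dict, Union

# Fields from the block above that hold device counts.
COUNT_FIELDS = ("fleet_total", "deployed_count", "healthy_count",
                "quarantined_count", "skipped_count")


def parse_fleet_state(text: str) -> Dict[str, Union[str, int]]:
    """Extract key/value pairs from the first <fleet-deploy-state> block in text."""
    match = re.search(r"<fleet-deploy-state>(.*?)</fleet-deploy-state>", text, re.DOTALL)
    if match is None:
        raise ValueError("no <fleet-deploy-state> block found")
    state: Dict[str, Union[str, int]] = {}
    for line in match.group(1).splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)
        state[key.strip()] = value.strip()
    for key in COUNT_FIELDS:
        value = state.get(key)
        if isinstance(value, str) and value.isdigit():
            state[key] = int(value)
    return state
```

Splitting on the first colon only keeps values such as `current_wave: 2/3` or free-text descriptions containing colons intact.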
## Fleet Deployment Report: [Artifact] v[version]
**Strategy**: [strategy] | **Duration**: [start] to [end]
| Status | Count | % |
|--------|-------|---|
| Deployed (healthy) | N | % |
| Skipped (unreachable) | N | % |
| Quarantined (failed) | N | % |
| Wave | Devices | Failed | Soak | Verdict |
|------|---------|--------|------|---------|
| Canary | N | N | [duration] | PASS/FAIL |
| Wave 1/2/3 | N | N | [duration] | PASS/FAIL |
**Health Delta**: Error rate [+/-], P95 latency [+/-], CPU [+/-], Memory [+/-]
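The status table in the report template could be filled in from raw counts as sketched here. `status_table` is a hypothetical helper, and using the sum of the three buckets as the percentage denominator is an assumption; a report could equally use the registry's fleet total.

```python
def status_table(healthy: int, skipped: int, quarantined: int) -> str:
    """Render the report's status table with one-decimal percentages."""
    total = healthy + skipped + quarantined
    rows = [
        ("Deployed (healthy)", healthy),
        ("Skipped (unreachable)", skipped),
        ("Quarantined (failed)", quarantined),
    ]
    lines = ["| Status | Count | % |", "|--------|-------|---|"]
    for label, count in rows:
        pct = 100 * count / total if total else 0.0  # avoid dividing by zero
        lines.append(f"| {label} | {count} | {pct:.1f}% |")
    return "\n".join(lines)


print(status_table(90, 7, 3))
```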
| Anti-Pattern | Why It's Wrong | Correct Approach |
|--------------|----------------|------------------|
| Deploying to all devices at once | A single bug bricks the entire fleet; recovery takes weeks | Use canary + staged waves with health gates |
| Skipping canary for "small" changes | Small changes cause production incidents too; one-line bugs exist | Always canary, regardless of change size |
| Health checks that only ping | A device can respond to ping while serving garbage results | Implement application-aware health checks |
| No soak period between waves | Issues that take minutes to manifest (memory leaks, thermal) are missed | Enforce minimum soak periods |
| Rollback that depends on the new version | If the new version crashes on startup, rollback fails too | Rollback must be independent of application health |
| Treating offline devices as failed | Edge devices go offline legitimately | Track offline devices separately; catch them up later |
| Manual rollback procedures | Under pressure, humans skip steps | Define automatic rollback triggers with measurable thresholds |
| Deploying without a device registry | Cannot track what is deployed where, making rollback and auditing impossible | Maintain an accurate, up-to-date device registry |
Wave exceeds failure threshold: HALT current wave immediately. Rollback ALL devices in the failed wave. Verify rollback restores healthy state. Analyze failure pattern: same failure on all devices (artifact issue), specific device type (compatibility issue), random failures (infrastructure issue). Do NOT proceed until root cause is identified.
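The failure-pattern triage described above can be approximated with a first-pass heuristic. This is only a sketch: `classify_failures`, its input shapes, and the exact rules are assumptions, and its output is a starting hypothesis for root-cause analysis rather than a verdict.

```python
from typing import Dict


def classify_failures(wave_devices: Dict[str, str], failures: Dict[str, str]) -> str:
    """Guess the failure category for a wave.

    wave_devices maps device ID to device type; failures maps the ID of each
    failed device to an error signature. Both shapes are hypothetical.
    """
    if not failures:
        return "none"
    failed_types = {wave_devices[device_id] for device_id in failures}
    all_types = set(wave_devices.values())
    same_error = len(set(failures.values())) == 1
    if len(failures) == len(wave_devices) and same_error:
        return "artifact"  # same failure on every device in the wave
    if len(failed_types) == 1 and len(all_types) > 1:
        return "compatibility"  # failures confined to one device type
    return "infrastructure"  # scattered failures with no clear pattern
```

For instance, a wave of one Jetson and two RPi devices where only the Jetson fails is reported as "compatibility", while the same error on all three is reported as "artifact".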
Canary shows gradual degradation: Extend soak period to confirm the trend. Capture detailed metrics (1-second intervals). If degradation continues: rollback canary, verify metrics return to baseline, report the pattern. Common causes: memory leak, resource contention, thermal throttling. Do NOT proceed to fleet rollout with gradual degradation.
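Confirming a degradation trend during an extended soak could be done with a least-squares slope over the sampled metric. This is a sketch under assumptions: `slope`, `is_degrading`, and the `max_slope` threshold are hypothetical, and samples are assumed to be evenly spaced (for example one per second).

```python
from typing import List


def slope(samples: List[float]) -> float:
    """Least-squares slope of evenly spaced samples, in metric units per sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def is_degrading(samples: List[float], max_slope: float) -> bool:
    """True when the metric rises faster than the allowed (hypothetical) threshold."""
    return slope(samples) > max_slope
```

A slope fitted over the whole soak window is less sensitive to single noisy readings than comparing only the first and last sample.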
Device registry out of sync: Run fleet-wide heartbeat scan. Compare against registry. Device in registry but not responding → mark OFFLINE. Device responding but not in registry → add to registry. Capability mismatch → update registry from device report. Do NOT deploy to devices with unresolved discrepancies.
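The reconciliation rules above reduce to set differences between the registry and the heartbeat scan. The sketch below assumes both are dictionaries keyed by device ID whose values are capability records; `reconcile` and the result keys are hypothetical names.

```python
from typing import Dict, List


def reconcile(registry: Dict[str, dict], responding: Dict[str, dict]) -> Dict[str, List[str]]:
    """Compare registry entries against devices that answered the heartbeat scan."""
    registered = set(registry)
    seen = set(responding)
    return {
        # In the registry but silent: mark OFFLINE.
        "mark_offline": sorted(registered - seen),
        # Responding but unknown: add to the registry.
        "add_to_registry": sorted(seen - registered),
        # Known and responding, but reported capabilities differ: update the registry.
        "update_capabilities": sorted(
            device_id for device_id in registered & seen
            if registry[device_id] != responding[device_id]
        ),
    }
```

A deployment would proceed only once all three lists have been resolved, which matches the rule against deploying to devices with unresolved discrepancies.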
jetson-deploy -- Use for Jetson-specific device-level configuration (TensorRT engine building, power mode, JetPack verification). Fleet management handles coordination; jetson-deploy handles device-level execution.
sensor-integration -- When the fleet includes sensor payloads, coordinate sensor configuration alongside application deployment and re-validate calibration after software updates.
edge-cv-pipeline -- Health checks for CV pipeline deployments should include inference accuracy validation, not just application liveness. Use edge-cv-pipeline patterns for functional health check definition.