agentic/code/frameworks/security-engineering/skills/degraded-mode-design/SKILL.md
Decision-aid skill for fail-closed vs fail-open behavior. Forces a degraded-mode matrix and rejects "type Y to override" prompts in favor of multi-step ceremonies.
Decision aid for what a system does when something goes wrong: a key is missing, a verification fails, a sensor times out, a network partition is detected. Use when designing or reviewing any code path labeled "emergency", "fallback", "recovery", or "graceful degradation".
The skill exists because the wrong fallback can defeat the entire design. Review finding B4 ("emergency-lock copies VM disk unencrypted") is the canonical example: the fallback intended to preserve availability also preserved the assets in plaintext, defeating the encryption that was the entire point of the system.
The CIA triad (Confidentiality, Integrity, Availability) ranks differently for different systems:
| System type | Priority | Failure default |
|---|---|---|
| Security product (encryption, auth) | C > I > A | Fail closed: no operation rather than insecure operation |
| Safety system (medical, aviation) | A > I > C | Fail safe: known-good degraded behavior |
| Financial system | I > C > A | Fail closed for writes, log everything |
| Communication infrastructure | A > I > C | Fail open: best effort, alert on issues |
For a security product, the default behavior on any failure is: stop, preserve confidentiality and integrity, alert the operator. Preserving availability of the secret operation when its protection is failing is the wrong trade.
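The fail-closed default can be sketched as a small guard around the secret operation. This is a minimal illustration, not the skill's own implementation: `guarded_op` and `alert_operator` are hypothetical names, and the check and operation commands are supplied by the caller.

```shell
# Sketch only: guarded_op and alert_operator are hypothetical helpers.

# Stand-in for whatever alerting channel the deployment actually uses.
alert_operator() {
    printf '[ALERT] %s\n' "$1" >&2
}

# Run a precondition check, then the secret operation, failing closed.
# Usage: guarded_op <check-command> <operation-command> [args...]
guarded_op() {
    local check="$1" op="$2"
    shift 2
    if ! "$check"; then
        # Fail closed: no fallback and no plaintext path, just stop and alert.
        alert_operator "precondition '$check' failed; refusing '$op'"
        return 1
    fi
    "$op" "$@"
}
```

The guard has no else-branch that performs a weaker version of the operation. The only outcomes are the protected operation or a refusal with an alert.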
This is the rule review finding B4 violated: when both YubiKeys were missing, the emergency-lock script copied the VM disk unencrypted to "preserve the session." The premise, that losing the session is worse than brief unencrypted exposure, is wrong for a security-first product. The whole point of the dual-YubiKey design was that no keys means no session.
# WRONG (review B4)
if both_keys_missing; then
cp vm.disk.encrypted /backup/vm.disk.plaintext # preserve session at cost of confidentiality
fi
# WRONG
echo "[!] Manifest verification failed."
read -p "Type Y to continue anyway: " ans
[ "$ans" = "Y" ] && exec ./unlock.sh
Operators learn to answer Y reflexively. The override defeats integrity exactly when integrity matters most: when something has changed.
# WRONG
if encryption_module_unavailable; then
log "encryption disabled, continuing"
write_plaintext "$@"
fi
Operators forget which mode they were in. Two minutes later, they expect ciphertext on disk, but it's plaintext.
If the operational path uses keys A+B and the recovery path uses keys A+B+C, then the recovery path is a strict superset of the operational path. An attacker with A+B has both. Recovery must use a separate key set (e.g., HQ escrow key, not held by the operator).
A system that needs both:
…must design two distinct paths that do not share key material:
Operational path:
factors {A, B, C} → unlock(...) → secret operation
Escrow path:
HQ-only key K_escrow → encrypt-to-K_escrow at provisioning
…
Recovery: HQ + K_escrow → re-derive operational state
Properties:
This is the remediation pattern for review finding B4.
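The requirement that the two paths share no key material can be checked mechanically at design-review time. The sketch below assumes each path is described by a space-separated list of key identifiers. `key_sets_disjoint` is a hypothetical helper, and the identifiers are labels only, never key bytes.

```shell
# Sketch only: key_sets_disjoint is a hypothetical review-time helper.
# Usage: key_sets_disjoint "<operational ids>" "<recovery ids>"
# Returns 0 when the two lists share no identifier, 1 otherwise.
key_sets_disjoint() {
    local operational="$1" recovery="$2" op_key rec_key
    # Word splitting on the unquoted lists is intentional here.
    for op_key in $operational; do
        for rec_key in $recovery; do
            if [ "$op_key" = "$rec_key" ]; then
                printf 'shared key material: %s\n' "$op_key" >&2
                return 1
            fi
        done
    done
    return 0
}
```

For example, `key_sets_disjoint "A B C" "K_escrow"` succeeds, while `key_sets_disjoint "A B" "A B C"` fails and names the first shared identifier.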
Sometimes manual override is genuinely necessary (system migration, key rotation, legitimate emergency). When it is:
[!] Integrity check failed. Type Y to override:
A real override requires:
read -r -p "Override SHA-256 prefix (first 8 hex chars of failing file): " prefix
expected=$(sha256sum failing-file | cut -c1-8)
if [ "$prefix" != "$expected" ]; then
log "Override aborted: prefix mismatch"
exit 1
fi
log_external "Override initiated by ${USER} at $(date -Is)"
sleep 30
echo "Override active. Type CONFIRM or anything else to abort:"
read -r confirm
[ "$confirm" = "CONFIRM" ] || { log_external "Override aborted by user"; exit 1; }
Better: remove the override entirely and require returning to HQ. If you can't, document why and accept the residual risk explicitly.
Before declaring a security design complete, fill out:
| Trigger | What the system does | Cleanup of in-memory secrets | Operator notification | Recovery path |
|---|---|---|---|---|
| One factor lost | refuse operation; preserve at-rest state | clear RAM via tmpfs unmount | display + log | bring missing factor back |
| All factors lost | refuse operation; shred any in-memory plaintext | clear RAM | display + log + HQ alert | escrow path (separate keys) |
| Network lost mid-operation | complete current op offline if safe; defer next | n/a | display | wait for network |
| Hardware failure (TPM/HSM) | refuse operation | clear RAM | display + HQ alert | replace hardware; re-provision |
| Verification failure (signature, manifest) | refuse operation | clear RAM | display + HQ alert | investigate; do not override |
| Time skew detected (HMAC TOTP) | small skew: accept; large skew: refuse | n/a | display | sync clock; investigate |
Every row must have an entry in every column. "TBD" means the design is incomplete.
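The completeness rule lends itself to an automated check. The following is a rough sketch that assumes the matrix is stored as a pipe-delimited markdown table with leading and trailing pipes on every row. `check_matrix` is a hypothetical name, not a command shipped with the skill.

```shell
# Sketch only: check_matrix is a hypothetical helper.
# Usage: check_matrix <markdown-file>
# Returns 1 if any table cell is empty or "TBD", 0 otherwise.
check_matrix() {
    awk -F'|' '
        /^[ \t]*\|/ {
            # Skip the header separator row (|---|---|...).
            if ($0 ~ /^[ \t|:-]+$/) next
            # Fields 2..NF-1 are the cells; fields 1 and NF lie outside the outer pipes.
            for (i = 2; i < NF; i++) {
                cell = $i
                gsub(/^[ \t]+|[ \t]+$/, "", cell)
                if (cell == "" || toupper(cell) == "TBD") {
                    printf "line %d, column %d: incomplete cell\n", NR, i - 1
                    bad = 1
                }
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

An explicit `n/a` counts as an entry, matching the example matrix; only blank cells and `TBD` are reported.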
When a degraded mode triggers, the system MUST clean up secrets it had in flight:
mlock+memset_explicit or mlock+exit (libsodium sodium_memzero); avoid signal races
See secret-handling-runtime skill for the full pattern; this skill just enforces that degraded-mode triggers invoke that cleanup.
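At the shell level, one way to make sure a trigger cannot skip cleanup is to route every degraded-mode entry through a single function. This is a sketch under stated assumptions: `SECRET_DIR` (a scratch directory, ideally on tmpfs), `SESSION_KEY`, `cleanup_secrets`, and `enter_degraded_mode` are all hypothetical names. A shell script cannot zero process memory, so this covers only files and shell variables; in-process secrets still need the C-level pattern named above.

```shell
# Sketch only: all names below are hypothetical.

# Remove secrets the script itself can reach: shell variables and scratch files.
cleanup_secrets() {
    # Unset first so a later failure cannot leave the variable populated.
    unset SESSION_KEY
    if [ -n "${SECRET_DIR:-}" ] && [ -d "$SECRET_DIR" ]; then
        rm -rf -- "$SECRET_DIR"
    fi
}

# Single entry point for every degraded-mode trigger: clean up, notify, refuse.
enter_degraded_mode() {
    local reason="$1"
    cleanup_secrets
    printf '[!] degraded mode: %s; operation refused\n' "$reason" >&2
    return 1
}
```

Registering the same function with `trap cleanup_secrets EXIT INT TERM` would also cover exits that do not pass through `enter_degraded_mode`.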
Original:
When both YubiKeys are gone:
→ 31-usb-emergency-lock.sh copies VM disk UNENCRYPTED to LUKS volume
What this skill flags:
Remediation:
Original:
[!] Boot-time integrity check failed. Type Y to override:
What this skill flags:
Remediation:
When this skill is invoked as part of a review, produce findings in the standard format. When authoring a degraded-mode-matrix.md (template), drive the matrix and override-ceremony sections.
### Finding: <SHORT-NAME>
**Severity**: BLOCK | HIGH | MEDIUM | LOW
**Section**: degraded-mode-design <section reference>
**Issue**: <what's broken>
**Remediation**: <concrete fix>
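When findings are generated by a script rather than written by hand, a small emitter keeps them in this format and rejects severities outside the four listed. `emit_finding` is a hypothetical helper, shown only to illustrate the template.

```shell
# Sketch only: emit_finding is a hypothetical helper.
# Usage: emit_finding <short-name> <severity> <section> <issue> <remediation>
emit_finding() {
    local name="$1" severity="$2" section="$3" issue="$4" remediation="$5"
    # Only the four severities from the template are accepted.
    case "$severity" in
        BLOCK|HIGH|MEDIUM|LOW) ;;
        *) printf 'invalid severity: %s\n' "$severity" >&2; return 1 ;;
    esac
    printf '### Finding: %s\n' "$name"
    printf '**Severity**: %s\n' "$severity"
    printf '**Section**: degraded-mode-design %s\n' "$section"
    printf '**Issue**: %s\n' "$issue"
    printf '**Remediation**: %s\n' "$remediation"
}
```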
- degraded-mode-matrix.md
- secret-handling-runtime (cleanup hygiene during degraded modes)
- auth-factor-design (which factors trigger which degraded modes)
- chain-of-trust-design (the override-prompt anti-pattern is shared)