.claude/skills/review-test-reliability/SKILL.md
Audit AHK test suite for flaky patterns — blind sleeps, instant asserts, stale signals
npx skillsauth add cwilliams5/Alt-Tabby review-test-reliabilityInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Enter planning mode. Audit the AHK test suite for reliability anti-patterns that cause intermittent failures. Use maximum parallelism — spawn explore agents for independent test files.
Scope: All test code in tests/ — unit tests, GUI tests, live tests, pump/watcher tests, test utilities. NOT static analysis checks, NOT production code. This is about test reliability, not coverage (review-test-coverage), quality (review-test-quality), or speed (review-test-speed).
An external operation (process kill, schtasks command, file write, IPC message, pipe disconnect) followed by an immediate assertion without polling for the effect to land.
Detection: Look for sequences where ProcessClose, Run, FileAppend, DllCall("PostMessageW"...), _RunWithTimeout, or similar is followed by an assertion (AssertEq, if (!...), Log("FAIL:...")) with no poll loop between them.
Real example (fixed in this codebase):
; BAD — schtasks /delete returns before Task Scheduler DB updates
DeleteAdminTask(testTaskName)
if (!AdminTaskExists(testTaskName)) ; can still return true
; GOOD — poll for eventual consistency
deleteStart := A_TickCount
while ((A_TickCount - deleteStart) < 3000) {
if (!AdminTaskExists(testTaskName))
break
Sleep(100)
}
A Sleep(N) that waits for something to happen, but never checks whether it actually happened. The sleep duration is a guess — too short causes flakes under load, too long wastes time always.
Detection: Look for Sleep(N) where N > 50 that is followed by an assertion or state check. The sleep exists to "give time" for an async effect. If removing the sleep would make the assertion fail, it's a blind sleep that should be a poll.
Exceptions (NOT anti-patterns):
Sleep(20-50) inside an existing poll loop (throttling between checks)Sleep(50) between two file writes to prevent coalescing)Real example (fixed in this codebase):
; BAD — 500ms blind wait, hoping GUI initializes in time
Sleep(500)
FileAppend(content, blPath) ; file watcher may not be ready
; GOOD — poll for readiness signal, proceed immediately
while ((A_TickCount - settleStart) < 8000) {
if (FileExist(storeLogPath)) {
settleReady := true
break
}
Sleep(100)
}
ProcessClose(pid) followed by Sleep(N) instead of polling ProcessExist(pid). The process may not die within the sleep window, causing downstream failures (file locks, stale PIDs, port conflicts).
Detection: Search for ProcessClose followed by Sleep without a ProcessExist poll loop.
; BAD — process may still hold file handles
ProcessClose(pid)
Sleep(200)
; GOOD — confirm process is actually gone
ProcessClose(pid)
deadStart := A_TickCount
while (ProcessExist(pid) && (A_TickCount - deadStart) < 2000)
Sleep(50)
This is especially critical in cleanup helpers used by multiple tests (e.g., _Test_KillAllAltTabby, _Cleanup functions).
Polling for a condition that doesn't reliably indicate what the test actually needs. The signal can false-positive (stale data from prior runs) or false-negative (signal appears before the actual thing being waited for).
Detection: Look for polls that check:
Real example (found in this codebase):
; RISKY — stale store log from prior run could false-positive
if (FileExist(storeLogPath)) {
settleReady := true
; BETTER — check for content that proves THIS run's GUI is active
if (FileExist(storeLogPath)) {
try {
logContent := FileRead(storeLogPath)
if (InStr(logContent, "WEH_ProcessBatch"))
settleReady := true
}
}
Testing that something does NOT happen by sleeping and then asserting the absence. The sleep duration is a gamble — the event might arrive after the sleep expires.
Detection: Pattern is Sleep(N) → if (!flag) where the test expects the flag to remain false. Common in file watcher and timer callback tests.
; RISKY — 500ms may not be enough under system load
Sleep(500)
if (!callbackFired) {
Log("PASS: callback correctly did not fire")
}
; BETTER — longer poll confirms absence more convincingly
noFireStart := A_TickCount
while ((A_TickCount - noFireStart) < 2000) {
if (callbackFired)
break
Sleep(50)
}
if (!callbackFired) {
Log("PASS: callback correctly did not fire (waited 2s)")
}
For negative assertions, err on the side of waiting LONGER (not shorter) — false-pass is worse than slow test.
test_utils.ahk)Cleanup helpers and process management functions. Issues here multiply across every test that uses them.
test_live_*.ahk)Process launches, IPC, file watchers, subprocess management. These interact with the OS and are the most timing-sensitive.
test_unit_file_watcher.ahk)Debounce timing, negative assertions (callback should NOT fire), filesystem event propagation.
gui_tests*.ahk)Timer callbacks, state machine transitions, rendering pipeline interactions. Use query_timers.ps1 to inventory which timer callbacks tests should exercise.
test_unit_setup.ahk)Task Scheduler operations, shortcut creation, anything involving Windows system services.
Sleep(20-50) inside poll loops (throttling, not waiting)Sleep(50) in test setup before the test begins (filesystem settle before watcher init)Sleep(N) that IS the thing being tested (e.g., grace period behavior)After explore agents report back, validate every finding yourself. Reliability issues require precise analysis — a pattern that looks like a blind sleep may actually be intentional.
For each finding:
P0 — Shared helpers: Reliability issues in test_utils.ahk or cleanup functions that affect multiple test suites. Fix these first — one fix improves many tests.
P1 — Live test flakes: Blind sleeps or instant asserts in live tests that interact with external processes. These are the most likely to flake under CI load.
P2 — Unit test timing: Blind sleeps in unit tests (file watcher, debounce). Less likely to flake but still unreliable in principle.
P3 — Negative assertions: Blind sleeps proving something doesn't happen. Low flake probability (sleeping longer makes it safer) but violates the principle.
Section 1 — P0: Shared helpers
| File | Lines | Pattern | Blast Radius | Fix |
|------|-------|---------|-------------|-----|
| test_utils.ahk | 204 | ProcessClose + Sleep(200) | Used by all live tests | Poll ProcessExist with 2s timeout |
Section 2 — P1: Live test flakes
| File | Lines | Pattern | Race Condition | Fix |
|------|-------|---------|---------------|-----|
| test_live_foo.ahk | 120 | Instant assert after file write | File watcher not initialized | Poll for readiness signal |
Section 3 — P2: Unit test timing
| File | Lines | Pattern | Fix |
|------|-------|---------|-----|
| test_unit_foo.ahk | 88 | Sleep(600) for debounce settle | Poll for callback count |
Section 4 — P3: Negative assertions
| File | Lines | Current Wait | Proposed Wait | Condition |
|------|-------|-------------|--------------|-----------|
| test_unit_foo.ahk | 60 | Sleep(500) | Poll 2s | callbackFired stays false |
Order by severity, then blast radius within each level.
Ignore any existing plans — create a fresh one.
tools
Create a new git worktree and switch the session into it
tools
Spawn agent to trace code flow via query tools — answer only, no context cost
tools
Commit, push, and create a PR for the current branch
tools
Retire a shader by moving its files to legacy/shaders_retired