agentic/code/addons/testing-quality/skills/flaky-detect/SKILL.md
Identify flaky tests from CI history and test execution patterns. Use when debugging intermittent test failures, auditing test reliability, or improving CI stability.
Skill access pattern (post-kernel-pivot, 2026.5+)
Skill names referenced in this document are AIWG skills, not slash commands. Most are not kernel-listed and cannot be invoked as `/skill-name` by the platform. Reach them via `aiwg discover "<capability>"` and `aiwg show skill <name>`.

Only kernel-listed skills (`aiwg-doctor`, `aiwg-refresh`, `aiwg-status`, `aiwg-help`, `use`, `steward`) are directly invokable as slash commands. See the skill-discovery rule.
Identify flaky tests (tests that pass and fail non-deterministically) by analyzing CI history, execution patterns, and test characteristics. Google research shows 4.56% of tests are flaky, costing millions in developer productivity.
| Finding | Source | Reference |
|---------|--------|-----------|
| 4.56% flaky rate | Google (2016) | Flaky Tests at Google |
| ML Classification | FlaKat (2024) | arXiv:2403.01003 - 85%+ accuracy |
| LLM Auto-repair | FlakyFix (2023) | arXiv:2307.00012 |
| Flaky Taxonomy | Luo et al. (2014) | "An Empirical Analysis of Flaky Tests" |
| Natural Language | Action |
|------------------|--------|
| "Find flaky tests" | Analyze CI history for flaky patterns |
| "Why does CI keep failing?" | Identify flaky tests causing failures |
| "Test suite is unreliable" | Full flaky test audit |
| "This test sometimes passes" | Analyze specific test for flakiness |
| "Audit test reliability" | Comprehensive flaky detection |
| "Quarantine flaky tests" | Identify and isolate flaky tests |
| Category | Percentage | Root Causes |
|----------|------------|-------------|
| Async/Timing | 45% | Race conditions, insufficient waits, timeouts |
| Test Order | 20% | Shared state, execution order dependencies |
| Environment | 15% | File system, network, configuration differences |
| Resource Limits | 10% | Memory, threads, connection pools |
| Non-deterministic | 10% | Random values, timestamps, UUIDs |
Parse GitHub Actions / CI logs to find inconsistent results:
```python
def analyze_ci_history(repo, days=30):
    """Analyze CI runs for flaky patterns"""
    # get_ci_runs is not defined here; it stands for whatever fetches
    # per-test results from the CI provider for the last `days` days.
    runs = get_ci_runs(repo, days)

    test_results = {}
    for run in runs:
        for test in run.tests:
            if test.name not in test_results:
                test_results[test.name] = {"pass": 0, "fail": 0}

            if test.passed:
                test_results[test.name]["pass"] += 1
            else:
                test_results[test.name]["fail"] += 1

    # Identify flaky tests (pass rate between 5% and 95%)
    flaky = []
    for test, results in test_results.items():
        total = results["pass"] + results["fail"]
        if total >= 5:  # Enough data
            pass_rate = results["pass"] / total
            if 0.05 < pass_rate < 0.95:
                flaky.append({
                    "test": test,
                    "pass_rate": pass_rate,
                    "total_runs": total
                })

    # Lowest pass rate (least reliable) first
    return sorted(flaky, key=lambda x: x["pass_rate"])
```
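As a minimal sketch of the pass-rate rule, the following self-contained version works on in-memory data instead of a CI API. The function name `find_flaky` and the input shape (a list of runs, each a dict mapping test name to a boolean) are assumptions made for illustration, not part of the skill.

```python
from collections import defaultdict


def find_flaky(runs, min_runs=5, low=0.05, high=0.95):
    """Return tests whose pass rate falls strictly between low and high."""
    counts = defaultdict(lambda: {"pass": 0, "fail": 0})
    for run in runs:
        for name, passed in run.items():
            counts[name]["pass" if passed else "fail"] += 1

    flaky = []
    for name, c in counts.items():
        total = c["pass"] + c["fail"]
        if total < min_runs:
            continue  # too few runs to judge
        rate = c["pass"] / total
        if low < rate < high:
            flaky.append({"test": name, "pass_rate": rate, "total_runs": total})
    return sorted(flaky, key=lambda x: x["pass_rate"])


# Hypothetical history: 10 runs of three tests.
# test_login alternates, test_stable always passes, test_broken always fails.
runs = [
    {"test_login": i % 2 == 0, "test_stable": True, "test_broken": False}
    for i in range(10)
]
result = find_flaky(runs)
print(result)
```

Only `test_login` is reported: a test that always fails is broken rather than flaky, so the 5% lower bound excludes it just as the 95% upper bound excludes the stable test.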
Scan test code for flaky patterns:
```python
import re
from pathlib import Path

FLAKY_PATTERNS = [
    # Timing issues
    (r'setTimeout|sleep|delay', "timing", "Uses explicit delays"),
    (r'Date\.now\(\)|new Date\(\)', "timing", "Uses current time"),

    # Async issues
    (r'\.then\([^)]*\)(?!.*await)', "async", "Promise without await"),
    # Coarse heuristic: an `async` keyword with no `await` anywhere after it
    (r'\basync\b(?![\s\S]*\bawait\b)', "async", "Async without await"),

    # Non-deterministic values
    (r'Math\.random\(\)', "random", "Uses random values"),
    (r'uuid|nanoid', "random", "Uses generated IDs"),

    # Environment
    (r'process\.env', "environment", "Environment-dependent"),
    (r'fs\.(read|write)', "environment", "File system access"),
    (r'fetch\(|axios\.|http\.', "network", "Network calls"),
]

def scan_for_flaky_patterns(test_file):
    """Scan test file for flaky patterns"""
    content = Path(test_file).read_text()
    matches = []

    for pattern, category, description in FLAKY_PATTERNS:
        if re.search(pattern, content):
            matches.append({
                "category": category,
                "description": description,
                "pattern": pattern
            })

    return matches
```
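To show what a scan result looks like, here is a small sketch that applies a subset of the patterns to an inline snippet. Reporting per line rather than per file is a variation chosen for this example, and `scan_source` is a hypothetical helper name.

```python
import re

# Subset of the patterns, for illustration only
PATTERNS = [
    (r"setTimeout|sleep|delay", "timing", "Uses explicit delays"),
    (r"Date\.now\(\)|new Date\(\)", "timing", "Uses current time"),
    (r"Math\.random\(\)", "random", "Uses random values"),
    (r"fetch\(|axios\.|http\.", "network", "Network calls"),
]


def scan_source(source):
    """Return (line_number, category, description) for each matching line."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, category, description in PATTERNS:
            if re.search(pattern, line):
                findings.append((lineno, category, description))
    return findings


sample = """\
it('expires token', () => {
  const expiry = Date.now() + 3600000;
  const id = Math.random();
});
"""
findings = scan_source(sample)
for finding in findings:
    print(finding)
```

A match only means the test touches a known risk factor. Plain-text patterns like these will also flag code that handles time or randomness safely, so the findings are candidates for review rather than confirmed flaky tests.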
Run tests multiple times to detect flakiness:
```bash
# Run tests 10 times, track results
for i in {1..10}; do
  npm test -- --reporter=json >> test-results.jsonl
done

# Analyze for inconsistency
python analyze_reruns.py test-results.jsonl
```
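The analysis script itself is not shown here. As a rough sketch of what such a step could do, the code below reports tests that both passed and failed across runs. The record format is an assumption: one JSON object per line, with a `tests` list of `name` and `status` entries. Real reporter output differs between test runners and would likely need converting into this shape first.

```python
import json
from collections import defaultdict


def find_inconsistent(lines):
    """Return {test_name: (passes, fails)} for tests with mixed outcomes."""
    outcomes = defaultdict(lambda: [0, 0])  # [passes, fails]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        run = json.loads(line)
        for test in run["tests"]:
            index = 0 if test["status"] == "passed" else 1
            outcomes[test["name"]][index] += 1
    return {
        name: (p, f) for name, (p, f) in outcomes.items() if p > 0 and f > 0
    }


# Three hypothetical runs in the assumed format
sample_lines = [
    '{"tests": [{"name": "login", "status": "passed"}, {"name": "db", "status": "passed"}]}',
    '{"tests": [{"name": "login", "status": "failed"}, {"name": "db", "status": "passed"}]}',
    '{"tests": [{"name": "login", "status": "passed"}, {"name": "db", "status": "passed"}]}',
]
inconsistent = find_inconsistent(sample_lines)
print(inconsistent)
```

Because the code under test is identical in every rerun, any test with at least one pass and one fail is flaky by definition. No pass-rate threshold is needed in this mode.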
## Flaky Test Report
**Analysis Period**: Last 30 days
**Total Tests**: 450
**Flaky Tests Found**: 12 (2.7%)
### Critical Flaky Tests (< 50% pass rate)
#### 1. `test/api/login.test.ts:45`
**Pass Rate**: 42% (21/50 runs)
**Category**: Timing
**Pattern**: Uses `Date.now()` for token expiry
```typescript
// Flaky code
it('should expire token after 1 hour', () => {
  const token = createToken();
  const expiry = Date.now() + 3600000; // Flaky!
  expect(token.expiresAt).toBe(expiry);
});
```

**Root Cause**: The test sometimes creates the token and computes the expected expiry within the same millisecond, and sometimes in different milliseconds, so the exact-equality check passes only intermittently.

**Recommended Fix**: Use mocked time

```typescript
it('should expire token after 1 hour', () => {
  vi.setSystemTime(new Date('2024-01-01T00:00:00Z'));
  const token = createToken();
  expect(token.expiresAt).toBe(new Date('2024-01-01T01:00:00Z').getTime());
  vi.useRealTimers();
});
```
#### 2. `test/db/connection.test.ts:23`
**Pass Rate**: 68% (34/50 runs)
**Category**: Resource
**Pattern**: Connection pool exhaustion
[... more tests ...]
| Category | Count | Impact |
|----------|-------|--------|
| Timing | 5 | HIGH |
| Async | 3 | HIGH |
| Environment | 2 | MEDIUM |
| Order | 1 | MEDIUM |
| Network | 1 | LOW |
- `vi.setSystemTime()` (+0.5% stability)

These tests should be skipped in CI until fixed:
```typescript
// vitest.config.ts
export default {
  test: {
    exclude: [
      'test/api/login.test.ts',      // Timing flaky
      'test/db/connection.test.ts',  // Resource flaky
    ]
  }
}
```

**Note**: Track quarantined tests in `.aiwg/testing/flaky-quarantine.md`
## Quarantine Process
### 1. Identify
```bash
# Run flaky detection
python scripts/flaky_detect.py --ci-history 30 --threshold 95
```
```typescript
// Mark test as flaky
describe.skip('flaky: login expiry', () => {
  // FLAKY: https://github.com/org/repo/issues/123
  // Root cause: timing-dependent
  // Fix in progress: PR #456
});
```
Create tracking issue:
```markdown
## Flaky Test: test/api/login.test.ts:45

- **Pass Rate**: 42%
- **Category**: Timing
- **Root Cause**: Uses real system time
- **Quarantined**: 2024-12-12
- **Fix PR**: #456
- **Target Unquarantine**: 2024-12-15
```
After fix:
```bash
# Verify fix with multiple runs
for i in {1..20}; do npm test -- test/api/login.test.ts; done

# Remove from quarantine if all pass
```
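The repeated-run check can also be expressed so that "all pass" is an explicit count instead of something read off the terminal. The sketch below is one way to do that. `count_failures` is a hypothetical helper, and the stand-in command simply exits successfully so the example runs anywhere.

```python
import subprocess
import sys


def count_failures(command, runs=20):
    """Run command `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        completed = subprocess.run(command, capture_output=True)
        if completed.returncode != 0:
            failures += 1
    return failures


# Stand-in command that always succeeds. In practice this would be the real
# test command, e.g. ["npm", "test", "--", "test/api/login.test.ts"].
always_ok = [sys.executable, "-c", "raise SystemExit(0)"]
failures = count_failures(always_ok, runs=3)
print(f"{failures} of 3 runs failed")
```

A result of zero failures across all runs is the signal to remove the test from quarantine. Any non-zero count means the fix is incomplete.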
- `flaky-fix` skill for automated repairs
- `flow-gate-check` for release decisions
- `.aiwg/testing/flaky-registry.md`

Analyze CI history for flaky tests:

```bash
python scripts/flaky_detect.py --repo owner/repo --days 30
```

Scan code for flaky patterns:

```bash
python scripts/flaky_scanner.py --target test/
```