skills/can-we-classify-flaky/SKILL.md
Analyze test suites for flaky tests using LLM-based classification with context-augmented reasoning. Applies findings from Berndt et al. (2026) showing that test code alone is insufficient — the skill teaches Claude to gather surrounding project context (configs, dependencies, environment, production code) before classifying. Trigger phrases: 'find flaky tests', 'classify flaky tests', 'detect test flakiness', 'why is this test flaky', 'analyze test reliability', 'flaky test triage'
npx skillsauth add ndpvt-web/arxiv-claude-skills can-we-classify-flaky
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to analyze test code for flakiness — the property where a test yields inconsistent pass/fail results on the same code revision. Based on empirical findings from Berndt et al. (2026), which demonstrated that LLMs performing zero-shot, few-shot, or chain-of-thought classification on isolated test code perform only marginally better than random guessing, this skill implements a context-augmented approach: gathering project artifacts (build configs, production code under test, environment setup, CI configuration) before making a flakiness judgment. The paper's key insight is that flakiness signals live outside the test method itself, so this skill systematically retrieves that missing context.
Why test code alone fails. Berndt et al. evaluated GPT-4o, GPT-4o-mini, and CodeLlama across zero-shot, few-shot, and chain-of-thought prompting on two established datasets (IDoFT and FlakeFlagger). Even the best prompt-model combination achieved results only marginally above random chance (MCC near zero). A manual investigation of 50 samples confirmed the root cause: isolated test methods lack the contextual signals needed to determine flakiness. A Thread.sleep(1000) in a test might be harmless or catastrophic depending on what it waits for — information that only exists in production code, CI config, or runtime environment.
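For reference, MCC (Matthews correlation coefficient) is computed from the four cells of a binary confusion matrix. The sketch below uses made-up counts, not figures from the paper, to show why a classifier whose guesses are uncorrelated with the true labels scores zero.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Conventionally reported as 0 when any marginal total is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Illustrative counts only (not results from the paper).
coin_flip = mcc(tp=25, fp=25, fn=25, tn=25)  # guesses uncorrelated with labels
perfect = mcc(tp=50, fp=0, fn=0, tn=50)      # every test labelled correctly

print(coin_flip, perfect)
```

MCC ranges from -1 to 1, so a value near zero means the predictions carry almost no information about which tests are actually flaky.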
The six flakiness root-cause categories identified in flaky test literature — and confirmed as undetectable from test code alone — are: (1) async/timing dependencies (waits, sleeps, timeouts whose adequacy depends on external systems), (2) concurrency and shared state (tests that modify shared resources without isolation), (3) test-order dependencies (tests that assume execution order or prior state), (4) external service dependencies (network calls, databases, APIs that may be unavailable), (5) environment sensitivity (file paths, OS-specific behavior, timezone/locale), and (6) randomness and non-determinism (unseeded RNGs, hash iteration order). Detecting most of these requires seeing what the test interacts with, not just the test itself.
The context-augmented approach. Instead of classifying from test code alone (which the paper found ineffective), this skill implements what the paper recommends as future work: a retrieval-augmented strategy that gathers production code, configuration, and environment details before reasoning about flakiness. This turns classification from a code-only task into a project-aware analysis.
Extract the test method(s) under analysis. Read the full test file and isolate each test method, preserving class-level setup/teardown (@Before, @After, setUp, tearDown, beforeEach, etc.) and any shared fields or fixtures.
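For a Python test file, this extraction step could be sketched with the standard `ast` module. The hook names in `HOOK_NAMES` and the sample file are illustrative assumptions; Java or JavaScript suites would need a parser for that language.

```python
import ast

# Setup/teardown hook names to keep alongside the tests (assumed, not exhaustive).
HOOK_NAMES = {"setUp", "tearDown", "setUpClass", "tearDownClass",
              "setup_method", "teardown_method"}


def extract_tests(source: str) -> dict[str, list[str]]:
    """Return the test functions and setup/teardown hooks defined in a test file."""
    found = {"tests": [], "hooks": []}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.name.startswith("test"):
                found["tests"].append(node.name)
            elif node.name in HOOK_NAMES:
                found["hooks"].append(node.name)
    # Sort so the result does not depend on traversal order.
    return {kind: sorted(names) for kind, names in found.items()}


# Hypothetical test file used only for this example.
SAMPLE = '''
import unittest

class CacheTest(unittest.TestCase):
    def setUp(self):
        self.cache = {}

    def test_set(self):
        self.cache["k"] = 1

    def test_get_missing(self):
        assert "k" not in self.cache
'''

summary = extract_tests(SAMPLE)
print(summary)
```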
Identify the production code under test. Trace imports and method calls from the test to locate the actual classes/functions being tested. Read those source files to understand what the test exercises — this is the single most important context the paper found missing.
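One way to start tracing from a test to the code it exercises is to list the project modules the test imports. This is a minimal sketch for Python tests; `myapp` is a hypothetical package name, and following the actual method calls still requires reading the imported sources.

```python
import ast


def project_imports(test_source: str, package_prefix: str) -> set[str]:
    """Return imported module names that belong to the project package."""
    modules = set()
    for node in ast.walk(ast.parse(test_source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # node.module is None for bare relative imports such as "from . import x".
            modules.add(node.module)
    return {
        name for name in modules
        if name == package_prefix or name.startswith(package_prefix + ".")
    }


# Hypothetical test header; "myapp" is an assumed project package.
SAMPLE_TEST = """
import time
import pytest
from myapp.notifications import NotificationService
from myapp import repository
"""

under_test = project_imports(SAMPLE_TEST, "myapp")
print(sorted(under_test))
```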
Gather build and dependency configuration. Read pom.xml, build.gradle, package.json, requirements.txt, or equivalent. Check for test framework versions, parallelism settings (forkCount, parallel, --workers), and timeout configurations that affect test execution.
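As an illustration for Maven projects, the following sketch pulls the `forkCount` and `parallel` settings out of a `pom.xml` string. The sample POM is invented for the example, and other build tools would need their own readers.

```python
import xml.etree.ElementTree as ET

# Parallelism settings named in the step above.
PARALLELISM_KEYS = {"forkCount", "parallel"}


def surefire_parallelism(pom_xml: str) -> dict[str, str]:
    """Return parallelism-related settings found anywhere in a pom.xml string."""
    settings = {}
    for element in ET.fromstring(pom_xml).iter():
        # Strip the XML namespace prefix, e.g. "{http://...}forkCount".
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in PARALLELISM_KEYS and element.text:
            settings[local_name] = element.text.strip()
    return settings


# Invented POM fragment for demonstration purposes.
SAMPLE_POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <build><plugins><plugin>
    <artifactId>maven-surefire-plugin</artifactId>
    <configuration>
      <forkCount>4</forkCount>
    </configuration>
  </plugin></plugins></build>
</project>
"""

build_settings = surefire_parallelism(SAMPLE_POM)
print(build_settings)
```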
Check CI/environment configuration. Read .github/workflows/, Jenkinsfile, .gitlab-ci.yml, docker-compose.yml, or equivalent. Identify whether tests run in parallel, in containers, with specific environment variables, or against external services.
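A minimal sketch of the discovery part of this step, assuming the file names listed above and a repository checked out locally. It only reports which configuration files exist; interpreting them is left to whoever reads those files.

```python
import tempfile
from pathlib import Path

# File and directory names listed in the step above.
CI_CANDIDATES = [".github/workflows", "Jenkinsfile", ".gitlab-ci.yml", "docker-compose.yml"]


def find_ci_configs(repo_root: Path) -> list[str]:
    """Return the known CI/environment config paths that exist under repo_root."""
    found = []
    for candidate in CI_CANDIDATES:
        path = repo_root / candidate
        if path.is_dir():
            # Workflow directories hold one YAML file per pipeline.
            workflows = sorted(path.glob("*.yml")) + sorted(path.glob("*.yaml"))
            found.extend(p.relative_to(repo_root).as_posix() for p in workflows)
        elif path.is_file():
            found.append(candidate)
    return found


# Build a throwaway repository layout to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / ".github" / "workflows").mkdir(parents=True)
    (root / ".github" / "workflows" / "ci.yml").write_text("name: ci\n")
    (root / "Jenkinsfile").write_text("pipeline {}\n")
    discovered = find_ci_configs(root)

print(discovered)
```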
Scan for the six flakiness root-cause categories. For each test, systematically check:
- Async/timing: sleep, wait, setTimeout, Thread.sleep, polling loops, @Timeout annotations
- Randomness: Math.random(), Random without seed, UUID.randomUUID() in assertions, HashMap iteration

Cross-reference test signals with production context. A Thread.sleep(500) is only problematic if the operation it waits on can exceed 500 ms. A database call is only flaky if the test uses a real (not mocked) connection. This step resolves ambiguities that the paper showed are impossible to resolve from test code alone.
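The timing and randomness patterns above, plus the sleep-versus-latency comparison, can be approximated as follows. The regular expressions are rough heuristics, and `worst_case_latency_ms` is a hypothetical input that has to come from reading the production code or its configuration.

```python
import re

# Only the timing and randomness patterns are encoded here;
# the other categories would need their own expressions.
SIGNAL_PATTERNS = {
    "async/timing": re.compile(r"\b(?:Thread\.sleep|sleep|wait|setTimeout)\s*\(|@Timeout\b"),
    "randomness": re.compile(r"Math\.random\s*\(|new\s+Random\s*\(\s*\)|UUID\.randomUUID\s*\("),
}


def scan_signals(test_source: str) -> dict[str, list[str]]:
    """Map each category to the source lines that match one of its patterns."""
    hits = {}
    for line in test_source.splitlines():
        for category, pattern in SIGNAL_PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(category, []).append(line.strip())
    return hits


def sleep_is_risky(sleep_ms: int, worst_case_latency_ms: int) -> bool:
    """A fixed sleep is a risk only if the awaited operation can outlast it."""
    return worst_case_latency_ms > sleep_ms


# Invented Java fragment; the seeded Random on the last line should not be flagged.
SAMPLE = """
Thread.sleep(2000);
int pick = new Random().nextInt(10);
Random seeded = new Random(42);
"""

signals = scan_signals(SAMPLE)
print(signals)
```

A match only marks a line for closer inspection. Whether it matters is decided by the cross-reference, for example `sleep_is_risky(500, 800)` for an operation that production code allows to take up to 800 ms.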
Classify each test with confidence and rationale. Assign one of: LIKELY_FLAKY (clear root cause identified with context), POSSIBLY_FLAKY (suspicious patterns but mitigating factors present), or UNLIKELY_FLAKY (no detectable risk factors). Always state the specific root-cause category and the evidence from both test and production code.
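One possible way to record a classification is shown below. The `Finding` type and the rule that maps mitigating factors to POSSIBLY_FLAKY are assumptions made for this sketch; the three labels are the ones defined above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Label(Enum):
    LIKELY_FLAKY = "LIKELY_FLAKY"
    POSSIBLY_FLAKY = "POSSIBLY_FLAKY"
    UNLIKELY_FLAKY = "UNLIKELY_FLAKY"


@dataclass
class Finding:
    test_name: str
    root_cause: Optional[str]  # one of the six categories, or None if nothing was found
    evidence: list[str] = field(default_factory=list)     # from test and production code
    mitigations: list[str] = field(default_factory=list)  # factors that reduce the risk

    @property
    def label(self) -> Label:
        if self.root_cause is None:
            return Label.UNLIKELY_FLAKY
        # A suspicious pattern with a mitigating factor is downgraded to POSSIBLY_FLAKY.
        return Label.POSSIBLY_FLAKY if self.mitigations else Label.LIKELY_FLAKY


timing = Finding(
    test_name="testAsyncNotification",
    root_cause="async/timing",
    evidence=["Thread.sleep(2000) as a hard wait", "sendAsync() hits a real SMTP server"],
)
seeded = Finding(test_name="test_sorting_algorithm", root_cause=None)

print(timing.label.value, seeded.label.value)
```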
Recommend targeted fixes. For each flaky or possibly-flaky test, provide a concrete remediation: replace sleep with polling/await, mock external services, add test isolation, seed random generators, use @TempDir instead of hardcoded paths, etc.
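As an example of the first remediation, a fixed sleep can be replaced with a bounded polling loop. `wait_until` is a hypothetical helper written for this sketch, not part of pytest or any other library.

```python
import threading
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 10.0,
               interval: float = 0.05) -> bool:
    """Poll condition until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one last check at the deadline


# Simulate an async operation that completes after roughly 100 ms.
done = threading.Event()
threading.Timer(0.1, done.set).start()

# Returns as soon as the event is set instead of always sleeping the full timeout.
completed = wait_until(done.is_set, timeout=2.0)
print(completed)
```

The timeout becomes an upper bound rather than a fixed cost, so it can be generous without slowing down the passing case.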
Example 1: Timing-dependent test in a Java project
User: "This test fails intermittently in CI. Is it flaky?"
    @Test
    public void testAsyncNotification() throws InterruptedException {
        notificationService.sendAsync("[email protected]", "Hello");
        // Fixed two-second wait for the async send to complete
        Thread.sleep(2000);
        List<Notification> sent = notificationRepository.findAll();
        assertEquals(1, sent.size());
    }
Approach:
1. Read the test: notes Thread.sleep(2000) and the async call pattern
2. Read the NotificationService.sendAsync() source: discovers it uses a thread pool with configurable queue size and sends via SMTP
3. Check the CI configuration: finds -DforkCount=4 (parallel execution) on a shared CI runner
4. Read application-test.properties: finds it points to a real SMTP test server, not a mock

Output:
    Classification: LIKELY_FLAKY
    Root cause: Async/timing dependency
    Confidence: High
    Evidence:
      - Test code: Thread.sleep(2000) used as a hard wait for async operation
      - Production code: NotificationService.sendAsync() submits to thread pool, then
        sends SMTP email; total latency depends on SMTP server response time
      - CI config: 4 parallel forks compete for resources, increasing latency variance
      - No mock: Test hits a real SMTP server, adding network non-determinism
    Recommendation:
      Replace Thread.sleep with Awaitility polling:
        await().atMost(10, SECONDS)
            .untilAsserted(() ->
                assertEquals(1, notificationRepository.findAll().size()));
      Consider mocking the SMTP transport in unit tests.
Example 2: Shared-state test in a Python project
User: "Some tests pass alone but fail when the full suite runs. Help me find which ones are flaky."
Approach:
1. Read conftest.py fixtures: check scope (session, module, function) and whether fixtures mutate shared state
2. Read pytest.ini / pyproject.toml: check for --randomly-seed, parallel execution (-n auto), test ordering plugins

Output:
    Found 3 tests with order-dependency risk:

    1. test_user_creation (test_auth.py:45)
       Classification: LIKELY_FLAKY (order dependency)
       - Inserts into `users` table using module-scoped db fixture
       - No teardown/rollback; subsequent tests see leftover rows
       - Fix: Use function-scoped fixture with transaction rollback

    2. test_cache_expiry (test_cache.py:78)
       Classification: POSSIBLY_FLAKY (shared state)
       - Reads from module-level `_cache = {}` dict populated by test_cache_set
       - If test_cache_set runs first, passes; otherwise, KeyError
       - Fix: Initialize cache in each test or use a fixture

    3. test_config_override (test_settings.py:22)
       Classification: LIKELY_FLAKY (environment + order dependency)
       - Monkeypatches os.environ["APP_MODE"] but uses module scope
       - Other tests in same module inherit the patched env
       - Fix: Use function-scoped monkeypatch or `with mock.patch.dict()`
Example 3: Test that looks suspicious but is actually fine
User: "Is this test flaky? It uses random data."
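    import random

    def test_sorting_algorithm():
        random.seed(42)  # fixed seed makes the generated data reproducible
        data = [random.randint(0, 1000) for _ in range(100)]
        result = my_sort(data)  # my_sort is the project function under test
        assert result == sorted(data)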
Approach:
1. Read the test: notes random usage but also random.seed(42) (deterministic)
2. Read the my_sort() source: confirms it is a pure function with no side effects

Output:
    Classification: UNLIKELY_FLAKY
    Root cause: None detected
    Evidence:
      - Random is seeded with fixed seed (42), so output is deterministic
      - my_sort() is a pure function with no external dependencies
      - No shared state, no I/O, no timing sensitivity
    No action needed. This test is well-constructed.
- Not every sleep() or random() call indicates flakiness. Context determines whether these patterns are problematic. A seeded random or a sleep that vastly exceeds the operation's maximum latency may be perfectly safe.
- When a risky pattern comes with a mitigating factor (e.g., a sleep with a very generous timeout), classify as POSSIBLY_FLAKY and explain both the risk and the mitigation.
- Confirm suspected flakiness by rerunning the test (e.g., pytest --count=10, Maven Surefire rerunFailingTestsCount). LLM-based analysis is a triage tool, not a definitive oracle.

Berndt, A., Bekmyradov, V., Gemulla, R., Kessel, M., & Bach, T. (2026). Can We Classify Flaky Tests Using Only Test Code? An LLM-Based Empirical Study. arXiv:2602.05465v1. SANER-RENE 2025. https://arxiv.org/abs/2602.05465v1
Key takeaway: Test-code-only flakiness classification with LLMs (GPT-4o, GPT-4o-mini, CodeLlama) across zero-shot, few-shot, and chain-of-thought prompting yields results near random chance (MCC ~ 0). The critical missing ingredient is project context — production code, build configuration, CI setup, and runtime environment. Any practical flakiness classifier must retrieve this context before reasoning.