skills/thinking-scientific-method/SKILL.md
Use when a symptom could have several causes and you must find the faulty code by ranking falsifiable hypotheses and checking the cheapest discriminating observation first.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
`npx skillsauth add tjboudreaux/cc-thinking-skills thinking-scientific-method`
The scientific method's payoff for an agent is not narrating "observe -> question." It is the differential: when a symptom could come from several places, enumerate competing falsifiable hypotheses and spend your cheapest observation on the one that best discriminates between them.
This version replaces the older, broader scientific-method skill. On SWE-bench fault localization, the original produced no measurable lift; this agent-native version produced the strongest measured debugging lift in the current eval set.
Core Principle: Don't guess-and-patch. Enumerate competing causes, then make the cheapest observation that would falsify the most likely one.
Symptom has several plausible causes?
- no -> test or fix the obvious cause directly
- yes -> can you make cheap observations now?
  - no -> gather access/evidence first
  - yes -> apply hypothesis-differential debugging
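The decision tree above can be read as a two-question guard. The following is a minimal sketch, assuming the two answers are already known; the function name `choose_approach` and its parameters are hypothetical and not part of the skill.

```python
def choose_approach(plausible_causes: int, cheap_observation_available: bool) -> str:
    """Return the leaf of the decision tree that matches the current situation."""
    if plausible_causes < 2:
        # A single obvious cause needs no differential.
        return "test or fix the obvious cause directly"
    if not cheap_observation_available:
        # Several causes, but nothing cheap to look at yet.
        return "gather access/evidence first"
    return "apply hypothesis-differential debugging"


print(choose_approach(plausible_causes=3, cheap_observation_available=True))
```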
List 3-5 specific, falsifiable hypotheses. Name the likely file, function, subsystem, input condition, or invariant. Avoid vague buckets like "backend issue" or "race condition somewhere."
| # | Hypothesis | Why plausible? |
|---|------------|----------------|
| H1 | `auth/session.py:refresh` drops rotated tokens | failures start after token rotation |
| H2 | cache TTL mismatch in `session_cache` | stale sessions persist across deploys |
| H3 | frontend retries reuse expired cookie | only browser flow is affected |
If you can only think of one hypothesis, you are guessing. Force alternatives before inspecting deeper.
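One way to enforce the two rules above (several alternatives, each tied to a concrete location) is a small record type plus a check. This is a sketch under stated assumptions: the `Hypothesis` class, the `differential_problems` helper, and the minimum of three are illustrative choices, and the location string given for H3 is a placeholder, since the table names no file for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    label: str          # e.g. "H1"
    claim: str          # what is believed to be wrong
    location: str       # file, function, subsystem, input condition, or invariant
    why_plausible: str  # the evidence that makes it worth listing


def differential_problems(hypotheses, minimum=3):
    """Return the reasons this hypothesis set is not yet a usable differential."""
    problems = []
    if len(hypotheses) < minimum:
        problems.append(f"only {len(hypotheses)} hypothesis(es); list at least {minimum}")
    for h in hypotheses:
        if not h.location.strip():
            problems.append(f"{h.label} names no file, function, subsystem, input, or invariant")
    return problems


HYPOTHESES = [
    Hypothesis("H1", "refresh drops rotated tokens", "auth/session.py:refresh",
               "failures start after token rotation"),
    Hypothesis("H2", "cache TTL mismatch", "session_cache",
               "stale sessions persist across deploys"),
    # Location below is a placeholder; the table does not name a file for H3.
    Hypothesis("H3", "frontend retries reuse expired cookie", "frontend retry path",
               "only browser flow is affected"),
]

print(differential_problems(HYPOTHESES))      # the three-row table passes both checks
print(differential_problems(HYPOTHESES[:1]))  # a lone hypothesis is flagged as guessing
```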
For each hypothesis, name one observation you can make now that would separate it from the others.
Good observations:
Bad observations:
For each hypothesis, write what result would make you drop it. This prevents confirmation search.
| Hypothesis | Falsified if... |
|------------|-----------------|
| H1 token refresh | refresh path never reads rotated token state |
| H2 cache TTL | cache entry expires before the observed stale window |
| H3 frontend retry | same failure occurs in API-only reproduction |
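Writing the drop conditions down first makes the update step mechanical. A minimal sketch, assuming each check yields a yes/no answer to "was the drop condition observed?"; the `update_status` helper and the status strings are hypothetical names, and the sample outcome is invented for illustration.

```python
FALSIFIED_IF = {
    "H1": "refresh path never reads rotated token state",
    "H2": "cache entry expires before the observed stale window",
    "H3": "same failure occurs in API-only reproduction",
}


def update_status(falsified_if, falsifier_seen):
    """Map each hypothesis to 'untested', 'dropped', or 'survives'.

    falsifier_seen[label] is True when the drop condition was observed and
    False when it was checked and not observed; a missing label means the
    check has not been run yet.
    """
    status = {}
    for label in falsified_if:
        if label not in falsifier_seen:
            status[label] = "untested"
        elif falsifier_seen[label]:
            status[label] = "dropped"
        else:
            status[label] = "survives"
    return status


# Invented outcome: the failure also reproduces with an API-only client.
print(update_status(FALSIFIED_IF, {"H3": True}))
```

Because the drop condition is fixed before the check runs, a surprising result cannot be quietly reinterpreted as support.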
Test the observation with the best expected information per unit of effort. Start with the cheapest observation that separates your top hypotheses, not the most elaborate investigation.
For each hypothesis → name falsifier → rank by likelihood x cheapness → observe → update/drop → localize fault
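The "likelihood x cheapness" ranking can be sketched as a sort key. This assumes cheapness is taken as the reciprocal of an effort estimate in minutes, which is one possible reading rather than something the skill prescribes; the `Candidate` class, `rank_observations`, and all numbers below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str
    likelihood: float    # rough prior in [0, 1]; a judgment call, not a measurement
    cost_minutes: float  # estimated effort of the cheapest discriminating observation

    @property
    def priority(self) -> float:
        # likelihood x cheapness, with cheapness assumed to be 1 / cost
        return self.likelihood / self.cost_minutes


def rank_observations(candidates):
    """Return hypothesis labels from highest to lowest priority."""
    ordered = sorted(candidates, key=lambda c: c.priority, reverse=True)
    return [c.label for c in ordered]


# Made-up numbers for illustration only.
candidates = [
    Candidate("H1", likelihood=0.5, cost_minutes=30),
    Candidate("H2", likelihood=0.3, cost_minutes=5),
    Candidate("H3", likelihood=0.2, cost_minutes=10),
]
print(rank_observations(candidates))
```

With these numbers the most likely hypothesis (H1) is checked last, because its observation is six times as expensive as the one for H2.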
Stop when one hypothesis is supported by direct evidence and the key alternatives are ruled out. Name the file/function/config to change and the evidence that localizes it.
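The stopping rule can be expressed as a predicate over the current state of the differential. The sketch below treats every listed alternative as a key alternative, which is stricter than the rule requires; `localized_fault`, the status strings, and the sample evidence note are hypothetical.

```python
def localized_fault(status, direct_evidence):
    """Return the hypothesis label to act on, or None if the search should continue.

    status maps label -> 'untested' | 'survives' | 'dropped'.
    direct_evidence maps label -> a short note on what was directly observed.
    """
    supported = [
        label for label in status
        if status[label] == "survives" and direct_evidence.get(label)
    ]
    if len(supported) != 1:
        return None  # nothing supported yet, or two hypotheses still compete
    winner = supported[0]
    others_ruled_out = all(
        status[label] == "dropped" for label in status if label != winner
    )
    return winner if others_ruled_out else None


status = {"H1": "survives", "H2": "dropped", "H3": "dropped"}
evidence = {"H1": "diff matches the stack trace"}  # invented note for illustration
print(localized_fault(status, evidence))
```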
## Symptom
[Specific failing behavior, scope, timing, and known constraints]
## Hypotheses
| # | Hypothesis | Why plausible? | Cheapest observation | Falsified if... |
|---|------------|----------------|----------------------|-----------------|
| H1 | [specific file/function/config cause] | [evidence] | [read/grep/diff/check] | [drop condition] |
| H2 | [specific alternate cause] | [evidence] | [read/grep/diff/check] | [drop condition] |
| H3 | [specific alternate cause] | [evidence] | [read/grep/diff/check] | [drop condition] |
## Test Order
1. [Cheapest discriminating observation]
2. [Next observation if H1 is falsified]
3. [Deferred only if cheap observations do not localize]
## Localization
[Supported hypothesis, ruled-out alternatives, and the file/function/config to change]
Symptom: intermittent 500s on /export, only eu-west, started three days ago.

Hypotheses:
1. recent diff to export serializer
   - Observation: inspect commits touching `export_serializer`
   - Falsified if: no diff touches the failing codepath
2. eu-west Redis rotation broke a cache key
   - Observation: read cache key construction + region config
   - Falsified if: key and TTL match healthy regions
3. upstream timeout under load
   - Observation: compare timeout logs during failure window
   - Falsified if: no upstream latency spike

Test order: H1, H2, H3.

Result: H1 diff changed nested export handling and matches stack trace.

Localized fault: `app/export/serializer.py`.
"The first principle is that you must not fool yourself - and you are the easiest person to fool." (Richard Feynman) Your intuition generates hypotheses; the differential tests them.