- name: hypothesis-driven
- description: Investigation discipline for extremely complex problems where the root cause, exploit path, or solution is unknown: hard bugs, flaky systems, CTF challenges, reverse engineering puzzles, incident triage, multi-system failures, and research questions backed by data. Forces explicit hypothesis generation, falsifiable predictions, prioritized experiments, and evidence-based iteration instead of trial-and-error patching or confirmation-biased reasoning. Use when symptoms are far from causes, when guesses keep failing, or when a problem spans tools, layers, or unknowns that defeat linear debugging.
- license: MIT
- compatibility: AgentSkills-compatible investigation workflow for debugging, exploit research, CTF solving, incident response, reverse engineering, and data-backed technical research.
- author: AeonDave
- version: 1.0
Hypothesis-Driven Investigation
Treat the unknown like a scientist, not a gambler. Every guess is a hypothesis; every action is an experiment that should be able to disprove it.
The point is not to be right on the first try — it is to converge on the truth without burning hours on confirmation bias, sunk cost, or symptom-chasing.
When to activate
- A bug, crash, regression, or flaky behavior resists two or more direct fixes.
- A CTF, exploit, or reversing challenge has many possible paths and no obvious next step.
- An incident, outage, or unexpected production behavior has unclear scope or cause.
- Research, fuzzing, or scanner output points in multiple directions and must be triaged.
- The symptom is far from the likely cause: heap corruption, async timing, protocol state, ABI mismatch, cache coherence, cryptographic oracle, side channel, supply chain.
- The agent notices itself looping, repeating searches, or escalating tools without new evidence.
Do not activate for trivial fixes, single-line typos, or tasks where the user already gave a step-by-step procedure. Prefer the smallest competent workflow.
Core rules
- Hypotheses must be falsifiable. If no observation could disprove it, it is not a hypothesis — it is a belief.
- Reduce before deep-diving. Shrink noisy inputs, traces, repro steps, or artifact scope as far as possible while the failure still occurs.
- One experiment, one hypothesis. Test a single variable; otherwise you cannot tell which assumption was wrong.
- Write it down. Maintain an explicit log of hypotheses, predictions, evidence, and verdicts. The cost is small; the cost of testing the same idea twice is large.
- Evidence wins over preference. When data contradicts a favorite hypothesis, kill the hypothesis, not the data.
- Stop the loop, not the work. If three plausible fixes failed, the mental model is wrong. Re-examine assumptions before trying a fourth.
Loop guard
Pause and re-frame as soon as one of these is true: three distinct fix attempts failed, new evidence contradicts a load-bearing assumption, or the next experiment needs access, authorization, or destructive action beyond approved scope. The goal is disciplined progress, not heroic thrashing.
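The three pause conditions can be checked mechanically. A minimal sketch, assuming the agent tracks them in a hypothetical `InvestigationState` record (the class, field, and function names are illustrative and not part of this skill):

```python
from dataclasses import dataclass


@dataclass
class InvestigationState:
    """Hypothetical counters an agent might keep during an investigation."""
    failed_fix_attempts: int = 0
    assumption_contradicted: bool = False
    next_step_out_of_scope: bool = False


def loop_guard(state: InvestigationState) -> list:
    """Return the reasons to pause and re-frame; an empty list means continue."""
    reasons = []
    if state.failed_fix_attempts >= 3:
        reasons.append("three distinct fix attempts failed")
    if state.assumption_contradicted:
        reasons.append("evidence contradicts a load-bearing assumption")
    if state.next_step_out_of_scope:
        reasons.append("next experiment exceeds approved scope")
    return reasons
```

Calling `loop_guard` before each new experiment keeps the check explicit rather than relying on the agent noticing that it is thrashing.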
Workflow
1. Frame the problem
- Restate the symptom in one precise sentence: what is observed, where, when, under which conditions.
- List what is known, what is assumed, and what is unknown. Separate facts from inference.
- Preserve a reproducer or evidence baseline. If the input/artifact/log is large, reduce it first while keeping the same failure signal.
- Define a falsification target: "this hypothesis is wrong if I see X."
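Reducing a reproducer can often be automated. The sketch below is a simplified, greedy relative of delta debugging, not a full implementation of it. `reduce_input` and `still_fails` are hypothetical names; `still_fails` stands for whatever check reproduces the same failure signal. The result is a smaller failing input, not a guaranteed minimal one.

```python
def reduce_input(data, still_fails):
    """Greedily drop chunks of `data` while `still_fails(data)` stays True.

    `data` may be any sliceable, concatenable sequence (list, str, bytes).
    """
    if not still_fails(data):
        raise ValueError("baseline does not fail; nothing to reduce")

    chunk = max(len(data) // 2, 1)
    while chunk >= 1:
        i = 0
        while i < len(data):
            candidate = data[:i] + data[i + chunk:]
            if still_fails(candidate):
                data = candidate      # removal kept the failure: keep it, retry same offset
            else:
                i += chunk            # this chunk is needed: move on
        chunk //= 2                   # retry with finer granularity
    return data


# Toy failure signal: the bug needs both 3 and 7 to be present.
reduced = reduce_input(list(range(10)), lambda xs: 3 in xs and 7 in xs)
print(reduced)
```

Keep the original artifact alongside the reduced one: if a later experiment behaves differently on the two, the reduction removed something that mattered.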
2. Generate hypotheses (breadth before depth)
- Enumerate plausible causes before committing to one. Aim for 3-7 candidates, MECE-style (mutually exclusive, collectively exhaustive enough to cover the space).
- If there is only one candidate, assume anchoring until proven otherwise and generate alternatives. If there are more than ten, group them into higher-level branches before testing.
- Build a diagnostic "why" tree before jumping to a solution "how" tree; solution ideas are premature until the cause branch is supported.
- For each candidate, note the mechanism: how would this cause produce the observed symptom?
- Set aside any candidate that cannot be tested with the available access, time, or tooling, and mark it blocked rather than refuted.
3. Prioritize
- Score each hypothesis on three axes: prior likelihood, cost to test, and information gained if disproven.
- Test cheapest-and-most-informative first. A fast experiment that eliminates a whole branch beats a slow one that only refines a narrow guess.
- Prefer experiments that bisect the unknown space (binary search of the hypothesis tree).
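The three-axis scoring can be made concrete with a few lines of code. The formula below (prior times information gain, divided by cost) is one arbitrary choice made for illustration; the skill does not prescribe a formula, and the candidate names and numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    prior: float      # rough probability the hypothesis is true, 0..1
    cost: float       # rough effort to test, e.g. minutes (must be > 0)
    info_gain: float  # 0..1, share of the hypothesis space settled by the test


def priority(c: Candidate) -> float:
    """Illustrative score: likely, informative, cheap tests rank first."""
    return c.prior * c.info_gain / c.cost


candidates = [
    Candidate("race in cache invalidation", prior=0.5, cost=60, info_gain=0.4),
    Candidate("wrong config in staging", prior=0.2, cost=5, info_gain=0.5),
    Candidate("compiler bug", prior=0.02, cost=240, info_gain=0.9),
]

ranked = sorted(candidates, key=priority, reverse=True)
for c in ranked:
    print(f"{priority(c):.5f}  {c.name}")
```

Here the config check ranks first even though its prior is lower than the race hypothesis, because it is twelve times cheaper to test. The exact numbers matter less than writing the estimates down so they can be challenged.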
4. Design the experiment
- State the prediction: "if hypothesis H is true, then doing X will produce Y."
- State the falsifier: "if I see Z instead, H is wrong."
- Choose the smallest reliable observation: log, breakpoint, sanitizer, packet capture, syscall trace, oracle query, diff, replay, controlled input.
- Prefer probes that expose structure: assertions/contracts for invariants, dependency tracing or slicing for value origins, and hypothesis-tagged instrumentation for runtime facts.
- Add a negative control when possible: an input or condition where the hypothesis predicts no effect.
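One way to combine the prediction, the falsifier, and the negative control into a verdict is sketched below. The decision rule and the name `judge` are assumptions made for illustration; the point is that the rule is fixed before the experiment runs.

```python
def judge(predicted_seen: bool, falsifier_seen: bool, control_shows_effect: bool) -> str:
    """Map experiment observations to a verdict (illustrative rule only).

    predicted_seen:       the observation Y predicted by the hypothesis occurred
    falsifier_seen:       the observation Z that would disprove it occurred
    control_shows_effect: the effect also appeared on the negative control
    """
    if falsifier_seen and not predicted_seen:
        return "refuted"
    if predicted_seen and not falsifier_seen and not control_shows_effect:
        return "supported"
    # Mixed signals, no signal, or an effect that also shows up on the
    # control: the experiment did not discriminate between hypotheses.
    return "inconclusive"
```

For example, `judge(True, False, True)` returns `"inconclusive"`: the predicted effect appeared, but it also appeared where the hypothesis said it should not, so the observation does not single out this hypothesis.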
5. Run and record
- Execute the experiment exactly as designed. Resist the urge to modify it mid-run.
- Capture raw evidence: command, input, output, timestamp/context, environment, version. Tag temporary logs or probes with the hypothesis ID so evidence stays attributable.
- Record the verdict next to the hypothesis: supported, refuted, inconclusive, blocked.
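Capturing raw evidence is easy to script. A minimal sketch, assuming the experiment is a command-line invocation; `run_experiment` and the record layout are hypothetical, and the environment fields shown are only a starting point.

```python
import platform
import subprocess
import sys
from datetime import datetime, timezone
from typing import Sequence


def run_experiment(hypothesis_id: str, command: Sequence[str], stdin_text: str = "") -> dict:
    """Run `command` once and return a raw evidence record tagged with the hypothesis ID."""
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(list(command), input=stdin_text, capture_output=True, text=True)
    return {
        "hypothesis": hypothesis_id,
        "command": list(command),
        "stdin": stdin_text,
        "stdout": result.stdout,
        "stderr": result.stderr,
        "exit_code": result.returncode,
        "started_utc": started,
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "verdict": None,  # set by the investigator after comparing with the prediction
    }


record = run_experiment("H1", [sys.executable, "-c", "print('ok')"])
print(record["exit_code"], record["stdout"].strip())
```

The verdict is deliberately left empty: the script records what happened, and the comparison against the written prediction stays a separate, explicit step.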
6. Update beliefs (Bayesian, not stubborn)
- If refuted: cross it out and move on. Do not resurrect without new evidence.
- If supported: refine into a more specific sub-hypothesis. One supportive experiment is not proof.
- If inconclusive: ask why the experiment was weak before designing the next one. Often the falsifier was not sharp enough.
- Pause and re-frame whenever evidence contradicts a foundational assumption.
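The belief update in this step is Bayes' rule. A small sketch with made-up probabilities, purely for illustration; in practice the inputs are rough estimates, and the value of the exercise is seeing how little a weakly discriminating experiment moves the needle.

```python
def update(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """Posterior P(H | E) for one hypothesis H and one observation E."""
    numerator = p_evidence_if_true * prior
    denominator = numerator + p_evidence_if_false * (1.0 - prior)
    if denominator == 0:
        raise ValueError("evidence has zero probability under both H and not-H")
    return numerator / denominator


# Supportive but weak evidence: E is likely under H, yet not rare otherwise.
weak = update(prior=0.3, p_evidence_if_true=0.9, p_evidence_if_false=0.3)

# Sharp falsifier: H says E cannot happen, and E happened.
refuted = update(prior=0.3, p_evidence_if_true=0.0, p_evidence_if_false=0.5)

print(round(weak, 4), refuted)
```

With these numbers the supportive result lifts the hypothesis from 0.3 to only a little over one half, which is why one supportive experiment is not proof, while a sharp falsifier drives it to zero in a single step.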
7. Converge to a diagnosis
- A diagnosis is a hypothesis that (a) explains every prior observation and (b) predicts the next observation correctly.
- Before acting on it, write the causal chain end-to-end: defect → faulty state → mechanism → observed symptom. If a link is hand-waved, the diagnosis is incomplete.
- Show both causality and incorrectness: why this state caused the failure, and why the state itself is wrong rather than merely surprising.
8. Act, then verify
- Apply the smallest change that the diagnosis predicts will work.
- Re-run the original reproducer and at least one negative control.
- When safe, run a counter-experiment: revert the fix or reintroduce the condition and confirm the original symptom returns.
- If the fix works but the diagnosis was wrong, the symptom is likely to return; investigate further before claiming completion.
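The verify step, including the counter-experiment, can be sketched as a small harness. All four callables are hypothetical hooks supplied by the investigator, and the sketch assumes that reverting the fix is safe in the environment under test.

```python
from typing import Callable


def verify_fix(
    symptom_present: Callable[[], bool],          # runs the original reproducer
    control_symptom_present: Callable[[], bool],  # runs the negative control
    apply_fix: Callable[[], None],
    revert_fix: Callable[[], None],
) -> dict:
    """Apply the fix, check reproducer and control, then confirm the symptom returns on revert."""
    apply_fix()
    fixed = not symptom_present()
    control_ok = not control_symptom_present()

    revert_fix()                          # counter-experiment
    symptom_returns = symptom_present()

    apply_fix()                           # leave the system in the fixed state
    return {
        "fixed": fixed,
        "control_ok": control_ok,
        "symptom_returns": symptom_returns,
        "verified": fixed and control_ok and symptom_returns,
    }


# Toy system: the symptom is present exactly when the fix is not applied.
state = {"fixed": False}
result = verify_fix(
    symptom_present=lambda: not state["fixed"],
    control_symptom_present=lambda: False,
    apply_fix=lambda: state.update(fixed=True),
    revert_fix=lambda: state.update(fixed=False),
)
print(result)
```

If `symptom_returns` is false, the symptom disappeared for some reason other than the fix, which is the situation the last bullet above warns about.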
Hypothesis log format
Keep it short. Update in place; do not let it grow into prose.
H1: <one-line hypothesis>
Mechanism: <how it would cause the symptom>
Predicts: <observation expected if H1 is true>
Falsifier: <observation that would disprove H1>
Test: <command / experiment>
Evidence: <raw output, link, or summary>
Verdict: supported | refuted | inconclusive | blocked
A handful of these in a scratch note or session memory is enough to keep the investigation honest and avoid repeating dead ends.
Mini-example:
H2: The decoder output is corrupt because the loop reads one byte past the input.
Mechanism: index i reaches len(input), so the last output byte is derived from adjacent state.
Predicts: reducing the input by one byte moves the corrupt byte one position earlier.
Falsifier: corruption stays at the same logical position after input reduction.
Test: run decoder on N-byte and N-1-byte inputs; compare final bytes.
Evidence: N-1 run shifts corruption left by one byte.
Verdict: supported; refine to loop-bound hypothesis before patching.
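If the log lives in code rather than a scratch note, the same format maps onto a small record type. A minimal sketch; the `Hypothesis` class and its methods are hypothetical and only mirror the fields of the log format above.

```python
from dataclasses import dataclass

VERDICTS = ("supported", "refuted", "inconclusive", "blocked")


@dataclass
class Hypothesis:
    id: str
    statement: str
    mechanism: str
    predicts: str
    falsifier: str
    test: str
    evidence: str = ""
    verdict: str = ""  # empty until the experiment has been run

    def record(self, evidence: str, verdict: str) -> None:
        """Attach evidence and one of the four allowed verdicts."""
        if verdict not in VERDICTS:
            raise ValueError(f"verdict must be one of {VERDICTS}")
        self.evidence = evidence
        self.verdict = verdict

    def render(self) -> str:
        """Render the entry in the log format used by this skill."""
        return "\n".join([
            f"{self.id}: {self.statement}",
            f"Mechanism: {self.mechanism}",
            f"Predicts: {self.predicts}",
            f"Falsifier: {self.falsifier}",
            f"Test: {self.test}",
            f"Evidence: {self.evidence}",
            f"Verdict: {self.verdict}",
        ])


h2 = Hypothesis(
    id="H2",
    statement="The decoder output is corrupt because the loop reads one byte past the input.",
    mechanism="index i reaches len(input), so the last output byte is derived from adjacent state.",
    predicts="reducing the input by one byte moves the corrupt byte one position earlier.",
    falsifier="corruption stays at the same logical position after input reduction.",
    test="run decoder on N-byte and N-1-byte inputs; compare final bytes.",
)
h2.record("N-1 run shifts corruption left by one byte.", "supported")
print(h2.render())
```

Rejecting free-form verdicts is a design choice: it keeps entries comparable and forces follow-up ideas into a new hypothesis instead of a footnote.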
Common failure modes to resist
- Confirmation bias: designing experiments that can only confirm the favorite idea. Always include a falsifier.
- Anchoring: locking onto the first hypothesis because it was first, not because it is best. Generate the alternatives anyway.
- Symptom chasing: fixing where the program crashes instead of where the bad state originated. Trace backward.
- Sunk-cost continuation: refusing to drop a hypothesis because effort was already spent on it. Effort is not evidence.
- Tool authority: treating a scanner, decompiler, or LLM suggestion as a fact. It is a lead until reproduced.
- Compound changes: changing two things between tests. The next failure will be unattributable.
- Vague predictions: "something should change." If the prediction has no specific observable, the experiment is wasted.
- Silent assumptions: building on an unchecked premise (version, target, encoding, endianness, scope, identity). Promote the riskiest assumption to a hypothesis and test it early.
- Unreduced reproducers: investigating a full trace, exploit chain, PCAP, or corpus item when a smaller failing case would remove noise.
- Jumping the inference ladder: moving from selected data to action without writing the interpretation and assumptions in between.
- Solutioning too early: building a "how to fix" tree before the diagnostic "why" branch is supported.
Domain-specific accents
- Debugging: minimize the reproducer, isolate environment, then bisect the code/data path with the cheapest instrumentation. Add assertions around invariants and trace data/control dependencies when the bad value's origin is unclear. Pair with systematic-debugging and test-driven-development.
- CTF / exploit research: enumerate attack surfaces before deep-diving one. Use oracles and probes that return distinguishable outputs (yes/no, timing, length). Bias toward experiments that eliminate whole categories (auth vs injection vs deserialization vs logic).
- Reverse engineering: separate observations (what the binary does) from inferences (why). Confirm guessed semantics with a controlled input before generalizing.
- Incident response: enumerate possible contributors before naming a root cause; distinguish root cause, trigger, contributing factors, impact, and mitigation. Correlate timelines, deploys, configs, and dependencies. Action items should be owned, measurable, and aimed at prevention or faster detection, not blame.
- Research and data analysis: pre-register the prediction before running the query; otherwise the analysis silently fits the data to the favorite story.
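The bisect advice in the Debugging accent can be sketched as a binary search over an ordered history (commits, builds, config revisions, input prefixes). The sketch assumes the history is monotonic: every version before the defect is good and every version from the defect onward is bad. `first_bad` and `is_bad` are hypothetical names.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(versions: Sequence[T], is_bad: Callable[[T], bool]) -> int:
    """Return the index of the first bad version in a good-then-bad history."""
    if not versions:
        raise ValueError("nothing to bisect")
    if is_bad(versions[0]):
        return 0
    lo, hi = 0, len(versions) - 1
    if not is_bad(versions[hi]):
        raise ValueError("newest version is good; the symptom is not in this range")
    # Invariant: versions[lo] is good, versions[hi] is bad.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return hi


calls = []

def probe(v: int) -> bool:
    calls.append(v)       # count how many experiments were needed
    return v >= 37        # toy defect introduced at version 37

print(first_bad(list(range(100)), probe), len(calls))
```

Each probe halves the remaining range, so the number of experiments grows with the logarithm of the history length. If the symptom is flaky, the monotonic assumption breaks and each probe needs repeated runs before it can be trusted.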
Stop conditions
- The diagnosis is complete: the causal chain is explicit and the symptom can be reproduced on demand from it.
- The next experiment requires access, authorization, or destructive action beyond approved scope — pause and escalate.
- Three plausible fixes failed: stop patching, re-open assumptions.
- Evidence contradicts a load-bearing assumption: re-frame before generating more hypotheses on a broken foundation.
Output contract
When reporting investigation results, include:
- Symptom: the precise observable being explained.
- Diagnosis: the supported causal chain, or "no diagnosis yet" with current best hypotheses.
- Evidence: experiments that confirmed it and at least one that ruled out a credible alternative.
- Limits: what was not tested, what assumptions remain.
- Next step: smallest action to either act on the diagnosis or sharpen it.
Resources
Load on demand:
- references/hypothesis-patterns.md: concrete hypothesis templates and falsifier examples per domain (debugging, CTF/exploit, reversing, incident, research).
- references/cognitive-biases.md: biases that derail hypothesis-driven work and counter-moves to apply during investigation.
- references/research-foundations.md: source-backed foundations for scientific debugging, reduction, slicing, SRE postmortems, issue trees, and agentic debugging patterns.
Pair with:
- systematic-debugging for code-level reproduction, instrumentation, and patching once the hypothesis narrows to a code defect. hypothesis-driven is the investigation discipline; systematic-debugging is the execution protocol for code bugs.
- evidence-before-claims before reporting a cause, vulnerability, or fix as confirmed.
- verification-before-completion before claiming the investigation is done.
- design-before-implementation when the diagnosis triggers a non-trivial change.
- known-problem-hint-research when local hypotheses are exhausted and an external clue is needed.