skills/black-box-red-testing/SKILL.md
Black-Box Red Testing — red tests that expose real bugs
npx skillsauth add liza-mas/liza black-box-red-testingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Write tests that catch, not tests that pass.
Produce tests that fail against real bugs. A good test is one that exposes a bug the codebase actually has (row 2 of the Test Truth Table in the Testing skill).
The agent needs bug hypotheses, not buggy code. The skill never modifies the codebase — it only writes tests.
If you have specific targets (changed files, low-coverage functions) or need classified findings, use the white-box-red-testing skill instead.
Target scope: A module, package, or file set to test adversarially.
If the target scope is large (many packages, broad module), narrow before starting. Adversarial testing is deep, not wide — pick the riskiest or most complex area first.
Before starting, assess the target:
| Parameter | Default | Description |
|-----------|---------|-------------|
| rounds | 1 | Number of rounds. Each round produces 10 tests. |
10 tests per round. Each round is a structured exploration budget. When running multiple rounds, each is a chance to pivot — if a direction isn't productive, refocus on another part of the code or another kind of bug.
Before hypothesizing, read the target's contracts:
These are inputs to sharper hypotheses, not classification evidence.
Form 10 distinct bug hypotheses for this round. A senior engineer knows the typical bugs — think:
Each hypothesis must be distinct (see Anti-Gaming below).
Hypothesis format (required):
Each hypothesis must be stated as a structured one-liner before writing the test:
[code_path] × [defect_class] → [observable_symptom]
Example output:
Hypotheses:
1. parse_date(empty_string) × missing_guard → uncaught ValueError
2. process_batch(>1000_items) × unbounded_loop → timeout
3. auth_token(expired) × stale_cache → false_positive_auth
4. save_record(unicode_name) × encoding_assumption → corrupted_output
5. concurrent_update(same_id) × race_condition → lost_write
...
This format forces precision. Vague hypotheses ("some edge case × something → failure") are immediately visible and must be sharpened before proceeding.
Write one test per hypothesis. Each test should:
Run all 10 tests against the existing codebase.
| Result | Action | |--------|--------| | Red | Keep — the test exposed something. | | Green | Discard — the code handled this correctly. | | Error (won't compile, bad setup) | Fix or discard — broken tests aren't adversarial, they're wrong. |
After triage, assess:
Pivot to a new angle for the next round, or continue deepening a productive one.
Pivot between rounds. If a round was unproductive, change angle — different area of the code, different class of bugs. Don't repeat the same approach.
Triage throttle. If a round produces 5+ red tests, pause for the coder to triage before spending the next round. A wall of red tests is noise, not signal.
An all-green round from high-quality hypotheses is a confidence signal, not waste. It means the code handles those cases correctly. This information is valuable — report it as such.
When specs are insufficient or absent, ambiguity becomes the attack surface.
A "compliant" red test that exploits a spec gap is a legitimate find. The agent is gaming for good — the adversarial move becomes a discovery tool that surfaces gaps the spec should have covered.
The skill does not classify whether a red test points to a code bug or a spec gap. The skill finds; the coder diagnoses. This separation mirrors TDD and the programmer/QA role distinction. A red test is a red test.
The contract's general anti-gaming clause applies. Additionally, within this skill:
Within a round, tests must represent distinct bug hypotheses. Ten variations of "nil pointer on empty input" is not ten hypotheses — it's one hypothesis tested ten ways. Vary the kind of bug, not just the input to the same bug.
Distinct means different in at least one of:
The structured hypothesis format makes gaming visible — if multiple lines share the same [code_path] × [defect_class] pattern with only input variations, they must be consolidated into one hypothesis.
Surviving red tests only. Green tests and broken tests are discarded.
Present each surviving red test with:
No hypothesis or classification is attached — the coder owns diagnosis.
If no red tests survive, report the confidence signal: state which areas were tested and what classes of bugs were hypothesized. The structured hypotheses from the round serve as documentation of verified absence.
The skill completes after the configured number of rounds. Early termination is acceptable if:
This skill operates within the Testing skill's constraints:
Tests written by this skill that survive (stay red) become inputs to the normal development workflow. The coder decides whether to fix the bug, update the spec, or accept the behavior.
development
Coordinate Pairing-mode doer/reviewer sessions through a Markdown blackboard. Use when the user invokes /adversarial-pairing with role and blackboard-path arguments or asks multiple pairing agents to coordinate plan review, implementation, staged code review, and follow-up review rounds without Liza multi-agent mode.
data-ai
Analyze Liza agents logs
development
Code Review Protocol
tools
Analyze Liza `.liza/agent-prompts/` and `.liza/agent-outputs/` from a context-engineering perspective: prompt payload shape, context budget use, cacheability, duplicated or missing context, instruction hierarchy, tool-output pressure, role-specific context fit, and prompt-output feedback loops. Use when diagnosing agent context bloat, prompt drift, poor agent handoffs, repeated misunderstandings, excessive tool output, or whether Liza agents received the right information at the right time.