Name: qa-cli
Author: laststance

qa-cli — Black-box QA for CLI tools

Codex Compatibility

When running this skill in Codex, translate Claude Code-only primitives before acting: AskUserQuestion -> chat/request_user_input, TodoWrite -> update_plan, Task/TaskCreate/TeamCreate/SendMessage -> spawn_agent/send_input/wait_agent when available and allowed, and EnterPlanMode/ExitPlanMode -> a concise chat plan plus explicit approval. Resolve Read/Write/Edit/Bash/WebSearch/WebFetch to Codex file/shell/web tools, and map ~/.claude/... paths to ~/.agents/... or ~/.codex/... unless the task explicitly targets Claude Code.

Cursor Compatibility

When running this skill in Cursor Agent, translate Claude Code-only primitives before acting: AskUserQuestion -> AskQuestion; TodoWrite -> Cursor TodoWrite or an equivalent checklist; Task/TaskCreate/TeamCreate/SendMessage/multi-agent flows -> Cursor Task (subagents), parallel Tasks, or run_in_background when allowed (TeamCreate/SendMessage may have no exact match); EnterPlanMode/ExitPlanMode -> Plan mode (SwitchMode / CreatePlan) plus explicit user approval. Resolve Read/Write/Edit/StrReplace/Bash/web/search/MCP via Cursor Composer or Agent equivalents. MCP names written as mcp__server__tool typically map to call_mcp_tool with configured server identifiers. Map ~/.claude/... to ~/.cursor/skills/, .cursor/skills/, and .cursor/rules/ unless the task explicitly targets Claude Code.

You audit a CLI tool without touching its source, produce a single Markdown report of observed issues, and stop. The methodology ports the laststance /qa pattern to the CLI runtime: invoke the tool as a user would, capture stdout / stderr / exit code / side effects, diff across environments (TTY vs pipe, NO_COLOR, LANG=C), triage, file.

Ground rules

Report-only. Never modify the tool's source, config, install prefix, or any global shell config. If you create scratch files during testing (fixtures, temp configs), put them under /tmp/qa-cli-session/ and clean up in Phase 9.
Black-box. Treat the binary / script as opaque. You may read its --help output, man page, and any user-visible docs, but do not grep its source to decide what to test.
Every claim needs evidence. A stdout / stderr excerpt, an exit code, a printenv-pair, or a before/after file listing. Commands alone are not evidence — capture their output.
Stop when you have enough. Time-box to the scope the user asked for. A 30-minute audit with a focused report beats a two-hour exhaustion with no narrative.

When to use

User says "QA my CLI" / "audit this tool" / "test my binary" / "find bugs in this command"
User has a CLI in the working tree and wants a pre-release sweep
User provides a binary path and a scope ("just the sync subcommand")

When NOT to use

Interactive TUI (anything that enters an alt-screen, full-screen redraw, live keystroke handling): use qa-tui instead
Web app / Electron / RN / iOS native: use the matching qa-* skill
Source-code review: use /review or a review-oriented agent
Fixing bugs: this skill only reports. Hand off afterward.

Inputs you need from the user

Ask once, up front, via AskUserQuestion:

Tool under test: binary path or invocation (./my-cli, node cli.js, pnpm run cli --, bun run cli.ts, a global like gh or kubectl)
Scope: full tool / one subcommand / specific flags — if unspecified, default to "top-level flags + each listed subcommand at surface level"
Duration budget: 15 min / 30 min / 60 min / exhaustive
Sensitive side effects?: if the CLI writes files, mutates a DB, posts to APIs — ask whether there's a --dry-run or a safe test target. Never run destructive commands without explicit scope.

Record these answers in the report header.

Output contract

One file, at ./qa-reports/cli-<YYYY-MM-DD>-<tool>.md, matching templates/qa-report-template-cli.md. Evidence (stdout / stderr captures, session transcripts) lives alongside under ./qa-reports/artifacts/cli-<date>-<tool>/.

No other files. No fixes. No PRs.

Session layout

Set up once in Phase 0:

/tmp/qa-cli-session/
├── baseline/
│   ├── help.txt              # `<tool> --help` raw capture
│   ├── version.txt
│   └── env.txt               # `printenv | sort` for reproducibility
├── cases/
│   ├── case-001/
│   │   ├── cmd               # the exact command run
│   │   ├── stdout
│   │   ├── stderr
│   │   └── exit              # numeric exit code
│   └── case-002/…
├── fixtures/                 # any sample input files you create
└── notes.md                  # running log — becomes the report

The rule: one case = one directory. It makes diffing across environments mechanical.

Phase 0 — Baseline + instrumentation

0.1 Confirm the tool runs and identify it

mkdir -p /tmp/qa-cli-session/{baseline,cases,fixtures}
cd /tmp/qa-cli-session

# Record the exact invocation — parameterize TOOL once, reuse.
TOOL="<user-provided-invocation>"
echo "$TOOL" > baseline/invocation

# Basics
$TOOL --version        > baseline/version.txt   2>&1 ; echo "exit=$?" >> baseline/version.txt
$TOOL --help           > baseline/help.txt      2>&1 ; echo "exit=$?" >> baseline/help.txt
$TOOL -h               > baseline/help-short.txt 2>&1 ; echo "exit=$?" >> baseline/help-short.txt

Note every mismatch already:

Does --help exit 0? (Should. Exit-code bug if it doesn't.)
Do --help and -h produce the same output? (Should, or one should be documented absent.)
Does --version print only the version on stdout, or does it emit a header / banner / ASCII art? (Banners on stdout break $(tool --version) users.)

0.2 Record the environment

printenv | sort > baseline/env.txt
uname -a         > baseline/os.txt
echo "$SHELL"   >> baseline/os.txt
locale           > baseline/locale.txt 2>&1

These are the ground truth for every "works for me" debate in the report.

0.3 Surface hygiene check

From baseline/help.txt, scan for:

Stub strings: TODO, FIXME, XXX, <fill in>, Lorem
Broken references: URLs like example.com, yourdomain
Contradictions: flag shown as --foo in summary but --FOO in detail

Any hit is at least a medium content issue.

0.4 Check `man` and completions (if the tool claims to provide them)

man <tool>           2>/dev/null | head -50    > baseline/man.txt
<tool> completion bash 2>/dev/null | head -20 > baseline/completion-bash.txt
<tool> completion zsh  2>/dev/null | head -20 > baseline/completion-zsh.txt

A broken completion script is a high-severity functional bug (breaks the user's shell if sourced).

Phase 1 — Surface map

The goal: enumerate every subcommand, every top-level flag, and note which combinations the tool advertises as supported.

From baseline/help.txt, build:

## Surface map

### Top-level flags
- `--verbose` / `-v`
- `--config <path>` / `-c`
- `--format <json|plain>` / `-f`
- `--no-color`
- `--dry-run`
- `--help` / `-h`
- `--version`

### Subcommands
- `init`      — …
- `sync`      — …
- `status`    — …
- `config get|set|unset` — nested subcommands
- `help <cmd>`

### Positional patterns
- `<tool> <file>`                  (top-level)
- `<tool> sync <source> <dest>`    (sync subcommand)

For nested subcommands, recurse: <tool> config --help, <tool> config get --help, etc. Cap the depth at what the user asked for.

Parking lot issues spotted during mapping (record, don't deep- dive yet):

Help lists a flag but doesn't describe it
A subcommand's --help references a flag the top-level help doesn't document
Short and long forms that disagree (-f listed as --force somewhere, --file somewhere else)

Phase 2 — Golden paths per surface

For each subcommand, run the "happiest possible" invocation and capture all three streams.

CASE=/tmp/qa-cli-session/cases/case-001
mkdir -p "$CASE"
CMD=($TOOL sync ./fixtures/a.txt ./fixtures/b.txt)
printf '%q ' "${CMD[@]}" > "$CASE/cmd"
"${CMD[@]}" > "$CASE/stdout" 2> "$CASE/stderr"
echo $? > "$CASE/exit"

Check:

Exit 0 on success? (Anything else on the golden path is a high.)
Is stdout the data the user wants, or is it mixed with progress / status output? (Progress should go to stderr; stdout should be pipe-safe.)
Is stderr empty, or does it contain benign diagnostics?
Are any side-effect files / dirs created? Record a ls -la before and after.

Repeat for every subcommand within scope. One case per invocation.

Phase 3 — Errors & validation

The CLI's error surface tells you how well it respects its users.

3.1 Bad flags

$TOOL --nonexistent-flag           # Expect: exit!=0, useful stderr
$TOOL -Z                            # Expect: same
$TOOL --config                      # Missing value for flag expecting one
$TOOL --format banana               # Invalid enum value
$TOOL sync                          # Missing required positional
$TOOL sync a b c d                  # Too many positionals
$TOOL --verbose --quiet             # Conflicting flags (if both exist)

For each, capture stdout/stderr/exit code. Red flags:

Exit 0 on bad input → critical
Helpful error in stdout instead of stderr → high (breaks >/dev/null 2>errors.log pattern)
Error message that's a stack trace or includes internal file paths → high (unprofessional, leaks internals)
Error with no suggestion and no See '<tool> --help' hint → medium

3.2 Bad input

Create fixtures that exercise the boundary:

# Empty file
: > fixtures/empty.txt

# File that doesn't exist
$TOOL process fixtures/does-not-exist.txt

# File with no read permission (if testing as a non-root tool)
echo hi > fixtures/denied.txt && chmod 000 fixtures/denied.txt
$TOOL process fixtures/denied.txt ; chmod 644 fixtures/denied.txt

# Binary data where the tool expects text
head -c 4096 /dev/urandom > fixtures/binary.bin
$TOOL process fixtures/binary.bin

# Input that looks like a flag
echo 'hello' > fixtures/--foo
$TOOL process fixtures/--foo          # should require `--` separator or quote

# Very large input (streaming test — see Phase 4)

Each case's shape matters: a FileNotFoundError traceback is a bug; error: no such file: 'fixtures/does-not-exist.txt' on stderr with exit !=0 is correct.

3.3 Bad environment

HOME unset: env -u HOME $TOOL
PATH stripped: env PATH= $TOOL (does the tool shell out? Should fail gracefully)
LANG=C: env LANG=C $TOOL --help — any mojibake?

Phase 4 — Output: stdout vs stderr, formats, colors, TTY detection

This is the phase CLIs most often fail.

4.1 Stdout / stderr separation

For the golden-path commands, confirm:

$TOOL command > /dev/null          # you should still see status / progress on stderr
$TOOL command 2> /dev/null         # you should still see the data on stdout
$TOOL command > out.txt 2> err.txt # the split is clean

Rules:

Data goes to stdout. Anything a script would want to capture ($(tool foo)) lives here.
Progress, warnings, status, errors go to stderr. They should never appear in >out.txt.
If --verbose adds chatty output, it goes to stderr unless it IS the data.

4.2 Format flags

If the tool advertises --format json / --json:

$TOOL list --format json | jq .       # must parse
$TOOL list --format json > out.json ; wc -l out.json  # size sanity

A tool that says it outputs JSON but produces invalid JSON is critical. Common bugs:

A banner / progress line leaks into the JSON stream
UTF-8 BOM at the start
Trailing progress \rs
print() debug statement left in → malformed JSON

4.3 Color: TTY vs pipe vs NO_COLOR

Invoke three ways:

$TOOL list                 # attached TTY → color OK
$TOOL list | cat           # piped → color must disappear (isatty check)
NO_COLOR=1 $TOOL list      # NO_COLOR spec → color must disappear
$TOOL list --no-color      # explicit flag → color must disappear

Grep each capture for ANSI codes:

grep -cP $'\x1b\\[' cases/case-NNN/stdout

Color on pipe = high (breaks piping into grep, less, etc.)
NO_COLOR=1 ignored = high (spec violation, https://no-color.org)
--no-color present but non-functional = high

4.4 Verbosity levels

If --verbose / -v / -vv / --quiet exist:

$TOOL command           2> stderr-normal
$TOOL command --verbose 2> stderr-verbose
$TOOL command --quiet   2> stderr-quiet
wc -l stderr-*          # expect: quiet < normal < verbose

--quiet that still emits warnings, or --verbose that adds nothing, are medium bugs.

Phase 5 — I/O: stdin, files, paths, globs

5.1 Stdin

If the tool advertises stdin support (<tool> - or implicit):

echo 'hello' | $TOOL process -          # piped
$TOOL process - < fixtures/a.txt        # redirected
$TOOL process                           # no stdin and no args — should fail fast,
                                        # NOT hang waiting for tty input in a
                                        # non-interactive context (common bug)
$TOOL process </dev/null                # closed stdin

Hangs when stdin is closed = critical (breaks CI, cron, any piped use).

5.2 Paths and globs

$TOOL process ~/file.txt                # tilde expansion works? (shell-level, but
                                        # check the tool resolves correctly)
$TOOL process './relative/path.txt'
$TOOL process '/absolute/path.txt'
$TOOL process 'fixtures/*.txt'          # does the tool expand globs or rely on
                                        # the shell? Document which.
$TOOL process 'path with spaces.txt'    # quoting

5.3 Output to a file

$TOOL export --out result.txt          # --out flag creates / overwrites?
$TOOL export --out /no-permission.txt  # clear error, not a traceback
$TOOL export --out /tmp/a/b/c/d.txt    # does it mkdir -p or fail?

Document what it does; note surprising choices.

Phase 6 — State: environment, config files, precedence

6.1 Env vars

Grep --help for mentioned env vars (XYZ_CONFIG, XYZ_TOKEN). For each:

env -u XYZ_CONFIG $TOOL ...             # what happens without it?
XYZ_CONFIG=/non/existent $TOOL ...       # bad path?
XYZ_CONFIG='' $TOOL ...                  # empty string?

6.2 Config file

If the tool reads a config (~/.config/<tool>/config.yaml, ./.toolrc, etc.):

Create a valid one; confirm it's loaded
Create an invalid one (malformed YAML / JSON); expect clear error on stderr, non-zero exit, not a parser traceback
Create an empty one; expect defaults

6.3 Precedence

The standard order is flag > env > config > default. Test it explicitly:

# Set config to value A, env to B, flag to C; confirm C wins.
printf 'setting: A\n' > fixtures/cfg.yaml
SETTING=B $TOOL --config fixtures/cfg.yaml --setting C ...
# Repeat with flag omitted (env B wins?), flag+env omitted (config A wins?)

Wrong precedence is high — it's invisible to users until they're deep in a bug hunt.

Phase 7 — Signals & cleanup

Only applicable if the tool runs for any non-trivial duration.

7.1 Ctrl+C (SIGINT) during work

# Kick off a long operation, then SIGINT it
$TOOL sync --large-dataset &
PID=$!
sleep 2
kill -INT $PID
wait $PID
echo "exit=$?"

Check:

Does the tool exit promptly (<1s after SIGINT)?
Exit code is 130 (128 + SIGINT) by convention, or at least non-zero?
Does it print "cancelled" / "interrupted" on stderr?
Does it leave behind partial files, temp dirs, lockfiles? Do a ls /tmp/qa-cli-session and a ls -la in the target dir before and after.

7.2 SIGTERM

$TOOL sync --large-dataset &
kill -TERM $!
wait $!

Same checks. Tools that ignore SIGTERM and keep running are critical in orchestrated environments (Kubernetes, systemd).

7.3 Lockfile / pidfile hygiene

If the tool uses a lockfile (.tool.lock):

Kill -9 it, then retry. Does it detect a stale lock and recover, or does it refuse forever? Stale-lock-blocks-forever = high.

Phase 8 — Triage & report

8.1 Reconcile the notes

Open /tmp/qa-cli-session/notes.md. For each observation:

Assign a category from references/issue-taxonomy-cli.md
Assign a severity (critical / high / medium / low) using the taxonomy guidance
Confirm you have evidence (case directory number, excerpt ready to quote)
De-dupe: multiple surfaces with the same underlying issue → one issue with affected surfaces listed

8.2 Rank and pick the Top 3

The "Top 3 things to fix" section at the top of the report is the only thing a busy maintainer will read. Pick the three with the highest user impact — usually (a) anything that exits 0 on a real error, (b) anything that breaks piping, (c) anything that ignores NO_COLOR / --no-color.

8.3 Write the report

Use templates/qa-report-template-cli.md. Fill every field. If a section is empty, write "None observed" — don't delete it.

Store evidence excerpts inline (3–10 lines each with a fenced code block). If a full capture is >20 lines, put it in ./qa-reports/artifacts/<session>/ and link it.

8.4 Health score

Use the category weights in the template. Subtract:

20 per critical
8 per high
3 per medium
1 per low

Clamp to [0, 100]. Express as {score}/100. The number is a rough signal; the issue list is what matters.

Phase 9 — Session cleanup

# Remove scratch fixtures if they contain sensitive output
rm -rf /tmp/qa-cli-session

# If the tool left side effects (config files, lockfiles, temp dirs),
# note each one in the report's "Residue" section and either clean it
# up (if clearly a test artifact) or flag it for the user.

Confirm you did not create any files outside /tmp/qa-cli-session and ./qa-reports/ during the run.

References

references/issue-taxonomy-cli.md — categories and severity guidance for every kind of CLI bug
references/cli-driver-reference.md — shell patterns for capturing streams, measuring duration, exercising TTY detection
references/cli-conventions.md — POSIX / GNU / isatty / NO_COLOR / exit-code conventions to compare observed behavior against

Template

templates/qa-report-template-cli.md — the output format

qa-cli — Black-box QA for CLI tools

Codex Compatibility

Cursor Compatibility

Ground rules

Report-only. Never modify the tool's source, config, install prefix, or any global shell config. If you create scratch files during testing (fixtures, temp configs), put them under /tmp/qa-cli-session/ and clean up in Phase 9.
Black-box. Treat the binary / script as opaque. You may read its --help output, man page, and any user-visible docs, but do not grep its source to decide what to test.
Every claim needs evidence. A stdout / stderr excerpt, an exit code, a printenv-pair, or a before/after file listing. Commands alone are not evidence — capture their output.
Stop when you have enough. Time-box to the scope the user asked for. A 30-minute audit with a focused report beats a two-hour exhaustion with no narrative.

When to use

User says "QA my CLI" / "audit this tool" / "test my binary" / "find bugs in this command"
User has a CLI in the working tree and wants a pre-release sweep
User provides a binary path and a scope ("just the sync subcommand")

When NOT to use

Interactive TUI (anything that enters an alt-screen, full-screen redraw, live keystroke handling): use qa-tui instead
Web app / Electron / RN / iOS native: use the matching qa-* skill
Source-code review: use /review or a review-oriented agent
Fixing bugs: this skill only reports. Hand off afterward.

Inputs you need from the user

Ask once, up front, via AskUserQuestion:

Tool under test: binary path or invocation (./my-cli, node cli.js, pnpm run cli --, bun run cli.ts, a global like gh or kubectl)
Scope: full tool / one subcommand / specific flags — if unspecified, default to "top-level flags + each listed subcommand at surface level"
Duration budget: 15 min / 30 min / 60 min / exhaustive
Sensitive side effects?: if the CLI writes files, mutates a DB, posts to APIs — ask whether there's a --dry-run or a safe test target. Never run destructive commands without explicit scope.

Record these answers in the report header.

Output contract

No other files. No fixes. No PRs.

Session layout

Set up once in Phase 0:

/tmp/qa-cli-session/
├── baseline/
│   ├── help.txt              # `<tool> --help` raw capture
│   ├── version.txt
│   └── env.txt               # `printenv | sort` for reproducibility
├── cases/
│   ├── case-001/
│   │   ├── cmd               # the exact command run
│   │   ├── stdout
│   │   ├── stderr
│   │   └── exit              # numeric exit code
│   └── case-002/…
├── fixtures/                 # any sample input files you create
└── notes.md                  # running log — becomes the report

The rule: one case = one directory. It makes diffing across environments mechanical.

Phase 0 — Baseline + instrumentation

0.1 Confirm the tool runs and identify it

mkdir -p /tmp/qa-cli-session/{baseline,cases,fixtures}
cd /tmp/qa-cli-session

# Record the exact invocation — parameterize TOOL once, reuse.
TOOL="<user-provided-invocation>"
echo "$TOOL" > baseline/invocation

# Basics
$TOOL --version        > baseline/version.txt   2>&1 ; echo "exit=$?" >> baseline/version.txt
$TOOL --help           > baseline/help.txt      2>&1 ; echo "exit=$?" >> baseline/help.txt
$TOOL -h               > baseline/help-short.txt 2>&1 ; echo "exit=$?" >> baseline/help-short.txt

Note every mismatch already:

Does --help exit 0? (Should. Exit-code bug if it doesn't.)
Do --help and -h produce the same output? (Should, or one should be documented absent.)
Does --version print only the version on stdout, or does it emit a header / banner / ASCII art? (Banners on stdout break $(tool --version) users.)

0.2 Record the environment

printenv | sort > baseline/env.txt
uname -a         > baseline/os.txt
echo "$SHELL"   >> baseline/os.txt
locale           > baseline/locale.txt 2>&1

These are the ground truth for every "works for me" debate in the report.

0.3 Surface hygiene check

From baseline/help.txt, scan for:

Stub strings: TODO, FIXME, XXX, <fill in>, Lorem
Broken references: URLs like example.com, yourdomain
Contradictions: flag shown as --foo in summary but --FOO in detail

Any hit is at least a medium content issue.

0.4 Check `man` and completions (if the tool claims to provide them)

man <tool>           2>/dev/null | head -50    > baseline/man.txt
<tool> completion bash 2>/dev/null | head -20 > baseline/completion-bash.txt
<tool> completion zsh  2>/dev/null | head -20 > baseline/completion-zsh.txt

A broken completion script is a high-severity functional bug (breaks the user's shell if sourced).

Phase 1 — Surface map

The goal: enumerate every subcommand, every top-level flag, and note which combinations the tool advertises as supported.

From baseline/help.txt, build:

## Surface map

### Top-level flags
- `--verbose` / `-v`
- `--config <path>` / `-c`
- `--format <json|plain>` / `-f`
- `--no-color`
- `--dry-run`
- `--help` / `-h`
- `--version`

### Subcommands
- `init`      — …
- `sync`      — …
- `status`    — …
- `config get|set|unset` — nested subcommands
- `help <cmd>`

### Positional patterns
- `<tool> <file>`                  (top-level)
- `<tool> sync <source> <dest>`    (sync subcommand)

For nested subcommands, recurse: <tool> config --help, <tool> config get --help, etc. Cap the depth at what the user asked for.

Parking lot issues spotted during mapping (record, don't deep- dive yet):

Help lists a flag but doesn't describe it
A subcommand's --help references a flag the top-level help doesn't document
Short and long forms that disagree (-f listed as --force somewhere, --file somewhere else)

Phase 2 — Golden paths per surface

For each subcommand, run the "happiest possible" invocation and capture all three streams.

CASE=/tmp/qa-cli-session/cases/case-001
mkdir -p "$CASE"
CMD=($TOOL sync ./fixtures/a.txt ./fixtures/b.txt)
printf '%q ' "${CMD[@]}" > "$CASE/cmd"
"${CMD[@]}" > "$CASE/stdout" 2> "$CASE/stderr"
echo $? > "$CASE/exit"

Check:

Exit 0 on success? (Anything else on the golden path is a high.)
Is stdout the data the user wants, or is it mixed with progress / status output? (Progress should go to stderr; stdout should be pipe-safe.)
Is stderr empty, or does it contain benign diagnostics?
Are any side-effect files / dirs created? Record a ls -la before and after.

Repeat for every subcommand within scope. One case per invocation.

Phase 3 — Errors & validation

The CLI's error surface tells you how well it respects its users.

3.1 Bad flags

$TOOL --nonexistent-flag           # Expect: exit!=0, useful stderr
$TOOL -Z                            # Expect: same
$TOOL --config                      # Missing value for flag expecting one
$TOOL --format banana               # Invalid enum value
$TOOL sync                          # Missing required positional
$TOOL sync a b c d                  # Too many positionals
$TOOL --verbose --quiet             # Conflicting flags (if both exist)

For each, capture stdout/stderr/exit code. Red flags:

Exit 0 on bad input → critical
Helpful error in stdout instead of stderr → high (breaks >/dev/null 2>errors.log pattern)
Error message that's a stack trace or includes internal file paths → high (unprofessional, leaks internals)
Error with no suggestion and no See '<tool> --help' hint → medium

3.2 Bad input

Create fixtures that exercise the boundary:

# Empty file
: > fixtures/empty.txt

# File that doesn't exist
$TOOL process fixtures/does-not-exist.txt

# File with no read permission (if testing as a non-root tool)
echo hi > fixtures/denied.txt && chmod 000 fixtures/denied.txt
$TOOL process fixtures/denied.txt ; chmod 644 fixtures/denied.txt

# Binary data where the tool expects text
head -c 4096 /dev/urandom > fixtures/binary.bin
$TOOL process fixtures/binary.bin

# Input that looks like a flag
echo 'hello' > fixtures/--foo
$TOOL process fixtures/--foo          # should require `--` separator or quote

# Very large input (streaming test — see Phase 4)

Each case's shape matters: a FileNotFoundError traceback is a bug; error: no such file: 'fixtures/does-not-exist.txt' on stderr with exit !=0 is correct.

3.3 Bad environment

HOME unset: env -u HOME $TOOL
PATH stripped: env PATH= $TOOL (does the tool shell out? Should fail gracefully)
LANG=C: env LANG=C $TOOL --help — any mojibake?

Phase 4 — Output: stdout vs stderr, formats, colors, TTY detection

This is the phase CLIs most often fail.

4.1 Stdout / stderr separation

For the golden-path commands, confirm:

$TOOL command > /dev/null          # you should still see status / progress on stderr
$TOOL command 2> /dev/null         # you should still see the data on stdout
$TOOL command > out.txt 2> err.txt # the split is clean

Rules:

Data goes to stdout. Anything a script would want to capture ($(tool foo)) lives here.
Progress, warnings, status, errors go to stderr. They should never appear in >out.txt.
If --verbose adds chatty output, it goes to stderr unless it IS the data.

4.2 Format flags

If the tool advertises --format json / --json:

$TOOL list --format json | jq .       # must parse
$TOOL list --format json > out.json ; wc -l out.json  # size sanity

A tool that says it outputs JSON but produces invalid JSON is critical. Common bugs:

A banner / progress line leaks into the JSON stream
UTF-8 BOM at the start
Trailing progress \rs
print() debug statement left in → malformed JSON

4.3 Color: TTY vs pipe vs NO_COLOR

Invoke three ways:

$TOOL list                 # attached TTY → color OK
$TOOL list | cat           # piped → color must disappear (isatty check)
NO_COLOR=1 $TOOL list      # NO_COLOR spec → color must disappear
$TOOL list --no-color      # explicit flag → color must disappear

Grep each capture for ANSI codes:

grep -cP $'\x1b\\[' cases/case-NNN/stdout

Color on pipe = high (breaks piping into grep, less, etc.)
NO_COLOR=1 ignored = high (spec violation, https://no-color.org)
--no-color present but non-functional = high

4.4 Verbosity levels

If --verbose / -v / -vv / --quiet exist:

$TOOL command           2> stderr-normal
$TOOL command --verbose 2> stderr-verbose
$TOOL command --quiet   2> stderr-quiet
wc -l stderr-*          # expect: quiet < normal < verbose

--quiet that still emits warnings, or --verbose that adds nothing, are medium bugs.

Phase 5 — I/O: stdin, files, paths, globs

5.1 Stdin

If the tool advertises stdin support (<tool> - or implicit):

echo 'hello' | $TOOL process -          # piped
$TOOL process - < fixtures/a.txt        # redirected
$TOOL process                           # no stdin and no args — should fail fast,
                                        # NOT hang waiting for tty input in a
                                        # non-interactive context (common bug)
$TOOL process </dev/null                # closed stdin

Hangs when stdin is closed = critical (breaks CI, cron, any piped use).

5.2 Paths and globs

$TOOL process ~/file.txt                # tilde expansion works? (shell-level, but
                                        # check the tool resolves correctly)
$TOOL process './relative/path.txt'
$TOOL process '/absolute/path.txt'
$TOOL process 'fixtures/*.txt'          # does the tool expand globs or rely on
                                        # the shell? Document which.
$TOOL process 'path with spaces.txt'    # quoting

5.3 Output to a file

$TOOL export --out result.txt          # --out flag creates / overwrites?
$TOOL export --out /no-permission.txt  # clear error, not a traceback
$TOOL export --out /tmp/a/b/c/d.txt    # does it mkdir -p or fail?

Document what it does; note surprising choices.

Phase 6 — State: environment, config files, precedence

6.1 Env vars

Grep --help for mentioned env vars (XYZ_CONFIG, XYZ_TOKEN). For each:

env -u XYZ_CONFIG $TOOL ...             # what happens without it?
XYZ_CONFIG=/non/existent $TOOL ...       # bad path?
XYZ_CONFIG='' $TOOL ...                  # empty string?

6.2 Config file

If the tool reads a config (~/.config/<tool>/config.yaml, ./.toolrc, etc.):

Create a valid one; confirm it's loaded
Create an invalid one (malformed YAML / JSON); expect clear error on stderr, non-zero exit, not a parser traceback
Create an empty one; expect defaults

6.3 Precedence

The standard order is flag > env > config > default. Test it explicitly:

# Set config to value A, env to B, flag to C; confirm C wins.
printf 'setting: A\n' > fixtures/cfg.yaml
SETTING=B $TOOL --config fixtures/cfg.yaml --setting C ...
# Repeat with flag omitted (env B wins?), flag+env omitted (config A wins?)

Wrong precedence is high — it's invisible to users until they're deep in a bug hunt.

Phase 7 — Signals & cleanup

Only applicable if the tool runs for any non-trivial duration.

7.1 Ctrl+C (SIGINT) during work

# Kick off a long operation, then SIGINT it
$TOOL sync --large-dataset &
PID=$!
sleep 2
kill -INT $PID
wait $PID
echo "exit=$?"

Check:

Does the tool exit promptly (<1s after SIGINT)?
Exit code is 130 (128 + SIGINT) by convention, or at least non-zero?
Does it print "cancelled" / "interrupted" on stderr?
Does it leave behind partial files, temp dirs, lockfiles? Do a ls /tmp/qa-cli-session and a ls -la in the target dir before and after.

7.2 SIGTERM

$TOOL sync --large-dataset &
kill -TERM $!
wait $!

Same checks. Tools that ignore SIGTERM and keep running are critical in orchestrated environments (Kubernetes, systemd).

7.3 Lockfile / pidfile hygiene

If the tool uses a lockfile (.tool.lock):

Kill -9 it, then retry. Does it detect a stale lock and recover, or does it refuse forever? Stale-lock-blocks-forever = high.

Phase 8 — Triage & report

8.1 Reconcile the notes

Open /tmp/qa-cli-session/notes.md. For each observation:

Assign a category from references/issue-taxonomy-cli.md
Assign a severity (critical / high / medium / low) using the taxonomy guidance
Confirm you have evidence (case directory number, excerpt ready to quote)
De-dupe: multiple surfaces with the same underlying issue → one issue with affected surfaces listed

8.2 Rank and pick the Top 3

8.3 Write the report

Use templates/qa-report-template-cli.md. Fill every field. If a section is empty, write "None observed" — don't delete it.

Store evidence excerpts inline (3–10 lines each with a fenced code block). If a full capture is >20 lines, put it in ./qa-reports/artifacts/<session>/ and link it.

8.4 Health score

Use the category weights in the template. Subtract:

20 per critical
8 per high
3 per medium
1 per low

Clamp to [0, 100]. Express as {score}/100. The number is a rough signal; the issue list is what matters.

Phase 9 — Session cleanup

# Remove scratch fixtures if they contain sensitive output
rm -rf /tmp/qa-cli-session

# If the tool left side effects (config files, lockfiles, temp dirs),
# note each one in the report's "Residue" section and either clean it
# up (if clearly a test artifact) or flag it for the user.

Confirm you did not create any files outside /tmp/qa-cli-session and ./qa-reports/ during the run.

References

references/issue-taxonomy-cli.md — categories and severity guidance for every kind of CLI bug
references/cli-driver-reference.md — shell patterns for capturing streams, measuring duration, exercising TTY detection
references/cli-conventions.md — POSIX / GNU / isatty / NO_COLOR / exit-code conventions to compare observed behavior against

Template

templates/qa-report-template-cli.md — the output format

Adoption

laststance/qa-cli

$ install --global

Security Scan Results

SKILL.md

qa-cli — Black-box QA for CLI tools

Codex Compatibility

Cursor Compatibility

Ground rules

When to use

When NOT to use

Inputs you need from the user

Output contract

Session layout

Phase 0 — Baseline + instrumentation

0.1 Confirm the tool runs and identify it

0.2 Record the environment

0.3 Surface hygiene check

0.4 Check man and completions (if the tool claims to provide them)

Phase 1 — Surface map

Phase 2 — Golden paths per surface

Phase 3 — Errors & validation

3.1 Bad flags

3.2 Bad input

3.3 Bad environment

Phase 4 — Output: stdout vs stderr, formats, colors, TTY detection

4.1 Stdout / stderr separation

4.2 Format flags

4.3 Color: TTY vs pipe vs NO_COLOR

4.4 Verbosity levels

Phase 5 — I/O: stdin, files, paths, globs

5.1 Stdin

5.2 Paths and globs

5.3 Output to a file

Phase 6 — State: environment, config files, precedence

6.1 Env vars

6.2 Config file

6.3 Precedence

Phase 7 — Signals & cleanup

7.1 Ctrl+C (SIGINT) during work

7.2 SIGTERM

7.3 Lockfile / pidfile hygiene

Phase 8 — Triage & report

8.1 Reconcile the notes

8.2 Rank and pick the Top 3

8.3 Write the report

8.4 Health score

Phase 9 — Session cleanup

References

Template

Related Skills

laststance/source-grounded-research

laststance/i-write-code

laststance/core-topic

laststance/copy

laststance/qa-cli

$ install --global

Security Scan Results

SKILL.md

qa-cli — Black-box QA for CLI tools

Codex Compatibility

Cursor Compatibility

Ground rules

When to use

When NOT to use

Inputs you need from the user

Output contract

Session layout

Phase 0 — Baseline + instrumentation

0.1 Confirm the tool runs and identify it

0.2 Record the environment

0.3 Surface hygiene check

0.4 Check man and completions (if the tool claims to provide them)

Phase 1 — Surface map

Phase 2 — Golden paths per surface

Phase 3 — Errors & validation

3.1 Bad flags

3.2 Bad input

3.3 Bad environment

Phase 4 — Output: stdout vs stderr, formats, colors, TTY detection

0.4 Check `man` and completions (if the tool claims to provide them)

0.4 Check `man` and completions (if the tool claims to provide them)