skills/qa-changes/SKILL.md
This skill should be used when the user asks to "QA a pull request", "test PR changes", "verify a PR works", "functionally test changes", or when an automated workflow triggers QA validation of code changes. Provides a structured methodology for setting up the environment, exercising changed behavior, and reporting results.
Install this skill globally with one command: `npx skillsauth add openhands/extensions qa-changes`. Works with Claude Code, Cursor, and Windsurf.
Validate pull request changes by actually running the code — not just reading it. The goal is to verify that new behavior works as the PR claims, existing behavior is not broken, and the repository remains healthy after the change.
The bar is high: test the way a thorough human QA engineer would. If the PR changes a web UI, spin up the server and verify it in a real browser. If it changes a CLI, run the CLI with real inputs. Do not settle for "the tests pass" — actually use the software.
QA proceeds in four phases. Complete each phase in order. If a phase fails, report the failure and stop.
Read the PR diff, title, and description. Identify the goal of this PR — this is the single most important thing to understand before proceeding. A PR might fix a bug, add a feature, refactor code, improve performance, update documentation, or something else entirely. Check:
Then classify every changed file:
For each change, identify the entry point — the concrete way a user would interact with it (CLI command, API endpoint, UI page, function call). This drives what to exercise in Phase 3.
Finally, form a clear hypothesis: "This PR should [achieve stated goal] by [approach taken in the diff]." Phase 3 will test that hypothesis.
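The Phase 1 output can be held in a small record so that Phase 3 has something concrete to test against. The following is a minimal sketch; the `EntryPoint` and `ChangePlan` names and their fields are hypothetical and not part of the skill itself.

```python
from dataclasses import dataclass, field


@dataclass
class EntryPoint:
    """One concrete way a user reaches a change (hypothetical structure)."""
    kind: str          # e.g. "cli", "api", "ui", "function"
    target: str        # e.g. a command, route, page, or function name
    changed_file: str  # the changed file this entry point exercises


@dataclass
class ChangePlan:
    """Phase 1 output: the PR's goal, its approach, and what to exercise."""
    goal: str
    approach: str
    entry_points: list[EntryPoint] = field(default_factory=list)

    def hypothesis(self) -> str:
        # Mirrors the sentence template from the text above.
        return f"This PR should {self.goal} by {self.approach}."

    def unmapped_files(self, changed_files: list[str]) -> list[str]:
        # Changed files with no entry point yet still need a way to be exercised.
        covered = {ep.changed_file for ep in self.entry_points}
        return [path for path in changed_files if path not in covered]
```

Calling `unmapped_files` with the PR's file list gives a quick check that every change has an identified entry point before moving on.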
Bootstrap the repository so the project builds and runs successfully.
- Look for setup instructions in `AGENTS.md`, `README.md`, `Makefile`, `package.json`, `pyproject.toml`, `Cargo.toml`, or equivalent. Always prefer the project's own documented setup commands.
- Install dependencies (`uv sync`, `npm install`, `pip install -r requirements.txt`, `bundle install`, `cargo build`, etc.).

If setup fails, report the failure with the exact error output and stop.
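As a rough illustration of the setup step, the sketch below first looks for files that may document the project's own setup, and only then falls back to a generic install command. The install commands are the ones named above; pairing each with a marker file (for example `pyproject.toml` with `uv sync`) is an assumption of this sketch, and a project's documented commands should always win.

```python
from pathlib import Path

# Marker file -> install command. The commands come from the list above;
# pairing each one with a marker file is an assumption of this sketch.
SETUP_COMMANDS = [
    ("pyproject.toml", ["uv", "sync"]),
    ("package.json", ["npm", "install"]),
    ("requirements.txt", ["pip", "install", "-r", "requirements.txt"]),
    ("Gemfile", ["bundle", "install"]),
    ("Cargo.toml", ["cargo", "build"]),
]

# Files that may document the project's own setup steps; read these first.
DOC_FILES = ["AGENTS.md", "README.md", "Makefile"]


def find_setup_docs(repo: Path) -> list[str]:
    """Return the documentation files present in the repository root."""
    return [name for name in DOC_FILES if (repo / name).is_file()]


def fallback_setup_commands(repo: Path) -> list[list[str]]:
    """Return generic install commands for the marker files that exist."""
    return [cmd for marker, cmd in SETUP_COMMANDS if (repo / marker).is_file()]
```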
This is the most important phase. Actually use the software the way a real user would to verify the change works as the PR claims. This is what distinguishes QA from CI (which runs tests) and code review (which reads code).
Do NOT:
- Run the test suite (`pytest`, `npm test`, `cargo test`, etc.). That is CI's job.

DO:
- Running a command with only `--help`, `--dry-run`, or `--version` is NOT functional verification; it only proves argument parsing works. If real execution fails due to missing credentials, external services, or environment constraints, report what you tried and what could not be verified. Do not substitute `--help` output for evidence the software works.

Start by verifying the PR achieves its stated goal. Use the hypothesis from Phase 1. For example:
"Tests pass" is not a QA finding. The question is: does the software actually do what the PR says it does?
For frontend / UI changes:
For CLI changes:
For API / backend changes:
- Make real HTTP requests (`curl`, `httpie`, or a test client) to affected endpoints.

For bug fixes, use a before/after comparison:
For library / SDK changes:
For refactors:
For configuration / CI / docs:
Always show your work with a before/after narrative. For every verification, the report must include: (a) the exact command you ran, (b) the actual output you observed, and (c) your interpretation of that output. For bug fixes and behavioral changes, demonstrate BOTH the broken/old state AND the fixed/new state so the reviewer can see the delta. Present this evidence inside collapsible `<details>` blocks; the core deliverable is the verdict and summary, not raw logs.
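One possible way to capture the three required pieces of evidence is sketched below. It assumes the base branch and the PR branch are checked out in two separate directories; the `Evidence` record and both function names are hypothetical.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Evidence:
    command: list[str]   # (a) the exact command that was run
    output: str          # (b) the output actually observed
    interpretation: str  # (c) what that output means


def run_and_record(command: list[str], cwd: str, interpretation: str) -> Evidence:
    """Run a command and capture exactly what it printed (stdout and stderr)."""
    result = subprocess.run(
        command, cwd=cwd, text=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    )
    return Evidence(command, result.stdout, interpretation)


def before_after(command, base_dir, pr_dir, before_note, after_note):
    """Run the same command in a base checkout and a PR checkout."""
    before = run_and_record(command, base_dir, before_note)
    after = run_and_record(command, pr_dir, after_note)
    # The third value says whether the observable output changed at all.
    return before, after, before.output != after.output
```

Merging stderr into stdout keeps error messages in the recorded output, which matters when the "before" state is a crash.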
Some verification approaches will fail due to environment constraints, missing system dependencies, or tooling limitations. That is expected.
The rule: if the same general approach fails after three materially different attempts, stop trying that approach. For example, if three different Playwright configurations all fail to connect to the dev server, do not try a fourth Playwright variation. Switch to a fundamentally different approach (e.g., curl + manual HTML inspection instead of browser automation). If two fundamentally different approaches both fail, give up on that specific verification and say so in the report.
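The stopping rule can be expressed as a small loop. This is only a sketch: each approach is modeled as a name plus a list of attempt callables (a hypothetical interface), and an attempt that raises is treated as a failed attempt.

```python
from typing import Callable

MAX_ATTEMPTS_PER_APPROACH = 3  # "three materially different attempts"
MAX_APPROACHES = 2             # "two fundamentally different approaches"

Attempt = Callable[[], bool]   # returns True when the verification succeeded


def verify_with_fallback(approaches: list[tuple[str, list[Attempt]]]):
    """Try approaches in order; return (verified, log of what was tried)."""
    log = []
    for name, attempts in approaches[:MAX_APPROACHES]:
        for number, attempt in enumerate(attempts[:MAX_ATTEMPTS_PER_APPROACH], 1):
            try:
                ok = attempt()
            except Exception as exc:  # an environment failure counts as a failed attempt
                ok = False
                log.append(f"{name} attempt {number}: error: {exc}")
            else:
                log.append(f"{name} attempt {number}: {'ok' if ok else 'failed'}")
            if ok:
                return True, log
    # Both approaches exhausted: give up and report the log honestly.
    return False, log
```

The returned log doubles as the "what was attempted" record for the report when a verification has to be abandoned.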
When giving up on a verification:
- Suggest guidance for `AGENTS.md` (or a custom `/qa-changes` skill) that would help future QA runs succeed, for example: which port the dev server runs on, what system packages are required, how to configure browser automation, or what the expected test output looks like.

Do not silently skip verification. An honest "I could not verify X because Y" is far more valuable than a false "everything works."
Post a structured report as a PR review using the GitHub API. Keep the report scannable. A reviewer should grasp the verdict and key results in under 10 seconds. Put lengthy evidence (logs, code snippets, full command output) inside collapsible `<details>` blocks so the top-level report stays compact.
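A minimal sketch of the posting step, using the GitHub REST endpoint for creating a pull request review. The function name and parameters are hypothetical, and submitting with the `COMMENT` event (rather than `APPROVE` or `REQUEST_CHANGES`) is an assumption; the skill text does not say which event to use.

```python
import json
import urllib.request


def build_review_request(owner: str, repo: str, pr_number: int,
                         report_markdown: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that posts the report as a PR review."""
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{pr_number}/reviews"
    # "COMMENT" is an assumed choice; the event could also be a formal verdict.
    payload = {"body": report_markdown, "event": "COMMENT"}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )

# Sending is a separate, explicit step:
#   with urllib.request.urlopen(request) as response:
#       print(response.status)
```

Building the request separately from sending it makes the payload easy to inspect before anything is posted to the PR.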
## {verdict_emoji} QA Report: {VERDICT}
{One-sentence summary of what was verified and the outcome.}
### Does this PR achieve its stated goal?
{Direct answer: Yes / Partially / No.}
{2-3 sentences explaining WHY, referencing specific evidence from
exercising the software. For bug fixes: is the bug actually fixed?
For features: does the new capability work end-to-end? For refactors:
is the restructuring achieved without changing behavior? Be specific
about what the goal was and whether the changes deliver on it.}
| Phase | Result |
|-------|--------|
| Environment Setup | {emoji} {one-line status} |
| CI Status | {emoji} {one-line note from CI checks, e.g. "all green" or "2 checks failing"} |
| Functional Verification | {emoji} {one-line status} |
<details><summary>Functional Verification</summary>
{Structure each verification as a before/after narrative:
### Test N: {Description}
**Step 1 — Reproduce / establish baseline (without the fix):**
Ran `{exact command}`:
{actual output}
This shows {interpretation — what the output means, e.g. "the bug
exists because..."}.
**Step 2 — Apply the PR's changes:**
{What was done — e.g. checked out the PR branch, set env var, etc.}
**Step 3 — Re-run with the fix in place:**
Ran `{same or equivalent command}`:
{actual output}
This shows {interpretation — e.g. "the fix works because the error
is gone and the expected result appears"}.
Repeat for each changed behavior. For non-bug-fix changes
(features, refactors), the baseline step may simply describe the
prior state rather than reproducing a failure.}
</details>
<details><summary>Unable to Verify</summary>
{What could not be verified, what was attempted, and suggested
AGENTS.md guidance. Omit this section entirely if everything
was verified.}
</details>
### Issues Found
{List concrete problems, or "None." if clean.}
- 🔴 **Blocker**: ...
- 🟠 **Issue**: ...
- 🟡 **Minor**: ...
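- Put evidence in `<details>` blocks. Any code block, log excerpt, or command output longer than ~4 lines belongs inside a collapsible. Reviewers who want proof can expand; others can skip.
- If everything was verified, omit the "Unable to Verify" `<details>` block entirely.
- Do not run `pytest`, `npm test`, or equivalent test suites. That is CI's job.
- When a verification could not be completed, suggest `AGENTS.md` improvements.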
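The recurring pieces of the report template can be assembled with a few helpers. This is a sketch under stated assumptions: the function names are hypothetical, the severity emoji are taken from the template's "Issues Found" section, and the status emoji for each phase is left to the caller because the template does not fix it.

```python
SEVERITY_EMOJI = {"Blocker": "🔴", "Issue": "🟠", "Minor": "🟡"}


def render_phase_table(phases: list[tuple[str, str, str]]) -> str:
    """Render (phase, emoji, one-line status) rows as the template's table."""
    lines = ["| Phase | Result |", "|-------|--------|"]
    lines += [f"| {phase} | {emoji} {status} |" for phase, emoji, status in phases]
    return "\n".join(lines)


def render_issues(issues: list[tuple[str, str]]) -> str:
    """Render (severity, description) pairs, or "None." when the list is empty."""
    if not issues:
        return "None."
    return "\n".join(
        f"- {SEVERITY_EMOJI[severity]} **{severity}**: {text}"
        for severity, text in issues
    )


def render_details(summary: str, body: str) -> str:
    """Wrap lengthy evidence in a collapsible block."""
    return f"<details><summary>{summary}</summary>\n\n{body}\n\n</details>"
```

The blank lines around the body in `render_details` are there so that Markdown inside the collapsible block still renders on GitHub.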