skills/pr-evidence/SKILL.md
Use when a code review finding needs proof — write a focused test in JavaScript or Go that either confirms the issue is real or exposes it as over-engineering hyperbole. Trigger after code-review or code-review-skill findings are presented and evidence is requested.
npx skillsauth add stevefeldman/agents-skills pr-evidenceInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Turn a code review finding into a concrete, runnable test that either proves the concern is real or demonstrates it is exaggerated. Every claim in a review should be verifiable. This skill produces that verification.
A finding without evidence is an opinion. Write the minimal test that settles the question — and be willing to conclude the reviewer was wrong.
Non-negotiable rules:
Before evaluating any finding, verify every assumption the reviewer made about library behavior, runtime semantics, or framework internals:
node_modules/<pkg>/src/ or the built file). Do not assume what a prop, method, or config key does.Skipping this step caused two Critical findings in a real review to be completely wrong (a library
contentIdassumed to be a CMS key was actually an ARIAid; a subtraction assumed to produceNaNwas actually coerced by the JS-operator). A third finding was falsely confirmed as Critical because the confirming test used inputs the calling code structurally prevents — the function was only "broken" for states the system cannot produce.
Answer all four questions. Record your answers explicitly — do not skip.
| # | Question | If YES | If NO | |---|----------|--------|-------| | Q1 | Is this code in a measured hot path? | Performance concern may be real | Benchmark is likely premature | | Q2 | Does the scale (data size, frequency, scenario) actually occur in production? | Test at realistic scale | Finding is probably hypothetical | | Q3 | Does the framework, runtime, or language already handle this? | Finding is likely noise | Investigate further | | Q4 | Does the recommended fix introduce new complexity (abstraction, guard, indirection)? | Weigh cost vs. gain carefully | Fix is low-risk |
Scoring:
Record your answers in the Evidence Report as: Q1:Y Q2:N Q3:N Q4:Y.
Separately evaluate whether the recommended fix itself is over-engineered, independent of whether the finding is real.
Answer these questions about the fix:
| Question | Signal | |----------|--------| | Does the fix add a guard for a code path that callers already guarantee won't occur? | Fix may be unnecessary | | Does the fix extract an abstraction used in fewer than 3 places? | DRY is premature | | Does the fix add complexity to handle a scenario that has never occurred and has no test proving it can? | Fix may be speculative | | Would a future reader be confused by the "fixed" code without the PR context? | Fix may obscure intent | | Does the fix add lines of code without changing the observable output in any test? | Fix adds churn |
If 2+ signals are present: the recommendation may be over-engineered even if the underlying observation is correct. Note this in the report — a valid observation can still lead to a bad recommendation.
Write one minimal, falsifiable test per finding. The test must be designed so it can fail either way — a test that can only confirm is not evidence.
Finding received
│
▼
[Step 1] Read source / trace call path / verify runtime behavior
│
▼
[Step 2] Hyperbole Check — answer Q1–Q4
│
├─ 2+ NO → Write REFUTING test (try to prove reviewer wrong)
│
└─ Passes → Write CONFIRMING test (try to prove reviewer right)
│
▼
[Step 3] Over-Engineering Check on the fix
│
┌─────────────────┘
▼
Run the test — observe actual output
│
▼
Record in Evidence Report
Performance Finding — "This is O(n²) and will be slow."
Benchmark at realistic production scale (grep for actual limits in schema/config — never use worst-case academic n).
test('nested loop perf at actual max scale (50 items)', () => {
// actual max dataset size from codebase: 50 items (checked via grep/schema)
const items = Array.from({ length: 50 }, (_, i) => ({ id: i, value: `item-${i}` }));
const start = performance.now();
const results = items.flatMap(a => items.map(b => a.id === b.id));
const elapsed = performance.now() - start;
expect(elapsed).toBeLessThan(5); // verdict: O(n²) at n=50 is 2500 ops, not a bottleneck
});
func BenchmarkNestedLoopAtScale(b *testing.B) {
items := make([]Item, 50) // realistic max from schema/config
for i := range items { items[i] = Item{ID: i, Value: fmt.Sprintf("item-%d", i)} }
b.ResetTimer()
for range b.N {
for _, a := range items {
for _, bItem := range items { _ = a.ID == bItem.ID }
}
}
// if ns/op < 10000 (10µs), claim is hyperbole at this scale
}
Missing Abstraction Finding — "This should be extracted into a reusable helper / design pattern."
Count actual call sites. If fewer than 3, DRY is premature.
test('abstraction cost exceeds benefit at current call-site count', () => {
const createUser = (name, role) => ({ name, role, createdAt: new Date() });
class UserFactory {
static create(name, role) { return { name, role, createdAt: new Date() }; }
}
const inline = createUser('alice', 'admin');
const factored = UserFactory.create('alice', 'admin');
expect(inline).toMatchObject({ name: 'alice', role: 'admin' });
expect(factored).toMatchObject({ name: 'alice', role: 'admin' });
// verdict: factory adds a class, import, and indirection for zero behavioral gain at 2 call sites
});
Error Handling Finding — "This function needs more defensive checks."
Trace all callers. If inputs are guaranteed by internal contracts, the validation is dead code.
func TestParseConfigNeverReceivesNil(t *testing.T) {
cfg := &Config{Host: "localhost", Port: 8080}
result, err := parseConfig(cfg)
require.NoError(t, err)
require.NotNil(t, result)
// verdict: nil guard would be unreachable — adds complexity for an impossible path
}
Security / Input Validation Finding — "User input here is not sanitized."
Trace the input source. If it originates from an auth-gated system or enum, it may not need sanitization.
test('userId comes from verified JWT, not raw user input', () => {
const mockReq = { user: { id: '123' } }; // set by auth middleware after JWT verification
expect(mockReq.user.id).toMatch(/^\d+$/); // JWT sub claim is always numeric string
// verdict: field is constrained by JWT structure — no sanitization attack surface
});
Complexity / Coupling Finding — "This is too tightly coupled."
Test the boundary in isolation. If it mocks cleanly, coupling is probably fine.
func TestHandlerInIsolation(t *testing.T) {
store := &stubUserStore{user: &User{ID: "1", Name: "Alice"}}
handler := NewUserHandler(store)
req := httptest.NewRequest("GET", "/users/1", nil)
rec := httptest.NewRecorder()
handler.GetUser(rec, req)
assert.Equal(t, 200, rec.Code)
// verdict: handler tests cleanly in isolation — coupling concern is overstated
}
type stubUserStore struct{ user *User }
func (s *stubUserStore) Get(id string) (*User, error) { return s.user, nil }
Before running any confirming test — one designed to prove a bug exists — answer:
If the confirming state requires violating a caller-enforced invariant, do not declare CONFIRMED. The finding is a fragility: the code behaves incorrectly for inputs it should never receive. Adjust the verdict to NUANCED, cap severity at Medium, and document the invariant being relied on.
A test that proves a bug using unreachable inputs is not evidence of a production bug — it is evidence of a code smell worth noting at lower severity.
Do not report results without running the test. Execute using the project's test runner:
npx jest --testPathPattern=<evidence-file> --verbose --no-coveragego test -run TestFindingName ./... or go test -bench BenchmarkName ./...If dependencies are not installed, install them (npm ci, go mod download) before running. Confirm the test output before writing the verdict.
Paste or summarize the actual test output in the Evidence Report.
After running all tests, produce this report. Every finding evaluated must appear in the table.
PR: [number and title] Findings evaluated: [N] Tests written: [N] | Tests run: [N] | Tests passed: [N]
| Finding | Hyperbole Check | Over-Engineering Check | Test Result | Verdict | Severity Change | |---------|----------------|----------------------|-------------|---------|----------------| | #N — [title] | Q1:? Q2:? Q3:? Q4:? | [Pass / 🔴 Over-engineered] | ✅ Pass / ❌ Fail | CONFIRMED / REFUTED / NUANCED | Critical→Low / No change |
For each CONFIRMED finding:
## Evidence: [Finding Title]
**Claim:** [exact reviewer claim]
**Hyperbole Check:** Q1:Y Q2:Y Q3:N Q4:N — passes
**Over-Engineering Check:** Fix is proportionate
**Verdict:** CONFIRMED
**Test:** [filename:line or inline snippet]
**Result:** [actual test output]
**Conclusion:** [1–2 sentences on what specifically makes this a real concern]
For each REFUTED or NUANCED finding:
## Evidence: [Finding Title]
**Claim:** [exact reviewer claim]
**Hyperbole Check:** Q1:N Q2:N Q3:Y Q4:Y — fails (2+ NO)
**What the reviewer got wrong:** [specific error — wrong assumption about library, operator, scale, etc.]
**Verdict:** REFUTED / NUANCED
**Test:** [filename:line or inline snippet]
**Result:** [actual test output]
**Conclusion:** [1–2 sentences. State what the reviewer assumed vs. what the evidence shows.
If nuanced, explain under what conditions the finding would become real.]
| Claim Pattern | Likely Verdict | Evidence Strategy | |---------------|----------------|-------------------| | "This will be slow at scale" | Refuted if n < 1000 | Benchmark at actual max n | | "This needs a design pattern" | Refuted if < 3 call sites | Count call sites, show inline suffices | | "This isn't reusable" | Refuted if only used once | Show single-use is intentional | | "This should have error handling" | Refuted if caller guarantees valid input | Trace input origin | | "This is too complex" | Context-dependent | Cyclomatic complexity metric + readability test | | "This has a security risk" | Refuted if input is bounded | Trace input source to auth boundary | | "This couples X to Y" | Refuted if mockable | Write isolation test | | "This prop/key does X" | Refuted if source says otherwise | Read the library source | | "This operator produces NaN/error" | Refuted if language coerces | Write a minimal arithmetic proof |
This skill is the evidence layer on top of code-review and code-review-skill. Typical flow:
code-review-skill or code-review produces findingspr-evidence — run all 6 steps for every findingdevelopment
Use when reviewing Dependabot alerts, npm audit findings, govulncheck output, or CVE reports on a JavaScript/Node.js or Go project — especially when triaging multiple alerts across direct and transitive dependencies to assess real-world risk and produce a remediation plan.
development
Produce data-driven software delivery estimates by analyzing historical JIRA tickets, git activity, and engineer track records, then matching the new work against the most similar past tickets. Use this skill whenever the user asks "how long will this take", wants to estimate a piece of work, scope an epic, plan a sprint, or estimate delivery for JIRA stories or a Figma design. Also use whenever the user wants developer-to-work assignment recommendations based on history, wants to optimize an estimate by adding or reallocating engineers, or asks "what's the fastest way to ship this" or "who should work on this". Especially trigger when the user provides JIRA ticket IDs, JIRA story links, or Figma designs together with any indication of a team that will execute the work.
tools
Use when auditing an existing test suite for quality and coverage gaps, evaluating Playwright migration readiness, scoring automation against a world-class e-commerce standard, or guiding the creation of new tests. Applicable to Selenium, WebdriverIO, and Playwright suites.
development
PR-focused code review with structured checklist, severity ratings, and actionable feedback