# QA & Testing Engine: Complete Software Quality System
> The definitive testing methodology for AI agents. From test strategy to execution, coverage to reporting: everything you need to ship quality software.

## Phase 1: Test Strategy Design

Before writing a single test, design the strategy.

### Strategy Brief Template
```yaml
project:
  name: ""
  type: web-app | api | mobile | library | cli | data-pipeline
  languages: [typescript, python, go, java]
  frameworks: [react, express, django, spring]
risk_profile:
  data_sensitivity: low | medium | high | critical  # PII, financial, health
  user_impact: internal | b2b | b2c | life-safety
  deployment_frequency: daily | weekly | monthly
  regulatory: [none, SOC2, HIPAA, PCI-DSS, GDPR]
test_scope:
  in_scope: []      # Features, services, components
  out_of_scope: []  # Explicitly excluded (with reason)
environments:
  dev: { url: "", db: "local" }
  staging: { url: "", db: "seeded" }
  prod: { url: "", smoke_only: true }
```

| Risk Profile | Unit | Integration | E2E | Performance | Security | Accessibility |
|---|---|---|---|---|---|---|
| Internal tool | ✅ Core | ✅ API | ⚠️ Happy path | ❌ | ⚠️ Basic | ❌ |
| B2B SaaS | ✅ Full | ✅ Full | ✅ Critical flows | ✅ Load | ✅ OWASP Top 10 | ✅ WCAG AA |
| B2C high-traffic | ✅ Full | ✅ Full | ✅ Full | ✅ Stress + soak | ✅ Full | ✅ WCAG AA |
| Financial/Health | ✅ Full + mutation | ✅ Full + contract | ✅ Full + chaos | ✅ Full suite | ✅ Pen test | ✅ WCAG AAA |

```
        /   E2E   \          5-10%: Critical user journeys only
      / Integration \        20-30%: API contracts, service boundaries
    /   Unit Tests    \      60-70%: Business logic, pure functions
```

Anti-pattern: Ice cream cone — More E2E than unit tests. Slow, flaky, expensive. Fix by pushing test coverage DOWN the pyramid.
Anti-pattern: Hourglass — Lots of unit + E2E, no integration. Misses contract bugs between services.
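
The pyramid and its two anti-patterns can be checked mechanically from test counts per level. The following is a minimal sketch, not part of the original methodology: `classifySuiteShape` and its types are hypothetical names, and treating "fewer integration tests than E2E tests" as the hourglass condition is an assumption.

```typescript
// Hypothetical helper: classify a suite's shape from test counts per level.
type SuiteCounts = { unit: number; integration: number; e2e: number };
type SuiteShape = "pyramid" | "ice-cream-cone" | "hourglass" | "other";

function classifySuiteShape(counts: SuiteCounts): SuiteShape {
  const total = counts.unit + counts.integration + counts.e2e;
  if (total === 0) return "other";

  // Ice cream cone: more E2E tests than unit tests.
  if (counts.e2e > counts.unit) return "ice-cream-cone";

  // Hourglass (assumed definition): integration is thinner than the E2E layer above it.
  if (counts.integration < counts.e2e) return "hourglass";

  // Pyramid: each layer is at least as large as the one above it.
  if (counts.unit >= counts.integration) return "pyramid";

  return "other";
}
```

Running a check like this in CI makes a drift toward the ice cream cone visible before the suite becomes slow and flaky.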
Every unit test follows this structure:
```typescript
describe('PricingCalculator', () => {
  // Group by behavior, not by method
  describe('when customer has volume discount', () => {
    it('applies tiered pricing above threshold', () => {
      // ARRANGE: Set up the scenario
      const calculator = new PricingCalculator();
      const customer = createCustomer({ tier: 'enterprise', units: 150 });

      // ACT: Execute the behavior under test
      const price = calculator.calculate(customer);

      // ASSERT: Verify the outcome (ONE logical assertion)
      expect(price).toEqual({
        subtotal: 12000,
        discount: 1800, // 15% volume discount
        total: 10200,
      });
    });
  });
});
```
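
The test above refers to a `PricingCalculator` and a `createCustomer` factory that the document does not define. As a hedged sketch of what the code under test might look like: the unit price of 80 and the threshold of 100 units are assumptions chosen only so that the numbers match the assertion.

```typescript
type Customer = { tier: "standard" | "enterprise"; units: number };
type Price = { subtotal: number; discount: number; total: number };

// Hypothetical factory: sensible defaults, overridden per test.
function createCustomer(overrides: Partial<Customer> = {}): Customer {
  return { tier: "standard", units: 1, ...overrides };
}

class PricingCalculator {
  // Assumed values, picked to reproduce subtotal 12000 / discount 1800 for 150 units.
  private static readonly UNIT_PRICE = 80;
  private static readonly VOLUME_THRESHOLD = 100;
  private static readonly VOLUME_DISCOUNT_PERCENT = 15;

  // The tier is carried on the customer but not used in this sketch;
  // the discount depends on units only.
  calculate(customer: Customer): Price {
    const subtotal = customer.units * PricingCalculator.UNIT_PRICE;
    const qualifies = customer.units > PricingCalculator.VOLUME_THRESHOLD;
    const discount = qualifies
      ? (subtotal * PricingCalculator.VOLUME_DISCOUNT_PERCENT) / 100
      : 0;
    return { subtotal, discount, total: subtotal - discount };
  }
}
```

Because `calculate` is a pure function of its input, the test needs no mocks, which is what keeps it at the base of the pyramid.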
Format: `[unit] [scenario] [expected behavior]`

✅ Good:

- `PricingCalculator applies 15% discount when units exceed 100`
- `UserService throws NotFoundError when user ID is invalid`
- `parseDate returns null for malformed ISO strings`

❌ Bad:

- `test1`, `should work`, `calculates price`

catch block, every validation path

| Dependency Type | Strategy | Example |
|---|---|---|
| Database | Mock the repository/DAO | `jest.mock('./userRepo')` |
| HTTP API | Mock the client or use MSW | `msw.http.get('/api/users', ...)` |
| File system | Mock fs or use temp dirs | `jest.mock('fs/promises')` |
| Time/Date | Fake timers | `jest.useFakeTimers()` |
| Randomness | Seed or mock | `jest.spyOn(Math, 'random')` |
| Environment | Override env vars | `process.env.NODE_ENV = 'test'` |

Rule: Mock at boundaries, not internals. If you're mocking a class you own, your design might need refactoring.
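
One way to apply this rule without a mocking library is to put the boundary behind an interface and hand the code under test an in-memory fake. This is an illustrative sketch: `UserRepo`, `UserService`, and `InMemoryUserRepo` are hypothetical names, and the repository is synchronous only to keep the example short.

```typescript
type User = { id: string; email: string };

// The boundary: everything that talks to the database sits behind this interface.
interface UserRepo {
  findById(id: string): User | undefined;
}

class NotFoundError extends Error {}

// Code under test depends on the interface, not on a concrete database client.
class UserService {
  private readonly repo: UserRepo;

  constructor(repo: UserRepo) {
    this.repo = repo;
  }

  getEmail(id: string): string {
    const user = this.repo.findById(id);
    if (user === undefined) throw new NotFoundError("user " + id + " not found");
    return user.email;
  }
}

// In-memory fake used only in tests; no internals of UserService are mocked.
class InMemoryUserRepo implements UserRepo {
  private readonly users: User[];

  constructor(users: User[]) {
    this.users = users;
  }

  findById(id: string): User | undefined {
    for (const user of this.users) {
      if (user.id === id) return user;
    }
    return undefined;
  }
}
```

The fake replaces only the database boundary, so the test still exercises the real lookup and error logic in `UserService`.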

| Metric | Minimum | Good | Excellent |
|---|---|---|---|
| Line coverage | 70% | 85% | 95%+ |
| Branch coverage | 60% | 80% | 90%+ |
| Function coverage | 75% | 90% | 95%+ |
| Critical path coverage | 100% | 100% | 100% |

Warning: 100% coverage ≠ quality. Coverage measures what code ran, not what was verified. A test with no assertions has coverage but no value.
For every API endpoint, test:
```yaml
endpoint: POST /api/orders
tests:
  happy_path:
    - Valid request returns 201 with order ID
    - Response matches schema
    - Database record created correctly
    - Events/webhooks fired
  validation:
    - Missing required fields → 400 with field errors
    - Invalid data types → 400 with type errors
    - Business rule violations → 422 with explanation
  authentication:
    - No token → 401
    - Expired token → 401
    - Wrong role → 403
    - Valid token → proceeds
  edge_cases:
    - Duplicate request (idempotency) → same response
    - Concurrent requests → no race condition
    - Maximum payload size → 413 or graceful handling
    - Special characters in input → no injection
  error_handling:
    - Database down → 503 with retry hint
    - External service timeout → 504 or fallback
    - Rate limit exceeded → 429 with retry-after
```
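
To make the status-code expectations concrete, here is a framework-free sketch of a handler that produces the authentication, validation, and happy-path outcomes from the checklist. Everything in it is assumed for illustration: the `handleCreateOrder` name, the `sku`/`quantity` fields, the `customer` role, the 1 to 100 quantity rule, and the placeholder order ID.

```typescript
type Token = { role: "customer" | "viewer"; expiresAt: number };
type CreateOrderInput = {
  token?: Token;
  body: { sku?: unknown; quantity?: unknown };
  now: number; // current time, injected so tests do not depend on the clock
};
type ApiResponse = { status: number; body: Record<string, unknown> };

function handleCreateOrder(input: CreateOrderInput): ApiResponse {
  // Authentication: missing or expired token → 401
  if (input.token === undefined || input.token.expiresAt <= input.now) {
    return { status: 401, body: { error: "unauthorized" } };
  }
  // Authorization: wrong role → 403
  if (input.token.role !== "customer") {
    return { status: 403, body: { error: "forbidden" } };
  }

  // Validation: missing fields or wrong types → 400 with field errors
  const { sku, quantity } = input.body;
  const errors: string[] = [];
  if (typeof sku !== "string" || sku.length === 0) errors.push("sku: required string");
  if (typeof quantity !== "number" || !Number.isInteger(quantity)) {
    errors.push("quantity: required integer");
  }
  // The repeated typeof checks only help the compiler narrow the types below.
  if (errors.length > 0 || typeof sku !== "string" || typeof quantity !== "number") {
    return { status: 400, body: { errors } };
  }

  // Business rule (assumed): quantity must be between 1 and 100 → 422
  if (quantity < 1 || quantity > 100) {
    return { status: 422, body: { error: "quantity must be between 1 and 100" } };
  }

  // Happy path → 201 with an order ID (placeholder value)
  return { status: 201, body: { orderId: "ord_123", sku, quantity } };
}
```

Each branch corresponds to one row of the checklist, so an API test suite can drive the handler with one request per row and assert on `status`.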
When services communicate, test the contract:
```yaml
contract:
  consumer: order-service
  provider: payment-service
  interactions:
    - description: "Process payment"
      request:
        method: POST
        path: /payments
        body:
          amount: 99.99
          currency: USD
          order_id: "ord_123"
      response:
        status: 200
        body:
          payment_id: "pay_xxx"  # string, not null
          status: "completed"    # enum: completed|pending|failed
breaking_changes:  # NEVER do these without versioning
  - Remove a field from response
  - Change a field's type
  - Add a required field to request
  - Change the URL path
  - Change error response format
```
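
Two of the breaking changes listed above, removing a response field and changing a field's type, can be detected by diffing the old and new response shapes. The sketch below assumes a deliberately simplified schema model (a flat map from field name to primitive type); `findBreakingChanges` is a hypothetical helper, not a Pact API.

```typescript
type FieldType = "string" | "number" | "boolean";
type Schema = Record<string, FieldType>;

// Lists breaking changes when a provider's response schema moves from `before` to `after`.
// Added fields are not reported because consumers can ignore them.
function findBreakingChanges(before: Schema, after: Schema): string[] {
  const problems: string[] = [];
  for (const field of Object.keys(before)) {
    if (!(field in after)) {
      problems.push("removed field: " + field);
    } else if (after[field] !== before[field]) {
      problems.push("type changed: " + field + " (" + before[field] + " -> " + after[field] + ")");
    }
  }
  return problems;
}
```

For the payment contract, comparing `{ payment_id: "string", status: "string" }` against a new version that drops `status` would report one removed field, which is the signal to version the API instead of shipping the change.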
- Prefer factories such as `createUser({ role: 'admin' })` over hardcoded SQL dumps.

Identify and test the flows that generate revenue or block users:
```yaml
critical_journeys:
  - name: "Sign up → First value"
    steps:
      - Visit landing page
      - Click sign up
      - Fill registration form
      - Verify email
      - Complete onboarding
      - Perform first key action
    max_duration: 3 minutes
  - name: "Purchase flow"
    steps:
      - Browse products
      - Add to cart
      - Enter shipping
      - Enter payment
      - Confirm order
      - Receive confirmation email
    max_duration: 2 minutes
  - name: "Login → Core task → Logout"
    steps:
      - Login (password + SSO + MFA variants)
      - Navigate to core feature
      - Complete primary workflow
      - Verify result
      - Logout
    max_duration: 1 minute
```
- Use `waitFor(element)`, not `sleep(3000)`

1. `getByRole('button', { name: 'Submit' })`: Accessible, resilient
2. `getByLabelText('Email')`: Form-specific, accessible
3. `getByText('Welcome back')`: Content-based
4. `getByTestId('submit-btn')`: Explicit test hook
5. `querySelector('.btn-primary')`: ❌ Fragile, breaks on CSS changes

| Symptom | Likely Cause | Fix |
|---|---|---|
| Passes locally, fails in CI | Timing/race condition | Add explicit waits, check CI resource limits |
| Fails intermittently | Shared state between tests | Isolate test data, reset state |
| Fails after deploy | Environment difference | Check env vars, API versions, feature flags |
| Fails at specific time | Time-dependent logic | Mock dates/times, avoid time-sensitive assertions |
| Fails in parallel | Resource contention | Use unique ports/DBs per worker |

Rule: Quarantine flaky tests within 24 hours. A flaky test suite that everyone ignores is worse than no tests.
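
Quarantining within 24 hours requires spotting flaky tests automatically. A minimal sketch, assuming CI run history is available as a list of records: a test counts as flaky here if it both passed and failed on the same commit. The `TestRun` shape and `findFlakyTests` are hypothetical.

```typescript
type TestRun = { test: string; commit: string; passed: boolean };

// A test is treated as flaky if the same commit produced both a pass and a fail.
// A test that changes outcome between commits is a real regression, not a flake.
function findFlakyTests(runs: TestRun[]): string[] {
  const outcomesByKey = new Map<string, { test: string; outcomes: Set<boolean> }>();
  for (const run of runs) {
    const key = run.test + "@" + run.commit;
    let entry = outcomesByKey.get(key);
    if (entry === undefined) {
      entry = { test: run.test, outcomes: new Set<boolean>() };
      outcomesByKey.set(key, entry);
    }
    entry.outcomes.add(run.passed);
  }

  const flaky = new Set<string>();
  outcomesByKey.forEach((entry) => {
    if (entry.outcomes.size === 2) flaky.add(entry.test);
  });
  return Array.from(flaky).sort();
}
```

The returned names can feed a quarantine list, so flaky tests stop blocking merges while remaining visible for the weekly review.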
```yaml
performance_tests:
  smoke:
    vus: 5
    duration: 1m
    purpose: "Verify test works"
  load:
    vus: 100  # Expected concurrent users
    duration: 10m
    ramp_up: 2m
    purpose: "Normal traffic behavior"
    thresholds:
      p95_response: <500ms
      error_rate: <1%
  stress:
    vus: 300  # 3x expected load
    duration: 15m
    ramp_up: 5m
    purpose: "Find breaking point"
  soak:
    vus: 80
    duration: 2h
    purpose: "Memory leaks, connection exhaustion"
  spike:
    stages:
      - { vus: 50, duration: 2m }
      - { vus: 500, duration: 30s }  # Sudden spike
      - { vus: 50, duration: 2m }
    purpose: "Recovery behavior"
```

| Metric | Web App | API | Background Job |
|---|---|---|---|
| Response time (p50) | <200ms | <100ms | N/A |
| Response time (p95) | <1s | <500ms | N/A |
| Response time (p99) | <3s | <1s | N/A |
| Throughput | >100 rps | >500 rps | >1000/min |
| Error rate | <0.1% | <0.1% | <0.5% |
| CPU usage | <70% | <70% | <90% |
| Memory growth | <5%/hr | <2%/hr | <10%/hr |

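
Percentile targets like p95 are computed from raw request durations. The sketch below uses the nearest-rank method, which is one common definition (load tools may interpolate differently), and checks the load-profile thresholds of p95 under 500 ms and error rate under 1%. `percentile`, `LoadResult`, and `meetsLoadThresholds` are hypothetical names.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  // Multiply before dividing to keep the rank computation in integers where possible.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  const value = sorted[rank - 1];
  if (value === undefined) throw new Error("percentile out of range");
  return value;
}

type LoadResult = { durationsMs: number[]; failedRequests: number };

// Thresholds from the load profile: p95 < 500 ms and error rate < 1%.
function meetsLoadThresholds(result: LoadResult): boolean {
  const total = result.durationsMs.length;
  if (total === 0) return false;
  const p95 = percentile(result.durationsMs, 95);
  const errorRate = result.failedRequests / total;
  return p95 < 500 && errorRate < 0.01;
}
```

Percentiles matter more than averages here because a handful of multi-second requests can hide behind a healthy mean while still breaching the p95 or p99 target.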
```yaml
db_performance:
  query_tests:
    - name: "Dashboard aggregate query"
      baseline: 50ms
      max_acceptable: 200ms
      with_1M_rows: measure
      with_10M_rows: measure
  index_verification:
    - Run EXPLAIN ANALYZE on all critical queries
    - Verify no sequential scans on tables >10K rows
    - Check index usage statistics weekly
  connection_pool:
    - Test at max connections
    - Verify graceful handling when pool exhausted
    - Monitor connection wait time
```
```yaml
security_tests:
  A01_broken_access_control:
    - [ ] Horizontal privilege escalation (access other user's data)
    - [ ] Vertical privilege escalation (access admin functions)
    - [ ] IDOR (Insecure Direct Object References)
    - [ ] Missing function-level access control
    - [ ] CORS misconfiguration
  A02_cryptographic_failures:
    - [ ] Sensitive data in transit (TLS 1.2+)
    - [ ] Sensitive data at rest (encryption)
    - [ ] Password hashing (bcrypt/argon2, not MD5/SHA)
    - [ ] No secrets in code/logs/URLs
  A03_injection:
    - [ ] SQL injection (parameterized queries)
    - [ ] NoSQL injection
    - [ ] Command injection (OS commands)
    - [ ] XSS (stored, reflected, DOM-based)
    - [ ] Template injection (SSTI)
  A04_insecure_design:
    - [ ] Rate limiting on auth endpoints
    - [ ] Account lockout after N failures
    - [ ] CAPTCHA on public forms
    - [ ] Business logic abuse scenarios
  A05_security_misconfiguration:
    - [ ] Default credentials removed
    - [ ] Error messages don't leak stack traces
    - [ ] Security headers set (CSP, HSTS, X-Frame-Options)
    - [ ] Directory listing disabled
    - [ ] Unnecessary HTTP methods disabled
  A07_auth_failures:
    - [ ] Brute force protection
    - [ ] Session fixation
    - [ ] Session timeout
    - [ ] JWT validation (signature, expiry, issuer)
    - [ ] MFA bypass attempts
```
Test every user input with:
```yaml
injection_payloads:
  sql: ["' OR 1=1--", "'; DROP TABLE users;--", "1 UNION SELECT * FROM users"]
  xss: ["<script>alert(1)</script>", "<img onerror=alert(1) src=x>", "javascript:alert(1)"]
  path_traversal: ["../../etc/passwd", "..\\..\\windows\\system32", "%2e%2e%2f"]
  command: ["; ls -la", "| cat /etc/passwd", "$(whoami)", "`id`"]
boundary_values:
  strings: ["", " ", "a"*10000, null, undefined, "emoji: 🎯", "unicode: é à ü", "rtl: مرحبا"]
  numbers: [0, -1, 2147483647, -2147483648, NaN, Infinity, 0.1+0.2]
  arrays: [[], [null], Array(10000)]
  dates: ["1970-01-01", "2099-12-31", "invalid-date", "2024-02-29", "2023-02-29"]
```
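
The date boundary values show why input validation needs more than a format check: `2023-02-29` matches `YYYY-MM-DD` but is not a real date. Below is a minimal sketch of a validator that handles all five listed values; `isValidIsoDate` is a hypothetical helper and only accepts the plain `YYYY-MM-DD` form.

```typescript
// Returns true only for real calendar dates written as YYYY-MM-DD.
function isValidIsoDate(input: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(input);
  if (match === null) return false;

  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);

  // Date.UTC rolls impossible dates over (Feb 29 in a non-leap year becomes Mar 1),
  // so build the date and confirm that every component survived unchanged.
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}
```

Feeding the whole `dates` list through a function like this in a parameterized test covers the epoch, a far-future date, garbage input, and both leap-year cases in one pass.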

| Need | JavaScript/TS | Python | Go | Java |
|---|---|---|---|---|
| Unit | Vitest / Jest | pytest | testing + testify | JUnit 5 |
| API | Supertest | httpx + pytest | net/http/httptest | RestAssured |
| E2E (browser) | Playwright | Playwright | chromedp | Selenium |
| Performance | k6 | Locust | vegeta | Gatling |
| Contract | Pact | Pact | Pact | Pact |
| Security | ZAP + custom | Bandit + custom | gosec | SpotBugs |

```yaml
pipeline:
  stage_1_fast:  # <2 min, blocks PR
    - Lint + type check
    - Unit tests
    - "Security: dependency scan (npm audit / safety)"
  stage_2_thorough:  # <10 min, blocks merge
    - Integration tests
    - Contract tests
    - "Security: SAST scan"
    - Coverage report + threshold check
  stage_3_confidence:  # <30 min, blocks deploy
    - E2E critical journeys
    - Visual regression (if applicable)
    - "Security: container scan"
  stage_4_post_deploy:  # After deploy to staging
    - Smoke tests against staging
    - Performance baseline check
    - "Security: DAST scan (ZAP)"
  stage_5_production:  # After prod deploy
    - Smoke tests (critical paths only)
    - Synthetic monitoring enabled
    - Canary metrics watching
```
```yaml
test_data_strategy:
  unit_tests:
    approach: factories  # Builder pattern, create exactly what you need
    example: "createUser({ role: 'admin', plan: 'enterprise' })"
  integration_tests:
    approach: seeded_database
    reset: per_test_suite  # Transaction rollback or truncate
    sensitive_data: anonymized  # Never use real PII
  e2e_tests:
    approach: api_setup  # Create data via API before test
    cleanup: after_each  # Delete created data
    isolation: unique_identifiers  # Timestamp or UUID in test data
  performance_tests:
    approach: representative_dataset
    volume: 10x_production  # Test with more data than prod
    generation: faker_libraries  # Realistic but synthetic
```
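
The `factories` approach can be as small as one function with defaults and overrides. This is a sketch of what `createUser` might look like: the `User` fields, the default values, and the counter-based identifiers are all assumptions made for illustration.

```typescript
type User = {
  id: string;
  email: string;
  role: "member" | "admin";
  plan: "free" | "enterprise";
};

// Counter keeps identifiers unique within one test run (the unique_identifiers idea).
let nextUserNumber = 1;

// Each test states only the fields it cares about; everything else gets a valid default.
function createUser(overrides: Partial<User> = {}): User {
  const n = nextUserNumber++;
  return {
    id: "user_" + n,
    email: "user" + n + "@example.test", // synthetic address, never real PII
    role: "member",
    plan: "free",
    ...overrides,
  };
}
```

A call such as `createUser({ role: 'admin', plan: 'enterprise' })` then documents in the test itself which attributes matter to the scenario.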
```yaml
metrics:
  test_suite_health:
    total_tests: 0
    passing: 0
    failing: 0
    skipped: 0  # >5% skipped = tech debt alarm
    flaky: 0    # >2% flaky = quarantine immediately
  coverage:
    line: "0%"
    branch: "0%"
    critical_paths: "0%"  # Must be 100%
  execution:
    unit_duration: "0s"         # Target: <30s
    integration_duration: "0s"  # Target: <5m
    e2e_duration: "0s"          # Target: <15m
    total_ci_time: "0s"         # Target: <20m
  defect_metrics:
    bugs_found_in_test: 0
    bugs_escaped_to_prod: 0
    escape_rate: "0%"  # Target: <5%
    mttr: "0h"         # Mean time to resolve
  trends:  # Track weekly
    new_tests_added: 0
    tests_deleted: 0  # Healthy deletion = removing redundant tests
    coverage_delta: "+0%"
    flake_rate_delta: "+0%"
```
```markdown
# Test Report: [Feature/Sprint/Release]

## Summary
- **Status:** ✅ PASS / ⚠️ PASS WITH RISKS / ❌ FAIL
- **Tests Run:** X | **Passed:** X | **Failed:** X | **Skipped:** X
- **Coverage:** Line X% | Branch X% | Critical 100%
- **Duration:** Xm Xs

## Key Findings
### 🔴 Critical (Block Release)
1. [Finding] - [Impact] - [Fix recommendation]

### 🟡 High (Fix Before Next Release)
1. [Finding] - [Impact] - [Fix recommendation]

### 🟢 Medium/Low (Backlog)
1. [Finding] - [Impact]

## Risk Assessment
- **Untested areas:** [list]
- **Known flaky tests:** [list with ticket IDs]
- **Performance concerns:** [if any]

## Recommendation
[Ship / Ship with monitoring / Hold for fixes]
```

| Dimension | Weight | Scoring |
|---|---|---|
| Test coverage | 20% | <60%=0, 60-70%=5, 70-80%=10, 80-90%=15, 90%+=20 |
| Critical path coverage | 20% | <100%=0, 100%=20 |
| Defect escape rate | 15% | >10%=0, 5-10%=5, 2-5%=10, <2%=15 |
| Test suite speed | 10% | >30m=0, 20-30m=3, 10-20m=7, <10m=10 |
| Flake rate | 10% | >5%=0, 2-5%=3, 1-2%=7, <1%=10 |
| Security test coverage | 10% | None=0, Basic=3, OWASP Top 10=7, Full=10 |
| Documentation | 5% | None=0, Basic=2, Complete=5 |
| Automation ratio | 10% | <50%=0, 50-70%=3, 70-90%=7, 90%+=10 |

Scoring: 0-40 = 🔴 Critical | 41-60 = 🟡 Needs Work | 61-80 = 🟢 Good | 81-100 = 💎 Excellent
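
The rubric translates directly into a scoring function. The sketch below follows the table's point values; where adjacent ranges share a boundary (for example, 70% appears in both "60-70%" and "70-80%"), it assumes the boundary value earns the higher score. The `QualityInputs` shape and function names are hypothetical.

```typescript
type QualityInputs = {
  coveragePct: number;
  criticalPathPct: number;
  escapeRatePct: number;
  suiteMinutes: number;
  flakeRatePct: number;
  security: "none" | "basic" | "owasp-top-10" | "full";
  documentation: "none" | "basic" | "complete";
  automationPct: number;
};

// Sums the eight dimensions of the rubric; the maximum is 100.
function qualityScore(q: QualityInputs): number {
  const coverage =
    q.coveragePct >= 90 ? 20 : q.coveragePct >= 80 ? 15 : q.coveragePct >= 70 ? 10 : q.coveragePct >= 60 ? 5 : 0;
  const critical = q.criticalPathPct >= 100 ? 20 : 0;
  const escape = q.escapeRatePct < 2 ? 15 : q.escapeRatePct < 5 ? 10 : q.escapeRatePct <= 10 ? 5 : 0;
  const speed = q.suiteMinutes < 10 ? 10 : q.suiteMinutes < 20 ? 7 : q.suiteMinutes <= 30 ? 3 : 0;
  const flake = q.flakeRatePct < 1 ? 10 : q.flakeRatePct < 2 ? 7 : q.flakeRatePct <= 5 ? 3 : 0;
  const security = { none: 0, basic: 3, "owasp-top-10": 7, full: 10 }[q.security];
  const documentation = { none: 0, basic: 2, complete: 5 }[q.documentation];
  const automation = q.automationPct >= 90 ? 10 : q.automationPct >= 70 ? 7 : q.automationPct >= 50 ? 3 : 0;
  return coverage + critical + escape + speed + flake + security + documentation + automation;
}

// Maps a score to the rating bands: 0-40, 41-60, 61-80, 81-100.
function rating(score: number): string {
  if (score >= 81) return "Excellent";
  if (score >= 61) return "Good";
  if (score >= 41) return "Needs Work";
  return "Critical";
}
```

As a worked example, 85% coverage (15), full critical-path coverage (20), a 3% escape rate (10), a 12-minute suite (7), a 0.5% flake rate (10), OWASP Top 10 security testing (7), basic documentation (2), and 75% automation (7) add up to 78, which falls in the Good band.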
```yaml
accessibility_checklist:
  level_a:  # Minimum compliance
    - [ ] All images have alt text
    - [ ] All form inputs have labels
    - [ ] Color is not the only visual indicator
    - [ ] Page has proper heading hierarchy (h1→h2→h3)
    - [ ] All functionality available via keyboard
    - [ ] Focus is visible and logical
    - [ ] No content flashes >3 times/second
  level_aa:  # Standard compliance (recommended)
    - [ ] Color contrast ratio ≥4.5:1 (normal text)
    - [ ] Color contrast ratio ≥3:1 (large text)
    - [ ] Text resizable to 200% without loss
    - [ ] Skip navigation links
    - [ ] Consistent navigation across pages
    - [ ] Error suggestions provided
    - [ ] ARIA landmarks for page regions
  tools:
    - axe-core (automated, catches ~30% of issues)
    - Lighthouse accessibility audit
    - Manual keyboard navigation test
    - Screen reader testing (VoiceOver/NVDA)
```
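
The two contrast checks under `level_aa` are numeric and can be automated. The sketch below implements the WCAG 2.x relative-luminance and contrast-ratio formulas for sRGB colors; the function names and the `Rgb` tuple type are choices made for this example.

```typescript
type Rgb = [number, number, number]; // 0-255 per channel

// WCAG 2.x relative luminance of an sRGB color.
function relativeLuminance([r, g, b]: Rgb): number {
  const linearize = (channel: number): number => {
    const s = channel / 255;
    return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b);
}

// Contrast ratio between two colors, from 1 (identical) to 21 (black on white).
function contrastRatio(a: Rgb, b: Rgb): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// Level AA: at least 4.5:1 for normal text, 3:1 for large text.
function meetsAA(foreground: Rgb, background: Rgb, largeText: boolean): boolean {
  return contrastRatio(foreground, background) >= (largeText ? 3 : 4.5);
}
```

Gray `#767676` on white passes AA for normal text at roughly 4.54:1, while the slightly lighter `#777777` falls just short, which is why near-threshold colors are worth testing explicitly.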
```yaml
compatibility_tests:
  when_updating_api:
    - [ ] All existing fields still present in response
    - [ ] No field type changes (string→number)
    - [ ] New required request fields have defaults
    - [ ] Deprecated fields still work (with warning header)
    - [ ] Error format unchanged
    - [ ] Pagination behavior unchanged
    - [ ] Rate limits not reduced
  versioning_strategy:
    - "URL versioning: /v1/users, /v2/users"
    - "Header versioning: Accept: application/vnd.api+json;version=2"
    - Sunset header for deprecated versions
    - Minimum 6-month deprecation notice
```
```yaml
chaos_tests:
  network:
    - Service dependency goes down → graceful degradation?
    - Network latency increases 10x → timeout handling?
    - DNS resolution fails → fallback behavior?
  infrastructure:
    - Database primary fails → replica promotion?
    - Cache (Redis) goes down → DB fallback works?
    - Disk fills up → alerting + graceful failure?
  application:
    - Memory pressure → OOM handling?
    - CPU saturation → request queuing?
    - Certificate expiry → monitoring alert?
  data:
    - Corrupt message in queue → dead letter + alert?
    - Schema migration fails mid-way → rollback works?
    - Clock skew between services → idempotency holds?
```
```yaml
weekly_review:
  monday:
    - Review flaky test quarantine (fix or delete)
    - Check coverage trends (declining = tech debt)
    - Review escaped defects (update test strategy)
  friday:
    - Update test health dashboard
    - Clean up obsolete tests
    - Document new testing patterns discovered
    - Plan next week's testing focus
```

- "Create test strategy for [project/feature]" → Full strategy brief
- "Write unit tests for [function/class]" → AAA pattern tests with edge cases
- "Test this API endpoint: [method] [path]" → Full API test checklist
- "Review these tests for quality" → Test code review with scoring
- "Generate performance test plan" → k6/Locust test design
- "Security test [feature/endpoint]" → OWASP-based test checklist
- "Create test report for [release]" → Formatted test report
- "What's our test health?" → Dashboard with metrics and recommendations
- "Find gaps in our test coverage" → Analysis with prioritized recommendations
- "Help debug this flaky test" → Root cause analysis with fix suggestions
- "Set up CI test pipeline" → Stage-by-stage pipeline config
- "Accessibility audit [page/component]" → WCAG checklist with findings