.claude/skills/multi-ai-verification/SKILL.md
Multi-layer quality assurance with 5-layer verification pyramid (Rules → Functional → Visual → Integration → Quality Scoring). Independent verification with LLM-as-judge and Agent-as-a-Judge patterns. Score 0-100 with ≥90 threshold. Use when verifying code quality, security scanning, preventing test gaming, comprehensive QA, or ensuring production readiness through multi-layer validation.
multi-ai-verification provides comprehensive quality assurance through a 5-layer verification pyramid, from automated rules to LLM-as-judge evaluation.
Purpose: Multi-layer independent verification ensuring production-ready quality
Pattern: Task-based (5 independent verification operations, one per layer)
Key Innovation: 5-layer pyramid (95% automated at base → 0-20% at apex) with independent verification preventing bias and test gaming
Core Principles (validated by tri-AI research):
Quality Gates: All 5 layers must pass for production approval
Use multi-ai-verification when:
                Layer 5: Quality Scoring
            (LLM-as-Judge, 0-20% automated)
                        /\
                       /  \
                 Layer 4: Integration
            (E2E, System, 20-30% automated)
                     /      \
                    /        \
                  Layer 3: Visual
          (UI, Screenshots, 30-50% automated)
                  /            \
                 /              \
                Layer 2: Functional
          (Tests, Coverage, 60-80% automated)
               /                  \
              /                    \
               Layer 1: Rules-Based
        (Linting, Types, Schema, 95% automated)
Principle: Fail fast at automated layers (cheap, fast) before expensive LLM-as-judge evaluation
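The fail-fast principle can be sketched as a small runner that walks the layers from cheapest to most expensive and stops at the first failure. This is a minimal illustration, not part of the skill itself: the `Layer` and `PyramidResult` shapes and the `runPyramid` name are hypothetical.

```typescript
// Hypothetical shape for one verification layer; names are illustrative only.
interface Layer {
  name: string;
  run: () => boolean; // returns true when the layer's gate passes
}

interface PyramidResult {
  passed: boolean;
  executed: string[]; // layers that actually ran, in order
  failedAt: string | null; // first failing layer, if any
}

// Run layers from Layer 1 upward and stop at the first failure,
// so the expensive LLM-as-judge layer is never reached on broken code.
function runPyramid(layers: Layer[]): PyramidResult {
  const executed: string[] = [];
  for (const layer of layers) {
    executed.push(layer.name);
    if (!layer.run()) {
      return { passed: false, executed, failedAt: layer.name };
    }
  }
  return { passed: true, executed, failedAt: null };
}

const demo = runPyramid([
  { name: "rules", run: () => true },
  { name: "functional", run: () => false },
  { name: "quality-scoring", run: () => true },
]);
console.log(demo.failedAt, demo.executed);
```

In this sketch the third layer never runs, which is the cost saving the principle is after.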
Purpose: Automated validation of code structure, formatting, types
Automation: 95% automated
Speed: Seconds (fast feedback)
Confidence: High (deterministic)
Process:
Schema Validation (if applicable):
# Validate JSON/YAML against schemas
ajv validate -s plan.schema.json -d plan.json
ajv validate -s task.schema.json -d "tasks/*.json"
Linting:
# JavaScript/TypeScript
npx eslint "src/**/*.{ts,tsx,js,jsx}"
# Python
pylint src/
# Expected: Zero linting errors
Type Checking:
# TypeScript
npx tsc --noEmit
# Python
mypy src/
# Expected: Zero type errors
Format Validation:
# Check formatting
npx prettier --check "src/**/*.{ts,tsx}"
# Or auto-fix
npx prettier --write "src/**/*.{ts,tsx}"
Security Scanning (SAST):
# Static security analysis (Semgrep is a Python tool; install it with pip, Homebrew, or Docker rather than npx)
semgrep scan --config auto src/
# Or for Python
bandit -r src/
# Check for:
# - Hardcoded secrets
# - SQL injection risks
# - XSS vulnerabilities
# - Insecure dependencies
Generate Layer 1 Report:
# Layer 1: Rules-Based Verification
## Schema Validation
✅ plan.json validates
✅ All task files validate
## Linting
✅ 0 linting errors
⚠️ 3 warnings (non-blocking)
## Type Checking
✅ 0 type errors
## Formatting
✅ All files formatted correctly
## Security Scan (SAST)
✅ No critical vulnerabilities
⚠️ 1 medium: Weak password hashing rounds (bcrypt)
**Layer 1 Status**: ✅ PASS (0 critical issues)
**Issues to Address**: 1 medium security issue
Outputs:
Validation:
Time Estimate: 15-30 minutes (mostly automated)
Gate 1: ✅ PASS if no critical issues (warnings acceptable)
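As a rough sketch of the Gate 1 rule, the decision reduces to "block only on critical findings". The `Finding` shape and the severity names are assumptions for illustration; how lint or type errors map onto severities is left to the project.

```typescript
// Hypothetical finding record; the severity scale is an assumption.
type Severity = "critical" | "medium" | "warning";

interface Finding {
  source: string; // e.g. "lint", "types", "sast"
  severity: Severity;
  message: string;
}

interface Gate1Result {
  pass: boolean;
  blocking: Finding[]; // critical findings that fail the gate
  toAddress: Finding[]; // non-blocking findings to fix later
}

// Gate 1: pass when there are no critical findings; warnings are acceptable.
function evaluateGate1(findings: Finding[]): Gate1Result {
  const blocking = findings.filter((f) => f.severity === "critical");
  const toAddress = findings.filter((f) => f.severity !== "critical");
  return { pass: blocking.length === 0, blocking, toAddress };
}

// Findings taken from the sample Layer 1 report above.
const layer1Findings: Finding[] = [
  { source: "lint", severity: "warning", message: "3 warnings (non-blocking)" },
  { source: "sast", severity: "medium", message: "Weak password hashing rounds (bcrypt)" },
];
console.log(evaluateGate1(layer1Findings).pass);
```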
Purpose: Validate functionality through test execution and coverage
Automation: 60-80% automated
Speed: Minutes (medium feedback)
Confidence: High (measurable outcomes)
Process:
Execute Complete Test Suite:
# Run all tests with coverage
npm test -- --coverage --verbose
# Capture results
# - Tests passed/failed
# - Coverage metrics
# - Execution time
Validate Example Code (from documentation):
# Extract examples from SKILL.md
# Execute each example automatically
# Verify outputs match expected
# Target: ≥90% examples work
Check Coverage:
# Coverage Report
**Line Coverage**: 87% ✅ (gate: ≥80%)
**Branch Coverage**: 82% ✅
**Function Coverage**: 92% ✅
**Path Coverage**: 74% ⚠️ (below 80%, non-blocking)
**Gate Status**: PASS ✅ (line coverage ≥80%)
**Uncovered Code**:
- src/admin/legacy.ts: 23% (low priority)
- src/utils/deprecated.ts: 15% (deprecated, ok)
Regression Testing (for updates):
# Compare before/after
git diff main...feature --stat
# Run all tests
npm test
# Verify: No new failures (regression prevention)
Performance Validation:
# Run performance tests
npm run test:performance
# Check response times
# Verify: Within acceptable ranges
Generate Layer 2 Report:
# Layer 2: Functional Verification
## Test Execution
✅ 245/245 tests passing (100%)
⏱️ Execution time: 8.3 seconds
## Coverage
✅ Line: 87% (gate: ≥80%)
✅ Branch: 82%
✅ Function: 92%
## Example Validation
✅ 18/20 examples work (90%)
❌ 2 examples fail (outdated)
## Regression
✅ All existing tests still pass
## Performance
✅ All endpoints <200ms
**Layer 2 Status**: ✅ PASS
**Issues**: 2 outdated examples (update docs)
Outputs:
Validation:
Time Estimate: 30-60 minutes
Gate 2: ✅ PASS if tests pass + coverage ≥80%
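The Gate 2 condition can be expressed as a small check, sketched here under the assumption that the 80% threshold applies to line coverage (as the sample report suggests). The `TestRun` and `Coverage` shapes and the `evaluateGate2` name are hypothetical.

```typescript
// Hypothetical summaries of a test run and a coverage report.
interface TestRun {
  passed: number;
  total: number;
}

interface Coverage {
  line: number; // percentages, 0-100
  branch: number;
  functions: number;
}

interface Gate2Result {
  pass: boolean;
  allTestsPass: boolean;
  coverageOk: boolean;
}

// Gate 2: every test passes and line coverage meets the threshold.
function evaluateGate2(run: TestRun, coverage: Coverage, minLineCoverage = 80): Gate2Result {
  const allTestsPass = run.total > 0 && run.passed === run.total;
  const coverageOk = coverage.line >= minLineCoverage;
  return { pass: allTestsPass && coverageOk, allTestsPass, coverageOk };
}

// Numbers from the sample Layer 2 report above.
console.log(evaluateGate2({ passed: 245, total: 245 }, { line: 87, branch: 82, functions: 92 }));
```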
Purpose: Validate UI appearance, layout, accessibility (for UI features)
Automation: 30-50% automated
Speed: Minutes-Hours
Confidence: Medium (subjective elements)
Process:
Screenshot Generation:
# Generate screenshots of UI (enable capture with `use: { screenshot: 'on' }` in playwright.config.ts)
npx playwright test
# Or manually:
# Open application
# Capture screenshots of key views
Visual Comparison (if previous version exists):
# Compare against baseline
npx playwright test --update-snapshots=missing
# Or use Percy/Chromatic for visual regression
npx percy upload screenshots/
Layout Validation:
# Visual Checklist
## Layout
- [ ] Components positioned correctly
- [ ] Spacing/margins match mockup
- [ ] Alignment proper
- [ ] No overlapping elements
## Styling
- [ ] Colors match design system
- [ ] Typography correct (fonts, sizes)
- [ ] Icons/images display properly
## Responsiveness
- [ ] Mobile view (320px-480px)
- [ ] Tablet view (768px-1024px)
- [ ] Desktop view (>1024px)
Accessibility Testing:
# Automated accessibility scan (axe-core is a library; its CLI is @axe-core/cli and takes a URL)
npx @axe-core/cli http://localhost:3000
# Check WCAG compliance
npx pa11y http://localhost:3000
# Manual checks:
# - Keyboard navigation
# - Screen reader compatibility
# - Color contrast ratios
Generate Layer 3 Report:
# Layer 3: Visual Verification
## Screenshot Comparison
✅ Login page matches mockup
✅ Dashboard layout correct
⚠️ Profile page: Avatar alignment off by 5px
## Responsiveness
✅ Mobile: All components visible
✅ Tablet: Layout adapts correctly
✅ Desktop: Full functionality
## Accessibility
✅ WCAG 2.1 AA compliance
✅ Keyboard navigation works
⚠️ 2 color contrast warnings (non-critical)
**Layer 3 Status**: ✅ PASS (minor issues acceptable)
**Issues**: Avatar alignment (cosmetic), contrast warnings
Outputs:
Validation:
Time Estimate: 30-90 minutes (skip if no UI)
Gate 3: ✅ PASS if no critical visual/a11y issues
Purpose: Validate system-level integration, data flow, API compatibility
Automation: 20-30% automated
Speed: Hours (complex)
Confidence: Medium-High
Process:
Component Integration Tests:
# Run integration test suite
npm test -- tests/integration/
# Verify components work together
# - Database ↔ API
# - API ↔ Frontend
# - Frontend ↔ User
Data Flow Validation:
# Data Flow Verification
**Flow 1: User Registration**
Frontend form → API endpoint → Validation → Database → Email service
✅ Data flows correctly
✅ No data loss
✅ Transactions atomic
**Flow 2: Authentication**
Login request → API → Database lookup → Token generation → Response
✅ Token generated correctly
✅ Session stored
✅ Response includes token
API Integration Tests:
# Test all API endpoints
npm run test:api
# Verify:
# - All endpoints respond
# - Status codes correct
# - Response formats match spec
# - Error handling works
End-to-End Workflow Tests:
// Complete user journeys
// Assumes `api` (an HTTP client such as an axios instance), `userData`,
// `credentials`, and `confirmLink` are provided by the test setup.
test('Complete registration and login flow', async () => {
  // 1. Register new user
  const registerResponse = await api.post('/register', userData);
  expect(registerResponse.status).toBe(201);

  // 2. Confirm email
  const confirmResponse = await api.get(confirmLink);
  expect(confirmResponse.status).toBe(200);

  // 3. Login
  const loginResponse = await api.post('/login', credentials);
  expect(loginResponse.status).toBe(200);
  expect(loginResponse.data.token).toBeDefined();

  // 4. Access protected resource with the issued token
  const profileResponse = await api.get('/profile', {
    headers: { Authorization: `Bearer ${loginResponse.data.token}` }
  });
  expect(profileResponse.status).toBe(200);
});
Dependency Compatibility:
# Check external dependencies work
npm audit
# Check for breaking changes
npm outdated
# Verify integration with services
# - Database connection
# - Redis/cache
# - External APIs
Generate Layer 4 Report:
# Layer 4: Integration Verification
## Component Integration
✅ 12/12 integration tests passing
✅ All components integrate correctly
## Data Flow
✅ All 5 data flows validated
✅ No data loss or corruption
## API Integration
✅ All 15 endpoints functional
✅ Response formats correct
✅ Error handling works
## E2E Workflows
✅ 8/8 user journeys complete successfully
✅ No workflow breaks
## Dependencies
✅ 0 critical vulnerabilities
⚠️ 2 moderate (non-blocking)
**Layer 4 Status**: ✅ PASS
Outputs:
Validation:
Time Estimate: 45-90 minutes
Gate 4: ✅ PASS if all integration tests pass and there are no critical dependency vulnerabilities
Purpose: Holistic quality assessment using LLM-as-judge and Agent-as-a-Judge patterns
Automation: 0-20% automated
Speed: Hours (expensive)
Confidence: Medium (requires judgment)
Process:
Spawn Independent Quality Assessor (Agent-as-a-Judge):
Key: Use a different model family if possible (to prevent self-preference bias)
const qualityAssessment = await task({
description: "Assess code quality holistically",
prompt: `Evaluate code quality in src/ and tests/.
DO NOT read implementation conversation history.
You have access to tools:
- Read files
- Execute tests
- Run linters
- Query database (if needed)
Assess 5 dimensions (score each /20):
1. CORRECTNESS (/20):
- Logic correctness
- Edge case handling
- Error handling completeness
- Security considerations
2. FUNCTIONALITY (/20):
- Meets all requirements
- User workflows work
- Performance acceptable
- No regressions
3. QUALITY (/20):
- Code maintainability
- Best practices followed
- Anti-patterns avoided
- Documentation complete
4. INTEGRATION (/20):
- Components integrate smoothly
- API contracts correct
- Data flow works
- Backward compatible
5. SECURITY (/20):
- No vulnerabilities
- Input validation
- Authentication/authorization
- Data protection
TOTAL: /100 (sum of 5 dimensions)
For each dimension, provide:
- Score (/20)
- Strengths (what's good)
- Weaknesses (what needs improvement)
- Evidence (file:line references)
- Recommendations (specific, actionable)
Write comprehensive report to: quality-assessment.md`
});
Multi-Agent Ensemble (for critical features):
3-5 Agent Voting Committee:
// Spawn 3 independent quality assessors
const judges = await Promise.all([
  task({description: "Quality Judge 1", prompt: assessmentPrompt}),
  task({description: "Quality Judge 2", prompt: assessmentPrompt}),
  task({description: "Quality Judge 3", prompt: assessmentPrompt})
]);
// Assumes each result has been parsed into numeric per-dimension scores plus a total,
// and that median() and sum() are small numeric helpers defined elsewhere.
const medianOf = (dimension) => median(judges.map((judge) => judge[dimension]));
// Aggregate scores
const scores = {
  correctness: medianOf("correctness"),
  functionality: medianOf("functionality"),
  quality: medianOf("quality"),
  integration: medianOf("integration"),
  security: medianOf("security")
};
const totalScore = sum(Object.values(scores)); // Total /100
// Check disagreement (range between the highest and lowest judge total)
const totalScores = judges.map((judge) => judge.total);
const spread = Math.max(...totalScores) - Math.min(...totalScores);
if (spread > 15) {
  // High disagreement → spawn 2 more judges (total 5)
  // Use 5-agent ensemble for final score
}
// Final score: median of 3 or 5
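The aggregation step above can be written as self-contained TypeScript. This is a sketch under the assumption that each judge's report has already been parsed into five numeric scores; the `JudgeScores` type and the `aggregate` helper are illustrative names, and the disagreement measure is the max-minus-min range of the judges' totals with the 15-point cutoff from the text.

```typescript
type Dimension = "correctness" | "functionality" | "quality" | "integration" | "security";
type JudgeScores = Record<Dimension, number>; // each dimension scored 0-20

const DIMENSIONS: Dimension[] = ["correctness", "functionality", "quality", "integration", "security"];

// Median of a non-empty list of numbers.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Sum of the five dimension scores (out of 100).
function totalOf(scores: JudgeScores): number {
  return DIMENSIONS.reduce((sum, d) => sum + scores[d], 0);
}

interface EnsembleResult {
  perDimension: JudgeScores; // per-dimension median across judges
  total: number; // sum of the per-dimension medians
  spread: number; // highest judge total minus lowest judge total
  needsMoreJudges: boolean; // true when 3 judges disagree by more than 15 points
}

function aggregate(judges: JudgeScores[]): EnsembleResult {
  const perDimension = {} as JudgeScores;
  for (const d of DIMENSIONS) {
    perDimension[d] = median(judges.map((j) => j[d]));
  }
  const totals = judges.map(totalOf);
  const spread = Math.max(...totals) - Math.min(...totals);
  return {
    perDimension,
    total: totalOf(perDimension),
    spread,
    needsMoreJudges: judges.length < 5 && spread > 15,
  };
}

const panel: JudgeScores[] = [
  { correctness: 18, functionality: 19, quality: 17, integration: 18, security: 16 },
  { correctness: 16, functionality: 18, quality: 15, integration: 17, security: 14 },
  { correctness: 19, functionality: 20, quality: 18, integration: 19, security: 18 },
];
console.log(aggregate(panel));
```

Taking the median per dimension rather than the mean keeps a single outlier judge from dragging the score, which fits the bias-prevention goal of the ensemble.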
Calibration Against Rubric:
# Scoring Calibration
## Correctness: 18/20 (Excellent)
**20**: Zero errors, all edge cases handled perfectly
**18**: Minor edge case missing, otherwise excellent ✅ (achieved)
**15**: 1-2 significant edge cases missing
**10**: Some logic errors present
**0**: Major functionality broken
**Evidence**: All tests pass, edge cases covered except timezone DST edge case (minor)
## Functionality: 19/20 (Excellent)
[Similar rubric with evidence]
## Quality: 17/20 (Good)
[Similar rubric with evidence]
## Integration: 18/20 (Excellent)
[Similar rubric with evidence]
## Security: 16/20 (Good)
[Similar rubric with evidence]
**Total**: 88/100 ⚠️ (Below ≥90 gate)
Gap Analysis (if <90):
# Quality Gap Analysis
**Current Score**: 88/100
**Target**: ≥90/100
**Gap**: 2 points
## Critical Gaps (Blocking Approval)
None
## High Priority (Should Fix for ≥90)
1. **Security: Weak bcrypt rounds**
- **What**: bcrypt using 10 rounds (outdated)
- **Where**: src/auth/hash.ts:15
- **Why**: Current standard is 12-14 rounds
- **How**: Change `bcrypt.hash(password, 10)` to `bcrypt.hash(password, 12)`
- **Priority**: High
- **Impact**: +2 points → 90/100
## Medium Priority
1. **Quality: Missing JSDoc for 3 functions**
- Impact: +1 point → 91/100
**Recommendation**: Fix high priority issue to reach ≥90 threshold
**Estimated Effort**: 15 minutes
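The arithmetic behind the gap analysis can be sketched as follows: take fixes in priority order and stop once the projected score reaches the target. The `Fix` shape and `planFixes` name are hypothetical, and the point impacts are the estimates from the report, not measured values.

```typescript
// Hypothetical record for one recommended fix.
interface Fix {
  title: string;
  priority: "high" | "medium" | "low";
  points: number; // estimated score impact
}

interface FixPlan {
  gap: number; // points missing to reach the target
  selected: Fix[]; // fixes chosen, highest priority first
  projected: number; // estimated score after applying the selected fixes
}

// Select fixes by priority until the projected score reaches the target.
function planFixes(current: number, target: number, fixes: Fix[]): FixPlan {
  const order = { high: 0, medium: 1, low: 2 };
  const sorted = [...fixes].sort((a, b) => order[a.priority] - order[b.priority]);
  const selected: Fix[] = [];
  let projected = current;
  for (const fix of sorted) {
    if (projected >= target) break;
    selected.push(fix);
    projected += fix.points;
  }
  return { gap: Math.max(0, target - current), selected, projected };
}

// Values from the sample gap analysis above.
const candidateFixes: Fix[] = [
  { title: "Add JSDoc for 3 functions", priority: "medium", points: 1 },
  { title: "Increase bcrypt rounds to 12", priority: "high", points: 2 },
];
console.log(planFixes(88, 90, candidateFixes));
```

With the sample numbers, only the high-priority bcrypt fix is selected, matching the report's recommendation.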
Generate Comprehensive Quality Report:
# Layer 5: Quality Scoring Report
## Executive Summary
**Total Score**: 88/100 ⚠️ (Below ≥90 gate)
**Status**: NEEDS MINOR REVISION
## Dimension Scores
- Correctness: 18/20 ⭐⭐⭐⭐⭐
- Functionality: 19/20 ⭐⭐⭐⭐⭐
- Quality: 17/20 ⭐⭐⭐⭐
- Integration: 18/20 ⭐⭐⭐⭐⭐
- Security: 16/20 ⭐⭐⭐⭐
## Strengths
1. Comprehensive test coverage (87%)
2. All functionality working correctly
3. Clean integration with all components
4. Good error handling
## Weaknesses
1. Bcrypt rounds below current standard (security)
2. Missing documentation for helper functions (quality)
3. One timezone edge case not handled (correctness)
## Recommendations (Prioritized)
### Priority 1 (High - Needed for ≥90)
1. Increase bcrypt rounds: 10 → 12
- File: src/auth/hash.ts:15
- Effort: 5 min
- Impact: +2 points
### Priority 2 (Medium - Nice to Have)
1. Add JSDoc to helper functions
- Files: src/utils/validation.ts
- Effort: 30 min
- Impact: +1 point
2. Handle timezone DST edge case
- File: src/auth/tokens.ts:78
- Effort: 20 min
- Impact: +1 point
**Next Steps**: Apply Priority 1 fix, re-verify to reach ≥90
Outputs:
Validation:
Time Estimate: 60-120 minutes (ensemble adds 30-60 min)
Gate 5: ✅ PASS if total score ≥90/100
All 5 Gates Must Pass for production approval:
Gate 1: Rules Pass ✅
↓ (Linting, types, schema, security)
Gate 2: Tests Pass ✅
↓ (All tests, coverage ≥80%)
Gate 3: Visual OK ✅
↓ (UI validated, a11y checked)
Gate 4: Integration OK ✅
↓ (E2E works, APIs integrate)
Gate 5: Quality ≥90 ✅
↓ (LLM-as-judge score ≥90/100)
✅ PRODUCTION APPROVED
If Any Gate Fails:
Failed Gate → Gap Analysis → Apply Fixes → Re-Verify → Repeat Until Pass
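The re-verify cycle can be sketched as a loop. The source says to repeat until all gates pass; the `maxRounds` cap here is an added assumption to keep the loop bounded, and `verify` and `applyFixes` are hypothetical callbacks standing in for the real verification run and fix step.

```typescript
interface GateResult {
  gate: string;
  pass: boolean;
}

interface LoopOutcome {
  approved: boolean;
  rounds: number; // number of verification rounds executed
}

// Verify, fix whatever failed, and verify again until every gate passes
// or the (assumed) round limit is reached.
function verifyUntilPass(
  verify: () => GateResult[],
  applyFixes: (failed: GateResult[]) => void,
  maxRounds = 3,
): LoopOutcome {
  for (let round = 1; round <= maxRounds; round++) {
    const failed = verify().filter((g) => !g.pass);
    if (failed.length === 0) return { approved: true, rounds: round };
    // Skip fixing after the final round, since nothing would re-verify it.
    if (round < maxRounds) applyFixes(failed);
  }
  return { approved: false, rounds: maxRounds };
}

// Toy example: the other four dimensions sum to 72; security starts at 16.
let securityScore = 16;
const outcome = verifyUntilPass(
  () => [{ gate: "Gate 5: Quality", pass: 72 + securityScore >= 90 }],
  () => {
    securityScore = 18; // e.g. after raising bcrypt rounds
  },
);
console.log(outcome);
```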
Verification Agent Spawning:
// After implementation and testing complete
const verification = await task({
description: "Independent quality verification",
prompt: `Verify code quality independently.
DO NOT read prior conversation history.
Review:
- Code: src/**/*.ts
- Tests: tests/**/*.test.ts
- Specs: specs/requirements.md
Verify against specifications ONLY (not implementation decisions).
Use tools:
- Read files to inspect code
- Run tests to verify functionality
- Execute linters for quality checks
Score quality (0-100) with evidence.
Write report to: independent-verification.md`
});
Bias Prevention Checklist:
Validation of Independence:
## Independence Audit
**Expected Behavior**:
- ✅ Verifier finds 1-3 issues (healthy skepticism)
- ✅ Verifier references specifications
- ✅ Verifier uses tools to verify claims
**Warning Signs**:
- ⚠️ Verifier finds 0 issues (possible rubber stamp)
- ⚠️ Verifier doesn't use tools
- ⚠️ Verifier parrots implementation justifications
**If Warning**: Re-verify with stronger independence prompt
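Two of the warning signs above are easy to check mechanically, as in this rough sketch. The `VerifierReport` fields are assumptions about what a verification harness might record; treating a missing specification reference as a warning is an inference from the expected-behavior list, and detecting a verifier that parrots implementation justifications is left to human review.

```typescript
// Hypothetical summary of one verifier run.
interface VerifierReport {
  issuesFound: number;
  toolCalls: number; // how many times the verifier used tools
  citesSpecs: boolean; // whether the report references the specifications
}

// Return the warning signs present in a report; an empty list means none detected.
function auditIndependence(report: VerifierReport): string[] {
  const warnings: string[] = [];
  if (report.issuesFound === 0) warnings.push("zero issues found (possible rubber stamp)");
  if (report.toolCalls === 0) warnings.push("no tools used to verify claims");
  if (!report.citesSpecs) warnings.push("no references to specifications");
  return warnings;
}

console.log(auditIndependence({ issuesFound: 0, toolCalls: 0, citesSpecs: true }));
```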
Correctness (/20):
20 (Perfect): Zero logic errors, all edge cases handled, security perfect
18 (Excellent): 1 minor edge case missing, otherwise flawless
15 (Good): 2-3 edge cases missing, no critical errors
12 (Acceptable): Some edge cases missing, 1 minor logic issue
10 (Needs Work): Multiple edge cases missing or 1 significant logic error
5 (Poor): Major logic errors present
0 (Broken): Critical functionality broken

Functionality (/20):
20: All requirements met, exceeds expectations
18: All requirements met, well implemented
15: All requirements met, basic implementation
12: 1 requirement partially missing
10: 2+ requirements partially missing
5: Several requirements not met
0: Core functionality missing

Quality (/20):
20: Exceptional code quality, best practices exemplified
18: High quality, follows best practices
15: Good quality, minor style issues
12: Acceptable quality, several style issues
10: Below standard, needs refactoring
5: Poor quality, significant issues
0: Unmaintainable code

Integration (/20):
20: Perfect integration, all touch points verified
18: Excellent integration, minor docs needed
15: Good integration, all major points work
12: Acceptable, 1-2 integration issues
10: Integration issues present
5: Multiple integration problems
0: Does not integrate

Security (/20):
20: Passes all security scans, OWASP compliant, hardened
18: Passes scans, 1 minor non-critical issue
15: Passes, 2-3 minor issues
12: 1 medium security issue
10: Multiple medium issues
5: 1 critical issue present
0: Multiple critical vulnerabilities
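The rubric anchors (20, 18, 15, 12, 10, 5, 0) can be turned into a lookup that labels any score. This sketch assumes a score between two anchors takes the label of the anchor below it, which is consistent with the sample report (17/20 and 16/20 are both labelled Good); the labels are the ones given for the correctness rubric, and `bandFor` is a hypothetical helper name.

```typescript
// Rubric anchors from highest to lowest, with the correctness rubric's labels.
const BANDS: { min: number; label: string }[] = [
  { min: 20, label: "Perfect" },
  { min: 18, label: "Excellent" },
  { min: 15, label: "Good" },
  { min: 12, label: "Acceptable" },
  { min: 10, label: "Needs Work" },
  { min: 5, label: "Poor" },
  { min: 0, label: "Broken" },
];

// Label a dimension score (0-20) with the highest band whose anchor it reaches.
function bandFor(score: number): string {
  if (score < 0 || score > 20) {
    throw new RangeError("score must be between 0 and 20");
  }
  const band = BANDS.find((b) => score >= b.min);
  if (!band) throw new Error("no band found for score");
  return band.label;
}

console.log(bandFor(18), bandFor(17), bandFor(16));
```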
Linting:
Type Checking:
Security (SAST):
Visual Testing:
Coverage:
Budget Caps:
Optimization:
| Layer | Purpose | Automation | Time | Tools |
|-------|---------|------------|------|-------|
| 1 | Rules-based | 95% | 15-30m | Linters, types, SAST |
| 2 | Functional | 60-80% | 30-60m | Test execution, coverage |
| 3 | Visual | 30-50% | 30-90m | Screenshots, a11y |
| 4 | Integration | 20-30% | 45-90m | E2E, API tests |
| 5 | Quality Scoring | 0-20% | 60-120m | LLM-as-judge, ensemble |
Total: roughly 3-6.5 hours for complete 5-layer verification (sum of the per-layer estimates)
All 5 Must Pass:
multi-ai-verification provides comprehensive, multi-layer quality assurance with independent LLM-as-judge evaluation, ensuring production-ready code through systematic verification from automated rules to holistic quality assessment.
For rubrics, see Appendix B. For independence protocol, see Appendix A.