toolkit/packages/skills/validate/SKILL.md
Verify implementation against specifications
npx skillsauth add stevengonsalvez/agents-in-a-box validateInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You are tasked with validating that an implementation plan was correctly executed, verifying all success criteria and identifying any deviations or issues.
<!-- recall:begin -->Before validating, recall prior learnings from the global knowledge base so we don't re-learn or re-decide something already captured:
uv run "{{HOME_TOOL_DIR}}/skills/recall/scripts/recall.py" \
"<QUERY>" \
--limit 5 --format markdown
Query construction for /validate: the feature/change being validated + relevant verification keywords (e.g. "OAuth callback validation edge cases").
What to do with results:
When invoked:
Determine context:
Locate the plan:
plans/ directory for recent plansRespond appropriately:
I'll validate the implementation against the plan.
[If in existing conversation]:
I'll verify the work we just completed.
[If starting fresh]:
Please provide the plan file to validate against, or I can check for recently implemented plans.
If starting fresh or need more context:
Read the implementation plan completely
Identify what should have changed:
Spawn parallel research tasks to discover implementation:
Task 1 - Verify database changes:
Research if migration [N] was added and schema changes match plan.
Check: migration files, schema version, table structure
Return: What was implemented vs what plan specified
Task 2 - Verify code changes:
Find all modified files related to [feature].
Compare actual changes to plan specifications.
Return: File-by-file comparison of planned vs actual
Task 3 - Verify test coverage:
Check if tests were added/modified as specified.
Run test commands and capture results.
Return: Test status and any missing coverage
For each phase in the plan:
Check completion status:
Run automated verification:
Assess manual criteria:
Think deeply about edge cases:
Go beyond task completion -- verify the codebase actually delivers what was promised.
Extract must-haves from the plan:
Three-level artifact analysis for each required artifact:
| Level | Check | Question |
|-------|-------|----------|
| Exists | File/function present | Does the file exist? Is the function defined? |
| Substantive | Real code, not stub | Is this actual implementation or placeholder? |
| Wired | Connected and used | Is this imported, called, routed, rendered? |
Stub detection -- flag these patterns as NOT substantive:
return null, return {}, return undefined// TODO, // FIXME, // placeholderthrow new Error("Not implemented")pass in PythononClick={() => {}}, empty event handlersResponse.json({ message: "Not implemented" })Produce verification matrix:
## Goal-Backward Verification
| Truth | Artifact | Exists | Substantive | Wired | Status |
|-------|----------|--------|-------------|-------|--------|
| User can login | src/auth/login.ts | Y | Y | Y | VERIFIED |
| Session persists | src/auth/session.ts | Y | Y | N | ORPHANED |
| Errors display | src/components/Error.tsx | Y | N | - | STUB |
| Rate limited | src/middleware/rate.ts | N | - | - | MISSING |
Gap structuring -- for any non-VERIFIED items:
### Gaps Found
#### Gap 1: Session persistence (ORPHANED)
- **Truth**: "Session persists across page reloads"
- **Issue**: session.ts exists and is substantive but never imported in app routes
- **Fix**: Import and wire session middleware in src/app.ts
#### Gap 2: Error display (STUB)
- **Truth**: "Errors display to user"
- **Issue**: Error.tsx returns null -- placeholder component
- **Fix**: Implement error rendering with message prop
Include in validation report (Step 4) under a "Goal-Backward Verification" section
Spawn parallel Task agents for thorough review:
Task 1: "Review implementation of [Phase 1 feature]"
- Check if implementation matches plan specifications
- Verify error handling and edge cases
- Look for potential bugs or issues
- Return specific findings with file:line references
Task 2: "Verify test coverage for [feature]"
- Check if tests were added as specified
- Verify test quality and coverage
- Look for missing test cases
- Return test file locations and coverage gaps
Task 3: "Check for regressions in [related component]"
- Verify existing functionality still works
- Check for breaking changes
- Look for unintended side effects
- Return any concerning changes
Create comprehensive validation summary:
# Validation Report: [Plan Name]
**Date**: [Current date and time]
**Plan**: plans/[plan_file].md
**Validation Type**: [Automated | Manual | Comprehensive]
## Implementation Status
### Phase Completion
Phase 1: [Name] - Fully implemented
Phase 2: [Name] - Fully implemented
Phase 3: [Name] - Partially implemented (see issues)
### Files Modified
`src/auth/oauth.js` - Added as specified
`src/config/auth.config.js` - Updated correctly
`tests/auth.test.js` - Missing test cases
## Automated Verification Results
### Build & Compilation
Build passes: `npm run build` (2.3s)
TypeScript: No errors
### Tests
Unit tests: 142 passing
Integration tests: 2 failing
- Error in OAuth callback test (timeout)
- Error in token refresh test (undefined variable)
### Code Quality
Linting: Clean (ESLint)
Coverage: 78% (target was 80%)
## Code Review Findings
### Matches Plan
- OAuth2 flow implemented correctly
- Token storage follows security requirements
- Error handling comprehensive
### Deviations from Plan
- Used different library than specified (passport vs manual)
- Impact: Positive - more maintainable
- Added extra validation not in plan
- Impact: Positive - improved security
### Potential Issues
**Performance**: Token validation happens on every request
- Recommendation: Add caching layer
**Security**: Refresh tokens stored in localStorage
- Recommendation: Use httpOnly cookies
**Missing**: Rate limiting not implemented
- Required by plan Phase 2
## Manual Testing Required
The following require manual verification:
### User Interface
- [ ] OAuth login button appears correctly
- [ ] Redirect flow works smoothly
- [ ] Error messages display appropriately
### Integration Testing
- [ ] Works with Google OAuth provider
- [ ] Works with GitHub OAuth provider
- [ ] Token refresh happens seamlessly
### Performance Testing
- [ ] Login completes within 3 seconds
- [ ] Handles 100 concurrent logins
## Recommendations
### Immediate Actions
1. Fix failing integration tests
2. Add missing rate limiting
3. Increase test coverage to 80%
### Before Production
1. Move refresh tokens to secure storage
2. Add token validation caching
3. Complete manual testing checklist
### Nice to Have
1. Add OAuth provider abstraction
2. Implement token rotation
3. Add audit logging
## Summary
**Overall Status**: **Mostly Complete**
The implementation follows the plan well with some positive deviations. However, there are 2 failing tests and missing rate limiting that must be addressed before this can be considered complete.
**Next Steps**:
1. Fix the failing integration tests
2. Implement rate limiting (Phase 2 requirement)
3. Complete manual testing
4. Address security recommendations
Save to: validation/YYYY-MM-DD_HH-MM-SS_planname.md
After presenting the report, ask:
Would you like me to:
1. Help fix the failing tests?
2. Implement the missing features?
3. Run specific additional checks?
4. Generate a checklist for manual testing?
Always verify:
# JavaScript/TypeScript
npm test
npm run lint
npm run typecheck
npm run build
# Python
pytest
python -m pytest --cov
flake8
mypy
# Go
go test ./...
go vet ./...
golangci-lint run
# Rust
cargo test
cargo clippy
cargo fmt --check
# Generic
make test
make check
make lint
If validation reveals issues:
Categorize by severity:
Provide fixes when possible:
Update the plan:
Remember: Good validation catches issues before they reach production. Be constructive but thorough in identifying gaps or improvements.
documentation
Report reflect drain spend over a time window — tokens split by cached (cache_read), uncached writes (cache_creation), and io (input+output), with a $ estimate, grouped by day / outcome / model / transcript. Reads the drainer's cost log and surfaces outlier runs and cache-reuse health (the 41.5M-token failure mode = low cache reuse + high cache writes). Use to answer "what is reflection costing me" for the last day / week.
development
Show fleet status — every claude session running on the host, merged across ainb + claude-peers broker + background jobs. Use when you need to enumerate sessions before composing an action, see which sessions have a peer registered (broker-routable) vs tmux-only, check the `summary` of each session, or pipe the list into jq for filtering. Default output: text table. Pass --format json for LLM consumption.
testing
Ordered multi-step prompts to fleet targets, ack-gated between steps via JSONL assistant-turn-end detection. Use for cycles like disconnect→reconnect→verify, or any flow where step N+1 requires step N to have completed first. The skill BLOCKS until each target's transcript shows the next assistant turn finishing OR per-step timeout fires (default 300s).
development
Center control panel — enumerate every claude session that is blocked waiting on something: a user answer (AskUserQuestion fired), an API error retry, an idle assistant turn-end with no follow-up, or an explicit WAITING: marker. Returns rich JSON with signal kind + context per session. Use this when you've stepped away from the fleet and want one place to see everything that wants your attention and answer it.