.claude/skills/harness-eval/SKILL.md
Structured SE task evaluation using 15 benchmark definitions from claude-code-harness research
npx skillsauth add baekenough/oh-my-customcode harness-eval
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Evaluate agent quality using 15 structured software engineering task definitions with quantitative scoring. Based on research from revfactory/claude-code-harness which demonstrated 60% improvement (49.5 → 79.3 points) through structured pre-configuration.
/omcustom:harness-eval # Run all 15 benchmarks
/omcustom:harness-eval --preset quick # Run top 5 high-impact benchmarks
/omcustom:harness-eval --task api-design # Run specific task benchmark
| Dimension | Weight | Description |
|-----------|--------|-------------|
| Test Coverage | 30% | Unit test count, edge case coverage, assertion quality |
| Architecture Design | 25% | Separation of concerns, dependency management, scalability |
| Error Handling | 25% | Input validation, error propagation, recovery strategies |
| Extensibility | 20% | Plugin points, configuration flexibility, API surface |
| # | Task | Category | Key Evaluation Criteria |
|---|------|----------|------------------------|
| 1 | API Design | Architecture | RESTful conventions, versioning, error responses |
| 2 | Data Modeling | Architecture | Schema normalization, relationships, indexing |
| 3 | Authentication Flow | Security | Token management, session handling, OWASP compliance |
| 4 | Test Suite Creation | Quality | Coverage breadth, assertion quality, edge cases |
| 5 | Error Handler | Reliability | Error classification, recovery, user feedback |
| 6 | Logging System | Observability | Structured logging, levels, correlation IDs |
| 7 | Configuration Manager | Operations | Env-based config, validation, secrets handling |
| 8 | CLI Tool | UX | Argument parsing, help text, exit codes |
| 9 | Database Migration | Data | Reversibility, data preservation, zero-downtime |
| 10 | Cache Layer | Performance | Invalidation strategy, TTL, cache-aside pattern |
| 11 | Queue Consumer | Reliability | Idempotency, retry logic, dead letter handling |
| 12 | Middleware Chain | Architecture | Composability, ordering, short-circuiting |
| 13 | File Processor | I/O | Streaming, error recovery, format validation |
| 14 | Webhook Handler | Integration | Signature verification, retry tolerance, idempotency |
| 15 | Rate Limiter | Security | Algorithm choice, distributed state, fairness |
Each task is scored 0-100 across the 4 quality dimensions:
Score = (test_coverage × 0.30) + (architecture × 0.25) + (error_handling × 0.25) + (extensibility × 0.20)
| Score Range | Grade | Interpretation |
|-------------|-------|----------------|
| 80-100 | A | Production-ready, well-structured |
| 60-79 | B | Functional with minor gaps |
| 40-59 | C | Works but needs improvement |
| 0-39 | D | Significant structural issues |
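The weighted formula and grade bands above can be sketched in Python as follows. This is a minimal illustration, not the skill's actual implementation: the names `WEIGHTS`, `weighted_score`, and `grade` and the dictionary layout are assumptions. The source table lists integer ranges only, so this sketch also assumes that a fractional score such as 79.5 falls into the lower band.

```python
# Weights taken from the quality-dimension table; they sum to 1.0.
WEIGHTS = {
    "test_coverage": 0.30,
    "architecture": 0.25,
    "error_handling": 0.25,
    "extensibility": 0.20,
}


def weighted_score(dimensions: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0-100) into a single task score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        if name not in dimensions:
            raise ValueError(f"missing dimension: {name}")
        value = dimensions[name]
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be in 0-100, got {value}")
        total += value * weight
    return total


def grade(score: float) -> str:
    """Map a 0-100 score to a letter grade (assumed: lower bound inclusive)."""
    if score >= 80:
        return "A"
    if score >= 60:
        return "B"
    if score >= 40:
        return "C"
    return "D"


example = {
    "test_coverage": 80,
    "architecture": 70,
    "error_handling": 60,
    "extensibility": 50,
}
print(round(weighted_score(example), 1), grade(weighted_score(example)))
```

For the example input, the terms are 24 + 17.5 + 15 + 10, giving 66.5 and a grade of B.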
all (default): Run all 15 tasks. Full evaluation takes about 45 minutes.
quick: Run the top 5 high-impact tasks (1, 3, 4, 5, 12). Quick evaluation takes about 15 minutes.
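One way to represent the two presets is a lookup from preset name to task numbers, using the numbering from the task table. The names `PRESETS` and `tasks_for` are hypothetical and only illustrate the mapping; the source does not describe how the skill resolves presets internally.

```python
# Task numbers follow the 15-row task table (1 = API Design ... 15 = Rate Limiter).
PRESETS = {
    "all": list(range(1, 16)),
    "quick": [1, 3, 4, 5, 12],
}


def tasks_for(preset: str = "all") -> list[int]:
    """Return the task numbers a preset would run; 'all' is the default."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    # Return a copy so callers cannot mutate the preset definition.
    return list(PRESETS[preset])
```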
This skill provides preset rubrics for the evaluator-optimizer pipeline:
/omcustom:harness-eval → loads rubric → evaluator-optimizer executes → scoring → report
The evaluator-optimizer skill's pre_negotiation phase accepts harness-eval rubric dimensions as sprint contract criteria.
Results saved to .claude/outputs/sessions/{YYYY-MM-DD}/harness-eval-{HHmmss}.md with per-task scores and aggregate grade.
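The output path pattern can be expanded with a short Python sketch. It assumes that `{HHmmss}` means 24-hour hours, minutes, and seconds and that the caller supplies the timestamp; the function name `report_path` is hypothetical, and the source does not say whether local time or UTC is used.

```python
from datetime import datetime
from pathlib import PurePosixPath


def report_path(now: datetime) -> PurePosixPath:
    """Build the report path for a run that finished at `now`."""
    day = now.strftime("%Y-%m-%d")   # {YYYY-MM-DD} session directory
    stamp = now.strftime("%H%M%S")   # {HHmmss}, assumed 24-hour clock
    return PurePosixPath(".claude/outputs/sessions") / day / f"harness-eval-{stamp}.md"


print(report_path(datetime(2025, 1, 2, 3, 4, 5)))
```

With the sample timestamp this prints `.claude/outputs/sessions/2025-01-02/harness-eval-030405.md`.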
Evaluation framework based on research by revfactory/claude-code-harness. Adapted for oh-my-customcode's evaluator-optimizer pipeline with permission.