Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

baekenough/harness-eval

Name: harness-eval
Author: baekenough

.claude/skills/harness-eval/SKILL.md

npx skillsauth add baekenough/oh-my-customcode harness-eval

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Harness Eval — Structured SE Task Benchmark

Purpose

Evaluate agent quality using 15 structured software engineering task definitions with quantitative scoring. Based on research from revfactory/claude-code-harness which demonstrated 60% improvement (49.5 → 79.3 points) through structured pre-configuration.

Usage

/omcustom:harness-eval                    # Run all 15 benchmarks
/omcustom:harness-eval --preset quick     # Run top 5 high-impact benchmarks
/omcustom:harness-eval --task api-design  # Run specific task benchmark

Quality Dimensions

| Dimension | Weight | Description | |-----------|--------|-------------| | Test Coverage | 30% | Unit test count, edge case coverage, assertion quality | | Architecture Design | 25% | Separation of concerns, dependency management, scalability | | Error Handling | 25% | Input validation, error propagation, recovery strategies | | Extensibility | 20% | Plugin points, configuration flexibility, API surface |

15 SE Task Benchmark Suite

| # | Task | Category | Key Evaluation Criteria | |---|------|----------|------------------------| | 1 | API Design | Architecture | RESTful conventions, versioning, error responses | | 2 | Data Modeling | Architecture | Schema normalization, relationships, indexing | | 3 | Authentication Flow | Security | Token management, session handling, OWASP compliance | | 4 | Test Suite Creation | Quality | Coverage breadth, assertion quality, edge cases | | 5 | Error Handler | Reliability | Error classification, recovery, user feedback | | 6 | Logging System | Observability | Structured logging, levels, correlation IDs | | 7 | Configuration Manager | Operations | Env-based config, validation, secrets handling | | 8 | CLI Tool | UX | Argument parsing, help text, exit codes | | 9 | Database Migration | Data | Reversibility, data preservation, zero-downtime | | 10 | Cache Layer | Performance | Invalidation strategy, TTL, cache-aside pattern | | 11 | Queue Consumer | Reliability | Idempotency, retry logic, dead letter handling | | 12 | Middleware Chain | Architecture | Composability, ordering, short-circuiting | | 13 | File Processor | I/O | Streaming, error recovery, format validation | | 14 | Webhook Handler | Integration | Signature verification, retry tolerance, idempotency | | 15 | Rate Limiter | Security | Algorithm choice, distributed state, fairness |

Scoring Rubric

Each task is scored 0-100 across the 4 quality dimensions:

Score = (test_coverage × 0.30) + (architecture × 0.25) + (error_handling × 0.25) + (extensibility × 0.20)

Score Thresholds

| Score Range | Grade | Interpretation | |-------------|-------|----------------| | 80-100 | A | Production-ready, well-structured | | 60-79 | B | Functional with minor gaps | | 40-59 | C | Works but needs improvement | | 0-39 | D | Significant structural issues |

Presets

`all` (default)

Run all 15 tasks. Full evaluation ~45 minutes.

`quick`

Run top 5 high-impact tasks (1, 3, 4, 5, 12). Quick evaluation ~15 minutes.

Integration with evaluator-optimizer

This skill provides preset rubrics for the evaluator-optimizer pipeline:

/omcustom:harness-eval → loads rubric → evaluator-optimizer executes → scoring → report

The evaluator-optimizer skill's pre_negotiation phase accepts harness-eval rubric dimensions as sprint contract criteria.

Output

Results saved to .claude/outputs/sessions/{YYYY-MM-DD}/harness-eval-{HHmmss}.md with per-task scores and aggregate grade.

Attribution

Evaluation framework based on research by revfactory/claude-code-harness. Adapted for oh-my-customcode's evaluator-optimizer pipeline with permission.

baekenough/harness-eval

.claude/skills/harness-eval/SKILL.md

Structured SE task evaluation using 15 benchmark definitions from claude-code-harness research

11 stars

development

Updated Apr 3, 2026

$ install --global

skillsauth

npx skillsauth add baekenough/oh-my-customcode harness-eval

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 3, 2026, 8:17 PM6.8s1 file scanned

SKILL.md

name:: harness-eval
description:: Structured SE task evaluation using 15 benchmark definitions from claude-code-harness research
scope:: harness
user-invocable:: true
argument-hint:: [--preset all|quick] [--task task-name]
effort:: high
version:: 1.0.0

Harness Eval — Structured SE Task Benchmark

Purpose

Usage

/omcustom:harness-eval                    # Run all 15 benchmarks
/omcustom:harness-eval --preset quick     # Run top 5 high-impact benchmarks
/omcustom:harness-eval --task api-design  # Run specific task benchmark

Quality Dimensions

15 SE Task Benchmark Suite

Scoring Rubric

Each task is scored 0-100 across the 4 quality dimensions:

Score = (test_coverage × 0.30) + (architecture × 0.25) + (error_handling × 0.25) + (extensibility × 0.20)

Score Thresholds

Presets

`all` (default)

Run all 15 tasks. Full evaluation ~45 minutes.

`quick`

Run top 5 high-impact tasks (1, 3, 4, 5, 12). Quick evaluation ~15 minutes.

Integration with evaluator-optimizer

This skill provides preset rubrics for the evaluator-optimizer pipeline:

/omcustom:harness-eval → loads rubric → evaluator-optimizer executes → scoring → report

The evaluator-optimizer skill's pre_negotiation phase accepts harness-eval rubric dimensions as sprint contract criteria.

Output

Results saved to .claude/outputs/sessions/{YYYY-MM-DD}/harness-eval-{HHmmss}.md with per-task scores and aggregate grade.

Attribution

Evaluation framework based on research by revfactory/claude-code-harness. Adapted for oh-my-customcode's evaluator-optimizer pipeline with permission.

Related Skills

baekenough/wiki

development

VerifiedTrustedCommunity

Generate and maintain a persistent codebase wiki — LLM-built interlinked markdown knowledge base (Karpathy LLM Wiki pattern)

15SKILL.mdUpdated Apr 16, 2026

baekenough/wiki-rag

development

VerifiedTrustedCommunity

Use the project wiki as RAG knowledge source — search wiki pages to answer codebase questions before exploring raw files

15SKILL.mdUpdated Apr 16, 2026

baekenough/skill-extractor

tools

VerifiedTrustedCommunity

Analyze task trajectories to propose reusable SKILL.md candidates from successful patterns

15SKILL.mdUpdated Apr 16, 2026

baekenough/skill-extractor

baekenough/hada-scout

data-ai

VerifiedTrustedCommunity

hada.io RSS feed monitoring for AI agent/harness articles with automated /scout analysis

15SKILL.mdUpdated Apr 16, 2026

baekenough/hada-scout

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/baekenough/oh-my-customcode.git

# Copy into Claude Code skills folder (global)
cp -r oh-my-customcode/.claude/skills/harness-eval ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

baekenough/oh-my-customcode

11 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT

Adoption

baekenough/harness-eval

$ install --global

Security Scan Results

SKILL.md

Harness Eval — Structured SE Task Benchmark

Purpose

Usage

Quality Dimensions

15 SE Task Benchmark Suite

Scoring Rubric

Score Thresholds

Presets

all (default)

quick

Integration with evaluator-optimizer

Output

Attribution

Related Skills

baekenough/wiki

baekenough/wiki-rag

baekenough/skill-extractor

baekenough/hada-scout

baekenough/harness-eval

$ install --global

Security Scan Results

SKILL.md

Harness Eval — Structured SE Task Benchmark

Purpose

Usage

Quality Dimensions

15 SE Task Benchmark Suite

Scoring Rubric

Score Thresholds

Presets

all (default)

quick

Integration with evaluator-optimizer

Output

Attribution

Related Skills

baekenough/wiki

baekenough/wiki-rag

baekenough/skill-extractor

baekenough/hada-scout

`all` (default)

`quick`

`all` (default)

`quick`