Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

bereniketech/eval-harness

Name: eval-harness
Author: bereniketech

skills/core/eval-harness/SKILL.md

npx skillsauth add bereniketech/claude_kit eval-harness

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Eval Harness

Define expected behavior before writing code. Treat evals as the unit tests of AI-assisted development.

1. Eval-Before-Code Philosophy

Write eval definitions before any implementation. Evals force precise thinking about success criteria and create a measurable definition of done. Store eval definitions as version-controlled artifacts alongside the code they govern.

Rule: Never start implementation on a new feature or agent change without a written eval definition file first.

2. Capability Evals

Use capability evals to verify Claude can accomplish something new. Define the task, enumerate success criteria as checkboxes, and describe the expected output in observable terms.

[CAPABILITY EVAL: feature-name]
Task: Description of what the agent must accomplish
Success Criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
  - [ ] Criterion 3
Expected Output: Description of expected result

Rule: Every success criterion must be independently verifiable — no compound criteria in a single checkbox.

3. Regression Evals

Use regression evals to ensure existing functionality remains intact after a change. Reference a baseline commit or checkpoint. Record pass/fail for every test in the set.

[REGRESSION EVAL: feature-name]
Baseline: <git SHA or checkpoint name>
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
Result: X/Y passed (previously Y/Y)

Rule: A regression eval that drops below its prior score is a blocking finding — do not ship.

4. pass@k and pass^k Metrics

Use pass@k ("at least one success in k attempts") to measure practical reliability. Use pass^k ("all k attempts succeed") to measure stability on critical paths. Track both metrics over time to detect reliability drift.

pass@1: first-attempt success rate
pass@3: success within 3 attempts — target ≥ 90% for capability evals
pass^3: all 3 consecutive runs pass — require 100% for release-critical regression evals

Rule: Never promote a feature to production with pass@3 < 0.90 on its capability evals.

5. Grader Types

Choose the grader type that provides the most determinism for each criterion.

Code grader: deterministic shell assertions (grep -q, npm test, npm run build)
Model grader: LLM-as-judge rubric scoring 1–5 with explicit reasoning — use for open-ended output quality
Human grader: flag for manual review with a stated risk level — required for security and financial logic

Rule: Default to code graders. Use model graders only when deterministic checks are impossible. Always use human graders for security-relevant changes.

6. Eval Workflow

Follow four phases for every feature or agent change.

Define (before coding): write the eval definition file at .claude/evals/<feature>.md
Implement: write code targeting the defined criteria
Evaluate: run each criterion, record PASS/FAIL and attempt count
Report: produce a structured summary with pass@k scores and a ship/block decision

EVAL REPORT: feature-xyz
Capability: 3/3 passed — pass@1: 67%, pass@3: 100%
Regression: 3/3 passed — pass^3: 100%
Status: READY FOR REVIEW

Rule: An eval report is a required artifact for every PR that modifies agent behavior or prompts.

7. Eval Storage Layout

Store all eval artifacts under .claude/evals/ so they are versioned with the code.

.claude/
  evals/
    <feature>.md       # eval definition
    <feature>.log      # run history
    baseline.json      # regression baselines
docs/releases/<ver>/
    eval-summary.md    # release snapshot

Rule: Never delete eval logs — they are the longitudinal record of agent reliability.

8. Eval Anti-Patterns

Do not overfit prompts to known eval examples. Do not measure only happy-path outputs. Do not ignore latency and cost drift while optimizing pass rates. Do not allow flaky graders in release gates — a grader that produces inconsistent results on the same input must be fixed or replaced before it gates any release.

Rule: If a grader itself is flaky, the eval suite is invalid — fix the grader before running any evaluation.

bereniketech/eval-harness

skills/core/eval-harness/SKILL.md

Apply eval-driven development (EDD) by defining pass/fail criteria before implementation and measuring reliability with pass@k metrics.

development

Updated Apr 18, 2026

$ install --global

skillsauth

npx skillsauth add bereniketech/claude_kit eval-harness

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 18, 2026, 3:47 AM56.3s1 file scanned

SKILL.md

name:: eval-harness
description:: Apply eval-driven development (EDD) by defining pass/fail criteria before implementation and measuring reliability with pass@k metrics.

Eval Harness

Define expected behavior before writing code. Treat evals as the unit tests of AI-assisted development.

1. Eval-Before-Code Philosophy

Rule: Never start implementation on a new feature or agent change without a written eval definition file first.

2. Capability Evals

Use capability evals to verify Claude can accomplish something new. Define the task, enumerate success criteria as checkboxes, and describe the expected output in observable terms.

[CAPABILITY EVAL: feature-name]
Task: Description of what the agent must accomplish
Success Criteria:
  - [ ] Criterion 1
  - [ ] Criterion 2
  - [ ] Criterion 3
Expected Output: Description of expected result

Rule: Every success criterion must be independently verifiable — no compound criteria in a single checkbox.

3. Regression Evals

Use regression evals to ensure existing functionality remains intact after a change. Reference a baseline commit or checkpoint. Record pass/fail for every test in the set.

[REGRESSION EVAL: feature-name]
Baseline: <git SHA or checkpoint name>
Tests:
  - existing-test-1: PASS/FAIL
  - existing-test-2: PASS/FAIL
Result: X/Y passed (previously Y/Y)

Rule: A regression eval that drops below its prior score is a blocking finding — do not ship.

4. pass@k and pass^k Metrics

pass@1: first-attempt success rate
pass@3: success within 3 attempts — target ≥ 90% for capability evals
pass^3: all 3 consecutive runs pass — require 100% for release-critical regression evals

Rule: Never promote a feature to production with pass@3 < 0.90 on its capability evals.

5. Grader Types

Choose the grader type that provides the most determinism for each criterion.

Code grader: deterministic shell assertions (grep -q, npm test, npm run build)
Model grader: LLM-as-judge rubric scoring 1–5 with explicit reasoning — use for open-ended output quality
Human grader: flag for manual review with a stated risk level — required for security and financial logic

Rule: Default to code graders. Use model graders only when deterministic checks are impossible. Always use human graders for security-relevant changes.

6. Eval Workflow

Follow four phases for every feature or agent change.

Define (before coding): write the eval definition file at .claude/evals/<feature>.md
Implement: write code targeting the defined criteria
Evaluate: run each criterion, record PASS/FAIL and attempt count
Report: produce a structured summary with pass@k scores and a ship/block decision

EVAL REPORT: feature-xyz
Capability: 3/3 passed — pass@1: 67%, pass@3: 100%
Regression: 3/3 passed — pass^3: 100%
Status: READY FOR REVIEW

Rule: An eval report is a required artifact for every PR that modifies agent behavior or prompts.

7. Eval Storage Layout

Store all eval artifacts under .claude/evals/ so they are versioned with the code.

.claude/
  evals/
    <feature>.md       # eval definition
    <feature>.log      # run history
    baseline.json      # regression baselines
docs/releases/<ver>/
    eval-summary.md    # release snapshot

Rule: Never delete eval logs — they are the longitudinal record of agent reliability.

8. Eval Anti-Patterns

Rule: If a grader itself is flaky, the eval suite is invalid — fix the grader before running any evaluation.

Related Skills

bereniketech/anti-reversing-techniques

testing

VerifiedTrustedCommunity

AUTHORIZED USE ONLY: This skill contains dual-use security techniques. Before proceeding with any bypass or analysis: > 1.

SKILL.mdUpdated May 3, 2026

bereniketech/anti-reversing-techniques

bereniketech/active-directory-attacks

testing

Community

Provide comprehensive techniques for attacking Microsoft Active Directory environments. Covers reconnaissance, credential harvesting, Kerberos attacks, lateral movement, privilege escalation, and domain dominance for red team operations and penetration testing.

SKILL.mdUpdated May 3, 2026

bereniketech/active-directory-attacks

Security Scans

mcp-scan — Pending Scan

Semgrep — Pending Scan

Trivy — Pending Scan

OWASP — Pending Scan

VirusTotal — Pending Scan

bereniketech/zeroize-audit

development

VerifiedTrustedCommunity

Detects missing zeroization of sensitive data in source code and identifies zeroization removed by compiler optimizations, with assembly-level analysis, and control-flow verification. Use for auditing C/C++/Rust code handling secrets, keys, passwords, or other sensitive data.

SKILL.mdUpdated May 3, 2026

bereniketech/zeroize-audit

bereniketech/wcag-audit-patterns

development

VerifiedTrustedCommunity

Comprehensive guide to auditing web content against WCAG 2.2 guidelines with actionable remediation strategies.

SKILL.mdUpdated May 3, 2026

bereniketech/wcag-audit-patterns

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/bereniketech/claude_kit.git

# Copy into Claude Code skills folder (global)
cp -r claude_kit/skills/core/eval-harness ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

bereniketech/claude_kit

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT