aufrank/eval-harness-kit

Name: eval-harness-kit
Author: aufrank

skills/eval-harness-kit/SKILL.md

npx skillsauth add aufrank/agent-skills eval-harness-kit

Eval Harness Kit

Overview

Create eval manifests, run tasks through an agent or command harness, and grade outputs with deterministic checks and optional LLM rubrics. The harness writes trajectories, metrics, and summaries to disk for repeatable analysis.

Quick start

Copy templates/eval.manifest.json and edit tasks.
Run: python <CODEX_HOME>/skills/eval-harness-kit/scripts/run_eval.py --manifest <path> --run-id <id>
Inspect outputs in eval_runs/<run-id>/ and the summary JSON.

Replace <CODEX_HOME> with your installed skill root (for example, ~/.codex or C:\Users\you\.codex).
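The run step above can also be driven from Python. This is a minimal sketch that only assembles the documented command line (the script path plus the --manifest and --run-id flags); the helper names, the example skill root, and the run id are illustrative assumptions, not part of the skill.

```python
import subprocess
import sys
from pathlib import Path


def build_eval_command(skill_root: Path, manifest: Path, run_id: str) -> list[str]:
    """Assemble the quick-start command as an argument list (no shell quoting needed)."""
    script = skill_root / "skills" / "eval-harness-kit" / "scripts" / "run_eval.py"
    return [sys.executable, str(script), "--manifest", str(manifest), "--run-id", run_id]


def run_eval(skill_root: Path, manifest: Path, run_id: str) -> int:
    """Run the harness and return its exit code (hypothetical wrapper, not shipped with the skill)."""
    return subprocess.run(build_eval_command(skill_root, manifest, run_id)).returncode


# Example call (assumed paths; adjust to your installation):
# run_eval(Path.home() / ".codex", Path("eval.manifest.json"), "demo-001")
```

Passing the arguments as a list avoids shell-quoting problems with Windows paths such as C:\Users\you\.codex.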

Single-turn vs agentic

Single-turn: run_cmd writes a response file; graders check the output.
Agentic: run_cmd invokes your agent harness; graders check output plus optional transcript files.
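To make "graders check the output" concrete, here is a rough sketch of deterministic graders applied to a response file. The grader dictionary shape ("type", "expected", "pattern") is an assumption for illustration; the actual manifest schema is defined by templates/eval.manifest.json and may differ.

```python
import json
import re
from pathlib import Path


def grade_exact(output: str, expected: str) -> bool:
    """Pass when the output equals the expected text, ignoring surrounding whitespace."""
    return output.strip() == expected.strip()


def grade_regex(output: str, pattern: str) -> bool:
    """Pass when the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def grade_json(output: str, expected: dict) -> bool:
    """Pass when the output is a JSON object containing every expected key/value pair."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(key in data and data[key] == value for key, value in expected.items())


def grade_file(path: Path, grader: dict) -> bool:
    """Dispatch on a hypothetical grader spec and grade the response file at `path`."""
    output = path.read_text(encoding="utf-8")
    kind = grader["type"]
    if kind == "exact":
        return grade_exact(output, grader["expected"])
    if kind == "regex":
        return grade_regex(output, grader["pattern"])
    if kind == "json":
        return grade_json(output, grader["expected"])
    raise ValueError(f"unknown grader type: {kind}")
```

The same functions apply to agentic runs; a transcript file is just another path handed to grade_file.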

LLM rubric graders (optional)

Use type: "llm_rubric" to call an external judge.
Provide llm_judge_cmd in the manifest or judge_cmd per task.
The judge must print JSON: {"passed": true|false, "score": 0.0-1.0, "details": "..."}.
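The JSON contract above can be illustrated with a stand-in judge. This sketch uses a simple phrase-matching rule in place of a real LLM call, and it assumes the judge receives the response text as a string; how the harness actually passes input to the judge command is not specified here, so treat the function names and the 0.5 pass threshold as hypothetical.

```python
import json


def judge(response: str, required_phrases: list[str]) -> dict:
    """Stand-in for an LLM judge: score by the fraction of required phrases present."""
    lowered = response.lower()
    hits = [phrase for phrase in required_phrases if phrase.lower() in lowered]
    score = len(hits) / len(required_phrases) if required_phrases else 0.0
    return {
        "passed": score >= 0.5,  # assumed threshold, for illustration only
        "score": round(score, 2),
        "details": f"matched {len(hits)} of {len(required_phrases)} phrases",
    }


def validate_verdict(raw: str) -> dict:
    """Check that judge output matches {"passed": bool, "score": 0.0-1.0, "details": str}."""
    verdict = json.loads(raw)
    if not isinstance(verdict, dict):
        raise ValueError("verdict must be a JSON object")
    if not isinstance(verdict.get("passed"), bool):
        raise ValueError("'passed' must be true or false")
    score = verdict.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        raise ValueError("'score' must be a number between 0.0 and 1.0")
    if not isinstance(verdict.get("details"), str):
        raise ValueError("'details' must be a string")
    return verdict


# A judge command would end by printing the verdict as a single JSON object:
# print(json.dumps(judge(response_text, ["expected phrase"])))
```

Validating the verdict before trusting it keeps a malformed judge response from being silently counted as a pass.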

Core Guidance

Decide up front whether a suite measures capability or guards against regression; keep regression suites near a 100% pass rate.
Prefer deterministic graders (exact/regex/json) and add LLM rubrics only when needed.
Keep each trial isolated; write outputs and transcripts to the run directory.
Log metrics for every trial: latency, exit code, stdout/stderr sizes, output size.
Use files as the memory boundary; do not paste large outputs into chat.
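The isolation and metrics guidance above could look roughly like this. It is a sketch, not the logic of run_eval.py: the metric key names, the response.txt file name, and the per-trial directory layout are assumptions made for illustration.

```python
import json
import subprocess
import time
from pathlib import Path


def run_trial(cmd: list[str], trial_dir: Path, output_name: str = "response.txt",
              timeout: float = 60.0) -> dict:
    """Run one trial in its own directory and return the per-trial metrics."""
    trial_dir.mkdir(parents=True, exist_ok=True)
    start = time.perf_counter()
    # cwd isolates relative file writes; a hung command raises subprocess.TimeoutExpired.
    proc = subprocess.run(cmd, cwd=trial_dir, capture_output=True, timeout=timeout)
    latency = time.perf_counter() - start

    # Keep raw streams on disk so only their sizes need to travel in the metrics.
    (trial_dir / "stdout.txt").write_bytes(proc.stdout)
    (trial_dir / "stderr.txt").write_bytes(proc.stderr)

    output_path = trial_dir / output_name
    return {
        "latency_s": round(latency, 3),
        "exit_code": proc.returncode,
        "stdout_bytes": len(proc.stdout),
        "stderr_bytes": len(proc.stderr),
        "output_bytes": output_path.stat().st_size if output_path.exists() else 0,
    }


def append_jsonl(path: Path, record: dict) -> None:
    """Append one record as a single JSON line."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
```

Writing stdout and stderr to files, and logging only their sizes, follows the "files as the memory boundary" rule: large outputs stay on disk instead of in the conversation.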

Trust / Permissions

Always: Read local files; write run artifacts under eval_runs/.
Ask: Before using any networked grader (LLM rubric), running commands that mutate state, or running tools outside the repo.
Never: Exfiltrate credentials, or run destructive commands without an explicit user request.

Resources

scripts/run_eval.py: Execute evals from a manifest; writes JSONL results and summaries.
scripts/grade_response.py: Grade a single output against expected data.
scripts/compare_runs.py: Compare two results files and flag regressions.
templates/eval.manifest.json: Example manifest with single-turn and agentic tasks.
references/eval-roadmap.md: Guidance for building and maintaining eval suites.

Validation

Run the example manifest; confirm eval_runs/<run-id>/summary.json exists.
Use compare_runs.py to compare two runs and verify regression detection.
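The regression check in the second step can be sketched as follows. This is not the logic of compare_runs.py; the JSONL field names task_id and passed are hypothetical, and a task counts as a regression here only when it passed in the baseline and explicitly failed in the candidate run.

```python
import json
from pathlib import Path


def load_results(path: Path) -> dict[str, bool]:
    """Read a JSONL results file into {task_id: passed} (field names are assumed)."""
    results: dict[str, bool] = {}
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                record = json.loads(line)
                results[record["task_id"]] = bool(record["passed"])
    return results


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return tasks that passed in the baseline run but failed in the candidate run."""
    return sorted(
        task for task, ok in baseline.items()
        if ok and candidate.get(task) is False
    )


def summary_exists(run_id: str, root: Path = Path("eval_runs")) -> bool:
    """Confirm that eval_runs/<run-id>/summary.json was written."""
    return (root / run_id / "summary.json").is_file()
```

Tasks missing from the candidate run are deliberately not reported as regressions in this sketch; a stricter suite might flag them separately.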

Build and run deterministic evaluation suites for agent workflows (single-turn or agentic). Use when you need reproducible eval runs with manifests, graders, metrics, and JSONL logs for capability or regression tracking.

development

Updated Apr 3, 2026

Install this skill globally with one skillsauth command:

npx skillsauth add aufrank/agent-skills eval-harness-kit

Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Status: Clean (95%).
Semgrep: Static code analysis for vulnerabilities. Status: Clean (95%).
mcp-scan (Snyk): Model Context Protocol security validation. Status: Clean (95%).
Snyk (dep): Open source security scanning. Status: Skipped (50%).
Socket.dev: Supply chain security analysis. Status: Skipped (50%).
VirusTotal: Multi-engine malware detection. Status: Skipped (50%).
CrowdStrike: Advanced threat intelligence. Status: Skipped (50%).
OSV-Scanner: Open Source Vulnerability database check. Status: Skipped (50%).
OWASP Dep-Check. Status: Skipped (50%).

Last scanned: Apr 3, 2026, 11:54 AM; 4.8s; 12 files scanned.

Related Skills

aufrank/running-dag-pipelines (tools)

Build and execute modular DAG workflows for long-context processing using slice/map/reduce/recurse/compact/filter operators. Use for one-shot batch jobs, standalone map-reduce pipelines, or when the context-dag plugin is not installed. Trigger when input exceeds the model's context window, when reproducible logged pipelines are needed, or when multi-level recursive processing is required. If context-dag is installed, the plugin's bundled dag_runner.py provides the same capability with persistent artifact storage.

SKILL.md, updated May 7, 2026

aufrank/austin-frank-voice (documentation)

Write in Austin Frank's voice and style. Use this skill whenever generating text that should sound like Austin: strategy docs, charters, proposals, business cases, vision documents, staffing requests, stakeholder updates, cover letters, mission statements, org design documents, or any professional prose where the user wants Austin's distinctive voice. Also use when the user asks to review, edit, or improve a draft for voice consistency, or when they reference "my style", "my voice", "write like me", or "Austin's style".

SKILL.md, updated May 7, 2026

aufrank/working-with-notion-programmatically (tools)

Use mcpc to interact with the Notion MCP server: connect sessions, search workspace content, fetch pages/databases, and run helper scripts for common Notion actions.

SKILL.md, updated Apr 3, 2026

aufrank/workflow-or-agent-decider (tools)

Decide between a scripted workflow and an autonomous agent harness, then scaffold the chosen path. Use when scoping new agentic systems or tool integrations.

SKILL.md, updated Apr 3, 2026

Install manually

# Clone the repo
git clone https://github.com/aufrank/agent-skills.git

# Create the Claude Code skills folder (global) if it does not exist, then copy the skill into it
mkdir -p ~/.claude/skills
cp -r agent-skills/skills/eval-harness-kit ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: aufrank/agent-skills

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT