skills/eval-harness-kit/SKILL.md
Build and run deterministic evaluation suites for agent workflows (single-turn or agentic). Use when you need reproducible eval runs with manifests, graders, metrics, and JSONL logs for capability or regression tracking.
npx skillsauth add aufrank/agent-skills eval-harness-kit
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Create eval manifests, run tasks through an agent or command harness, and grade outputs with deterministic checks and optional LLM rubrics. The harness writes trajectories, metrics, and summaries to disk for repeatable analysis.
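To make "deterministic checks" concrete, here is a minimal sketch of one such check. It is not taken from the skill's own graders: the function name `grade_contains` and the substring-matching rule are assumptions for illustration. The result reuses the `passed` / `score` / `details` shape that this skill specifies for LLM judge output, purely as a convenient convention.

```python
from __future__ import annotations


def grade_contains(output: str, expected_substrings: list[str]) -> dict:
    """Hypothetical deterministic grader: pass only if every expected
    substring appears in the output. Same input always gives the same result."""
    missing = [s for s in expected_substrings if s not in output]
    found = len(expected_substrings) - len(missing)
    # Score is the fraction of expected substrings found; an empty
    # expectation list is treated as trivially satisfied.
    score = found / len(expected_substrings) if expected_substrings else 1.0
    details = "all expected substrings found" if not missing else f"missing: {missing}"
    return {"passed": not missing, "score": score, "details": details}
```

Because the check involves no randomness and no model call, running it twice over the same response file gives identical results, which is what makes run-to-run comparison meaningful.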
1. Copy templates/eval.manifest.json and edit tasks.
2. Run: python <CODEX_HOME>/skills/eval-harness-kit/scripts/run_eval.py --manifest <path> --run-id <id>
3. Review eval_runs/<run-id>/ and the summary JSON.
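The steps above can be scripted. The following is a sketch only: the helper names (`build_run_command`, `summary_path`, `run_eval`) are hypothetical, and it assumes `eval_runs/` is created relative to the current working directory, which the skill does not state explicitly. The script path and the `--manifest` and `--run-id` flags are taken from the command shown above.

```python
import subprocess
import sys
from pathlib import Path


def build_run_command(codex_home, manifest, run_id):
    """Build the argv list for run_eval.py under the installed skill root."""
    script = Path(codex_home) / "skills" / "eval-harness-kit" / "scripts" / "run_eval.py"
    return [sys.executable, str(script), "--manifest", str(manifest), "--run-id", run_id]


def summary_path(run_id, output_root="eval_runs"):
    """Location of the run summary (assumes output_root is relative to the cwd)."""
    return Path(output_root) / run_id / "summary.json"


def run_eval(codex_home, manifest, run_id):
    """Run the harness, then confirm that the summary file was written."""
    subprocess.run(build_run_command(codex_home, manifest, run_id), check=True)
    summary = summary_path(run_id)
    if not summary.exists():
        raise FileNotFoundError(f"expected summary at {summary}")
    return summary
```

Passing the command as a list rather than a shell string avoids quoting problems with paths such as `C:\Users\you\.codex`.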
Replace <CODEX_HOME> with your installed skill root (for example, ~/.codex or C:\Users\you\.codex).

Task modes:
- Single-turn: run_cmd writes a response file; graders check the output.
- Agentic: run_cmd invokes your agent harness; graders check output plus optional transcript files.

LLM rubric grading:
- Use type: "llm_rubric" to call an external judge.
- Set llm_judge_cmd in the manifest or judge_cmd per task.
- The judge returns {"passed": true|false, "score": 0.0-1.0, "details": "..."}.

Run outputs are written under eval_runs/.

Files:
- scripts/run_eval.py: Execute evals from a manifest; writes JSONL results and summaries.
- scripts/grade_response.py: Grade a single output against expected data.
- scripts/compare_runs.py: Compare two results files and flag regressions.
- templates/eval.manifest.json: Example manifest with single-turn and agentic tasks.
- references/eval-roadmap.md: Guidance for building and maintaining eval suites.

Validation:
- Confirm that eval_runs/<run-id>/summary.json exists.
- Use compare_runs.py to compare two runs and verify regression detection.

tools
Build and execute modular DAG workflows for long-context processing using slice/map/reduce/recurse/compact/filter operators. Use for one-shot batch jobs, standalone map-reduce pipelines, or when the context-dag plugin is not installed. Trigger when input exceeds the model's context window, when reproducible logged pipelines are needed, or when multi-level recursive processing is required. If context-dag is installed, the plugin's bundled dag_runner.py provides the same capability with persistent artifact storage.
documentation
Write in Austin Frank's voice and style. Use this skill whenever generating text that should sound like Austin — strategy docs, charters, proposals, business cases, vision documents, staffing requests, stakeholder updates, cover letters, mission statements, org design documents, or any professional prose where the user wants Austin's distinctive voice. Also use when the user asks to review, edit, or improve a draft for voice consistency, or when they reference "my style", "my voice", "write like me", or "Austin's style".
tools
Use mcpc to interact with the Notion MCP server: connect sessions, search workspace content, fetch pages/databases, and run helper scripts for common Notion actions.
tools
Decide between a scripted workflow and an autonomous agent harness, then scaffold the chosen path. Use when scoping new agentic systems or tool integrations.