Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

oimiragieo/eval-harness-updater

Name: eval-harness-updater
Author: oimiragieo

.claude/skills/eval-harness-updater/SKILL.md

npx skillsauth add oimiragieo/agent-studio eval-harness-updater


Eval Harness Updater

Refresh eval harnesses so that both live and fallback modes stay actionable in unstable environments.

Focus Areas

Prompt and parser drift
Timeout/partial-stream handling
SLO and regression gates
Dual-run fallback consistency
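
As an illustration of the "SLO and regression gates" focus area, here is a minimal sketch of a gate check. The function name `check_gates`, the metric names, and the "higher is better" convention are all assumptions made for this example; the skill itself does not prescribe a gate implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    reasons: list[str]


def check_gates(
    current: dict[str, float],
    baseline: dict[str, float],
    slo_minimums: dict[str, float],
    max_regression: float = 0.02,
) -> GateResult:
    """Fail if a metric is below its SLO floor or dropped too far from baseline.

    Assumption: every metric is "higher is better".
    """
    reasons: list[str] = []
    # SLO gate: each listed metric must exist and meet its floor.
    for name, floor in slo_minimums.items():
        value = current.get(name)
        if value is None:
            reasons.append(f"{name}: missing from current run")
        elif value < floor:
            reasons.append(f"{name}: {value:.3f} below SLO floor {floor:.3f}")
    # Regression gate: compare against the stored baseline.
    for name, old in baseline.items():
        value = current.get(name)
        if value is not None and old - value > max_regression:
            reasons.append(f"{name}: regressed {old - value:.3f} from baseline")
    return GateResult(passed=not reasons, reasons=reasons)
```

A CI step could call `check_gates` with the latest run and exit non-zero when `passed` is false, printing `reasons` so the failure is actionable.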

Workflow

1. Resolve harness path.
2. Research test/eval best practices (Exa + arXiv; see Research Gate below).
3. Add RED regressions for parsing and timeout edge cases.
4. Patch minimal harness logic.
5. Validate eval outputs and CI gates.
6. Resolve companion artifact gaps (see Cross-Reference table below).
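
The "RED regressions" and "patch minimal harness logic" steps above can be sketched as follows. This is a hypothetical example, not the harness's actual parser: `parse_verdict` and the sample outputs are invented for illustration. The tests are written first and would fail (RED) against a parser that calls `json.loads` on the whole output; the tolerant parser below is one minimal patch that makes them pass.

```python
import json
from typing import Optional


def parse_verdict(raw: str) -> Optional[dict]:
    """Extract the first complete JSON object from model output.

    Returns None instead of raising when the stream was cut off mid-object,
    so the harness can mark the sample as partial rather than crash.
    """
    start = raw.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores any trailing text.
        obj, _ = json.JSONDecoder().raw_decode(raw[start:])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def test_parses_json_wrapped_in_prose():
    assert parse_verdict('Result: {"score": 4} done') == {"score": 4}


def test_truncated_stream_returns_none():
    assert parse_verdict('{"score": 4, "reason": "the ans') is None


def test_timeout_message_without_json_returns_none():
    assert parse_verdict("timeout after 30s") is None
```

The test functions follow the pytest naming convention, so they can be dropped into an existing test module and collected as-is.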

Research Gate (Exa + arXiv — BOTH MANDATORY)

Before proposing harness changes, gather current best practices:

1. Use Exa for implementation and ecosystem patterns:
   - mcp__Exa__web_search_exa({ query: 'LLM eval harness 2025 best practices' })
   - mcp__Exa__get_code_context_exa({ query: 'eval harness parser reliability timeout handling' })
2. Search arXiv for academic research on evaluation methodology (mandatory):
   - Via Exa: mcp__Exa__web_search_exa({ query: 'site:arxiv.org LLM evaluation harness 2024 2025' })
   - Direct (arXiv search page): WebFetch({ url: 'https://arxiv.org/search/?query=LLM+evaluation+harness&searchtype=all&start=0' })
3. Record decisions, constraints, and non-goals in memory learnings.

arXiv is mandatory (not fallback) when topic involves: LLM evaluation, agent evaluation, SLO gates, regression testing methodology, or parser reliability.

Cross-Reference: Creator Ecosystem

This skill is part of the Creator Ecosystem. When research uncovers gaps, trigger the appropriate companion creator:

| Gap Discovered | Required Artifact | Creator to Invoke | When |
| --- | --- | --- | --- |
| Domain knowledge needs a reusable skill | skill | Skill({ skill: 'skill-creator' }) | Gap is a full skill domain |
| Existing skill has incomplete coverage | skill update | Skill({ skill: 'skill-updater' }) | Close skill exists but incomplete |
| Capability needs a dedicated agent | agent | Skill({ skill: 'agent-creator' }) | Agent to own the capability |
| Existing agent needs capability update | agent update | Skill({ skill: 'agent-updater' }) | Close agent exists but incomplete |
| Domain needs code/project scaffolding | template | Skill({ skill: 'template-creator' }) | Reusable code patterns needed |
| Behavior needs pre/post execution guards | hook | Skill({ skill: 'hook-creator' }) | Enforcement behavior required |
| Process needs multi-phase orchestration | workflow | Skill({ skill: 'workflow-creator' }) | Multi-step coordination needed |
| Artifact needs structured I/O validation | schema | Skill({ skill: 'schema-creator' }) | JSON schema for artifact I/O |
| User interaction needs a slash command | command | Skill({ skill: 'command-creator' }) | User-facing shortcut needed |
| Repeated logic needs a reusable CLI tool | tool | Skill({ skill: 'tool-creator' }) | CLI utility needed |
| Narrow/single-artifact capability only | inline | Document within this artifact only | Too specific to generalize |

Iron Laws

ALWAYS run the Exa + arXiv research gate before updating any eval harness — updating without current external knowledge produces stale evaluation criteria.
NEVER remove existing evaluation criteria without replacing them with equivalent or better ones — reducing test coverage in an eval harness is a regression.
ALWAYS cross-reference the creator ecosystem for gaps before declaring the harness complete — missing companion artifacts (skills, agents, schemas) leave the harness unable to test new capabilities.
NEVER update eval harness in isolation from the skill/agent it evaluates — harness and artifact must stay synchronized or the harness tests the wrong behavior.
ALWAYS preserve backward compatibility in eval scoring — changing scoring semantics without migrating historical baselines makes trend analysis impossible.
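
The second law (never remove criteria without a replacement) lends itself to an automated check. The sketch below is one possible way to enforce it, assuming criteria can be identified by name and that the update declares an explicit old-to-new replacement map; both the function name and the criterion names are hypothetical.

```python
def removed_without_replacement(
    old_criteria: set[str],
    new_criteria: set[str],
    replacements: dict[str, str],
) -> set[str]:
    """Return old criteria that vanished and have no surviving replacement.

    `replacements` maps a removed criterion to the criterion that supersedes it.
    """
    missing = old_criteria - new_criteria
    return {
        name
        for name in missing
        # A missing criterion is acceptable only if its declared
        # replacement is actually present in the new harness.
        if replacements.get(name) not in new_criteria
    }


# Hypothetical harness update: exact_match is replaced, latency_p95 is dropped.
old = {"json_valid", "latency_p95", "exact_match"}
new = {"json_valid", "semantic_match"}
uncovered = removed_without_replacement(old, new, {"exact_match": "semantic_match"})
```

Here `uncovered` contains only `latency_p95`, which a review step could treat as a blocking regression in coverage.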

Anti-Patterns

| Anti-Pattern | Why It Fails | Correct Approach |
| --- | --- | --- |
| Updating eval harness without research gate | Criteria based on outdated knowledge; misses recent evaluation methodology advances | Always run Exa + arXiv research before updating any eval criteria |
| Removing test cases to simplify the harness | Silently reduces coverage; regressions pass undetected | Only remove test cases when the behavior they tested has been deliberately removed |
| Harness and artifact in separate PRs | Harness tests wrong behavior the moment artifact changes; immediate test drift | Always update harness and artifact in the same commit |
| Changing scoring scale mid-project | Historical baselines become incomparable; trend analysis breaks | Define scoring scale once; create a migration if it must change |
| Declaring harness complete without companion check | Missing skills or schemas leave evaluation gaps | Always run companion artifact check before marking harness update complete |
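
For the "changing scoring scale mid-project" row, a migration might look like the sketch below. It assumes a move from a 1-5 scale to a 0-1 scale and a simple linear mapping; the scale labels, the `Score` type, and the choice of a linear mapping are all illustrative assumptions, and a real migration should use whatever mapping preserves the meaning of the historical baselines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    value: float
    scale: str  # hypothetical scale label: "likert_1_5" or "unit_0_1"


def migrate_to_unit(score: Score) -> Score:
    """Map a 1-5 score linearly onto 0-1; leave unit-scale scores unchanged."""
    if score.scale == "unit_0_1":
        return score
    if score.scale == "likert_1_5":
        return Score(value=(score.value - 1.0) / 4.0, scale="unit_0_1")
    # Refuse to guess: an unlabeled score cannot be compared safely.
    raise ValueError(f"unknown scale: {score.scale}")


# Historical baselines recorded on the old scale are converted once,
# so they stay comparable with new runs.
baselines = [Score(1.0, "likert_1_5"), Score(3.0, "likert_1_5"), Score(5.0, "likert_1_5")]
migrated = [migrate_to_unit(s) for s in baselines]
```

Storing the scale label next to each value is the design choice that makes the migration safe to re-run: scores already on the new scale pass through unchanged.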

Memory Protocol (MANDATORY)

Before starting: Read .claude/context/memory/learnings.md

After completing:

New evaluation pattern → .claude/context/memory/learnings.md
Evaluation gap found → .claude/context/memory/issues.md
Scoring decision made → .claude/context/memory/decisions.md

ASSUME INTERRUPTION: If it's not in memory, it didn't happen.
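
The routing above (pattern, gap, or scoring decision to its own file) can be sketched as a small helper. This is only an illustration that appends to plain files; the skill's frontmatter lists a MemoryRecord tool, whose interface is not shown on this page, so the `record` function and the `kind` labels here are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path

# File names are taken from the protocol above; the keys are invented labels.
MEMORY_FILES = {
    "pattern": "learnings.md",
    "gap": "issues.md",
    "scoring": "decisions.md",
}


def record(memory_dir: Path, kind: str, note: str) -> Path:
    """Append a timestamped bullet to the memory file for this kind of note."""
    target = memory_dir / MEMORY_FILES[kind]
    target.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    # Append mode keeps earlier entries, so an interrupted session loses nothing.
    with target.open("a", encoding="utf-8") as fh:
        fh.write(f"- [{stamp}] {note}\n")
    return target
```

For example, `record(Path(".claude/context/memory"), "scoring", "Kept the existing scoring scale")` would append one line to decisions.md.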

oimiragieo/eval-harness-updater

.claude/skills/eval-harness-updater/SKILL.md

Refresh evaluation harnesses with live/fallback parser reliability, SLO gates, and regression checks.

23 stars

testing

Updated Apr 7, 2026


npx skillsauth add oimiragieo/agent-studio eval-harness-updater

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | Percentage |
| --- | --- | --- | --- |
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | | 50% |

Last scanned: Apr 7, 2026, 8:18 PM (10.7s, 10 files scanned)

SKILL.md

name: eval-harness-updater
description: Refresh evaluation harnesses with live/fallback parser reliability, SLO gates, and regression checks.
version: 1.1.1
model: sonnet
invoked_by: both
user_invocable: true
tools: [Read, Write, Edit, Glob, Grep, Bash, Skill, MemoryRecord, WebSearch, WebFetch]
args: --harness <path-or-name> [--trigger reflection|evolve|manual]
error_handling: graceful
streaming: supported
verified: true
lastVerifiedAt: 2026-02-28



Install manually

# Clone the repo
git clone https://github.com/oimiragieo/agent-studio.git

# Copy into Claude Code skills folder (global); create the folder first if it does not exist
mkdir -p ~/.claude/skills
cp -r agent-studio/.claude/skills/eval-harness-updater ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

oimiragieo/agent-studio

23 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT