.claude/skills/eval-harness-updater/SKILL.md
Refresh evaluation harnesses with live/fallback parser reliability, SLO gates, and regression checks.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add oimiragieo/agent-studio eval-harness-updater
Refresh eval harnesses to keep live + fallback modes actionable under unstable environments.
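As a rough illustration of what "live + fallback" parsing with an SLO gate could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the names `parse_live`, `parse_fallback`, `parse_with_fallback`, and `slo_gate`, the JSON/regex output formats, and the threshold values are hypothetical and do not come from this skill.

```python
import json
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ParseResult:
    score: Optional[float]  # None means neither parser could read the output
    mode: str               # "live", "fallback", or "failed"


def parse_live(raw: str) -> float:
    """Strict path (assumed format): a JSON object with a numeric "score" field."""
    return float(json.loads(raw)["score"])


def parse_fallback(raw: str) -> float:
    """Lenient path (assumed format): first 'score: <number>' found in free text."""
    match = re.search(r"score\s*[:=]\s*(-?\d+(?:\.\d+)?)", raw, re.IGNORECASE)
    if match is None:
        raise ValueError("no score found")
    return float(match.group(1))


def parse_with_fallback(raw: str) -> ParseResult:
    """Try the strict parser first, then the lenient one, and record which one worked."""
    try:
        return ParseResult(parse_live(raw), "live")
    except (ValueError, KeyError, TypeError):
        pass  # malformed JSON, missing key, or non-numeric value
    try:
        return ParseResult(parse_fallback(raw), "fallback")
    except ValueError:
        return ParseResult(None, "failed")


def slo_gate(
    results: List[ParseResult],
    min_parse_rate: float = 0.95,     # hypothetical threshold
    max_fallback_rate: float = 0.20,  # hypothetical threshold
) -> Tuple[bool, Dict[str, float]]:
    """Pass only if enough outputs parsed and not too many needed the fallback."""
    total = len(results)
    if total == 0:
        return False, {"parse_rate": 0.0, "fallback_rate": 0.0}
    parsed = sum(r.mode != "failed" for r in results)
    fallback = sum(r.mode == "fallback" for r in results)
    report = {"parse_rate": parsed / total, "fallback_rate": fallback / total}
    passed = (
        report["parse_rate"] >= min_parse_rate
        and report["fallback_rate"] <= max_fallback_rate
    )
    return passed, report
```

Tracking the mode alongside the score is the point of the sketch: a run that only passes because the fallback parser rescued most outputs is still a signal that the live path needs attention.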
Before proposing harness changes, gather current best practices:
mcp__Exa__web_search_exa({ query: 'LLM eval harness 2025 best practices' })
mcp__Exa__get_code_context_exa({ query: 'eval harness parser reliability timeout handling' })
mcp__Exa__web_search_exa({ query: 'site:arxiv.org LLM evaluation harness 2024 2025' })
WebFetch({ url: 'https://arxiv.org/search/?query=LLM+evaluation+harness&searchtype=all&start=0' })

arXiv is mandatory (not fallback) when the topic involves: LLM evaluation, agent evaluation, SLO gates, regression testing methodology, or parser reliability.
This skill is part of the Creator Ecosystem. When research uncovers gaps, trigger the appropriate companion creator:
| Gap Discovered | Required Artifact | Creator to Invoke | When |
| ---------------------------------------- | ----------------- | -------------------------------------- | --------------------------------- |
| Domain knowledge needs a reusable skill | skill | Skill({ skill: 'skill-creator' }) | Gap is a full skill domain |
| Existing skill has incomplete coverage | skill update | Skill({ skill: 'skill-updater' }) | Close skill exists but incomplete |
| Capability needs a dedicated agent | agent | Skill({ skill: 'agent-creator' }) | Agent to own the capability |
| Existing agent needs capability update | agent update | Skill({ skill: 'agent-updater' }) | Close agent exists but incomplete |
| Domain needs code/project scaffolding | template | Skill({ skill: 'template-creator' }) | Reusable code patterns needed |
| Behavior needs pre/post execution guards | hook | Skill({ skill: 'hook-creator' }) | Enforcement behavior required |
| Process needs multi-phase orchestration | workflow | Skill({ skill: 'workflow-creator' }) | Multi-step coordination needed |
| Artifact needs structured I/O validation | schema | Skill({ skill: 'schema-creator' }) | JSON schema for artifact I/O |
| User interaction needs a slash command | command | Skill({ skill: 'command-creator' }) | User-facing shortcut needed |
| Repeated logic needs a reusable CLI tool | tool | Skill({ skill: 'tool-creator' }) | CLI utility needed |
| Narrow/single-artifact capability only | inline | Document within this artifact only | Too specific to generalize |
| Anti-Pattern                                       | Why It Fails                                                                        | Correct Approach                                                                   |
| -------------------------------------------------- | ----------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
| Updating eval harness without research gate        | Criteria based on outdated knowledge; misses recent evaluation methodology advances | Always run Exa + arXiv research before updating any eval criteria                  |
| Removing test cases to simplify the harness        | Silently reduces coverage; regressions pass undetected                              | Only remove test cases when the behavior they tested has been deliberately removed |
| Harness and artifact in separate PRs               | Harness tests wrong behavior the moment artifact changes; immediate test drift      | Always update harness and artifact in the same commit                              |
| Changing scoring scale mid-project                 | Historical baselines become incomparable; trend analysis breaks                     | Define scoring scale once; create a migration if it must change                    |
| Declaring harness complete without companion check | Missing skills or schemas leave evaluation gaps                                     | Always run companion artifact check before marking harness update complete         |
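The "removing test cases" and "regressions pass undetected" rows above can be guarded against mechanically. The following is a minimal sketch, assuming each harness run can be summarized as a mapping from test case ID to score on a fixed scale. The function name `check_regressions`, the case IDs, and the `tolerance` default are hypothetical.

```python
from typing import Dict, List


def check_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,  # hypothetical allowance for run-to-run noise
) -> List[str]:
    """Compare a new run against a baseline and list anything that got worse.

    Flags two things: cases that vanished from the harness, and cases whose
    score fell by more than `tolerance`. New cases in `current` are allowed.
    """
    problems = []
    for case_id, old_score in sorted(baseline.items()):
        if case_id not in current:
            problems.append(f"{case_id}: removed from harness")
        elif current[case_id] < old_score - tolerance:
            problems.append(
                f"{case_id}: score dropped {old_score:.2f} -> {current[case_id]:.2f}"
            )
    return problems


if __name__ == "__main__":
    baseline = {"parse-json": 0.90, "parse-timeout": 0.80, "parse-empty": 0.70}
    current = {"parse-json": 0.90, "parse-timeout": 0.60, "parse-unicode": 1.00}
    for line in check_regressions(baseline, current):
        print(line)
```

A removed case shows up as an explicit finding, so dropping it requires a deliberate decision instead of passing silently. The comparison only makes sense while the scoring scale stays fixed, which is why a scale change needs a migration of the baseline.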
Before starting:
Read .claude/context/memory/learnings.md
After completing:
- .claude/context/memory/learnings.md
- .claude/context/memory/issues.md
- .claude/context/memory/decisions.md

ASSUME INTERRUPTION: If it's not in memory, it didn't happen.
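One way to make the "if it's not in memory, it didn't happen" rule cheap to follow is a small append helper. This is only a sketch: the three file names come from the paths above, but the `record` function, the `kind` labels, and the dated bullet format are assumptions, not a format this skill prescribes.

```python
from datetime import date
from pathlib import Path
from typing import Optional, Union

MEMORY_DIR = Path(".claude/context/memory")

# Hypothetical labels mapped onto the three memory files named in this skill.
MEMORY_FILES = {
    "learning": "learnings.md",
    "issue": "issues.md",
    "decision": "decisions.md",
}


def record(
    kind: str,
    note: str,
    memory_dir: Union[str, Path] = MEMORY_DIR,
    today: Optional[date] = None,
) -> Path:
    """Append one dated bullet to the matching memory file and return its path."""
    if kind not in MEMORY_FILES:
        raise ValueError(f"unknown memory kind: {kind!r}")
    path = Path(memory_dir) / MEMORY_FILES[kind]
    path.parent.mkdir(parents=True, exist_ok=True)
    stamp = (today or date.today()).isoformat()
    # Append mode: existing entries are never rewritten.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"- {stamp}: {note}\n")
    return path
```

Writing one entry right after each finding, instead of batching at the end, means an interrupted session still leaves a usable record.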