
aum08desai/eval-and-ablation

Name: eval-and-ablation
Author: aum08desai

skills/research/eval-and-ablation/SKILL.md





Plan and interpret model evaluations and ablations for post-training research. Use when comparing checkpoints, designing ablations, selecting benchmarks, or summarizing what changed after training.

6 stars

testing

Updated Apr 3, 2026

Install

npx skillsauth add aum08desai/hermes-research-agent eval-and-ablation

This single command installs the skill globally. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners in report

- Trivy (container and dependency vulnerability scanner): Clean, 95%
- Semgrep (static code analysis for vulnerabilities): Clean, 95%
- mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
- Snyk (dep) (open source security scanning): Skipped, 50%
- Socket.dev (supply chain security analysis): Skipped, 50%
- VirusTotal (multi-engine malware detection): Skipped, 50%
- CrowdStrike (advanced threat intelligence): Skipped, 50%
- OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
- OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 3, 2026, 11:22 AM · 375.8s · 2 files scanned

SKILL.md

name: eval-and-ablation
description: Plan and interpret model evaluations and ablations for post-training research. Use when comparing checkpoints, designing ablations, selecting benchmarks, or summarizing what changed after training.
version: 1.0.0
author: Hermes Research Agent
license: MIT
tags: [evaluation, ablation, benchmarking, analysis]

Eval And Ablation

Workflow

1. Decide the primary evaluation question first.
2. Compare against the strongest relevant baseline, not just the previous checkpoint.
3. Split outputs into:
   - headline metrics
   - failure cases
   - regressions
   - cost or latency tradeoffs
4. Design ablations that isolate one factor at a time:
   - data
   - reward or preference signal
   - prompt format
   - optimizer or hyperparameters
   - checkpoint selection
5. Write down what changed and what remains uncertain.
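The "isolate one factor at a time" rule from the workflow above can be sketched in Python. This is only an illustration, not part of the skill: `RunConfig`, its field values, and `one_factor_ablations` are hypothetical names chosen to mirror the listed factors. Each generated run differs from the baseline in exactly one field, so any metric change can be attributed to that field.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical factors mirroring the ablation axes in the workflow.
    data: str = "mix-v1"
    reward_signal: str = "preference-rm"
    prompt_format: str = "chat"
    optimizer: str = "adamw-lr1e-5"
    checkpoint: str = "last"


def one_factor_ablations(baseline, alternatives):
    """Return (factor, value, config) runs that each change exactly one field."""
    valid = {f.name for f in fields(baseline)}
    runs = []
    for factor, values in alternatives.items():
        if factor not in valid:
            raise KeyError(f"unknown factor: {factor}")
        for value in values:
            if value == getattr(baseline, factor):
                continue  # same as the baseline, so there is nothing to isolate
            runs.append((factor, value, replace(baseline, **{factor: value})))
    return runs


def changed_fields(a, b):
    """List the field names on which two configs disagree."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


baseline = RunConfig()
plan = one_factor_ablations(
    baseline,
    {
        "data": ["mix-v2"],
        "prompt_format": ["chat", "plain"],  # "chat" is skipped: it is the baseline value
        "checkpoint": ["best-val"],
    },
)
for factor, value, cfg in plan:
    print(f"{factor} -> {value} (changed: {changed_fields(baseline, cfg)})")
```

Keeping the baseline as a frozen dataclass and deriving every run with `replace` makes it hard to change two factors by accident. Combined changes would need a separate, explicitly labeled run.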

Minimum report

- benchmark table
- one paragraph of interpretation
- one paragraph on regressions or ambiguity
- concrete next step
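A minimal sketch of how the minimum report could be assembled, assuming higher scores are better. The function names, benchmark names, model names, and all numbers below are made-up placeholders for illustration, not results from any real run. For each benchmark, the candidate is compared against the strongest baseline rather than only the previous checkpoint, and regressions are flagged.

```python
def strongest_baseline(baselines, benchmark):
    """Name of the baseline with the highest score on one benchmark."""
    return max(baselines, key=lambda name: baselines[name][benchmark])


def benchmark_rows(candidate, baselines, tolerance=0.0):
    """Compare the candidate to the strongest baseline on each benchmark."""
    rows = []
    for bench, score in candidate.items():
        ref_name = strongest_baseline(baselines, bench)
        ref = baselines[ref_name][bench]
        delta = score - ref
        status = "regression" if delta < -tolerance else "ok"
        rows.append((bench, ref_name, ref, score, delta, status))
    return rows


def render_report(rows, interpretation, ambiguity, next_step):
    """Render the four parts of the minimum report as plain text."""
    header = f"{'benchmark':<12}{'baseline':<14}{'base':>7}{'cand':>7}{'delta':>8}  status"
    lines = [header]
    for bench, ref_name, ref, score, delta, status in rows:
        lines.append(
            f"{bench:<12}{ref_name:<14}{ref:>7.1f}{score:>7.1f}{delta:>+8.1f}  {status}"
        )
    lines += [
        "",
        "Interpretation: " + interpretation,
        "Regressions or ambiguity: " + ambiguity,
        "Next step: " + next_step,
    ]
    return "\n".join(lines)


# Placeholder scores, invented purely to exercise the code.
baselines = {
    "prev-ckpt": {"math": 58.0, "code": 47.0},
    "strong-sft": {"math": 59.5, "code": 50.0},
}
candidate = {"math": 61.0, "code": 48.5}

rows = benchmark_rows(candidate, baselines)
report = render_report(
    rows,
    interpretation="(one paragraph of interpretation goes here)",
    ambiguity="(one paragraph on regressions or ambiguity goes here)",
    next_step="(one concrete next step goes here)",
)
print(report)
```

With these placeholder numbers the candidate beats the previous checkpoint on both benchmarks but still trails the stronger baseline on one of them. That is the case the workflow's "strongest relevant baseline" rule is meant to surface.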

Related Skills

aum08desai/writing-plans
development · Verified · Trusted · Community
Use when you have a spec or requirements for a multi-step task. Creates comprehensive implementation plans with bite-sized tasks, exact file paths, and complete code examples.
6 · SKILL.md · Updated Apr 3, 2026

aum08desai/test-driven-development
development · Verified · Trusted · Community
Use when implementing any feature or bugfix, before writing implementation code. Enforces RED-GREEN-REFACTOR cycle with test-first approach.
6 · SKILL.md · Updated Apr 3, 2026

aum08desai/systematic-debugging
development · Verified · Trusted · Community
Use when encountering any bug, test failure, or unexpected behavior. 4-phase root cause investigation: NO fixes without understanding the problem first.
6 · SKILL.md · Updated Apr 3, 2026

aum08desai/subagent-driven-development
development · Verified · Trusted · Community
Use when executing implementation plans with independent tasks. Dispatches fresh delegate_task per task with two-stage review (spec compliance then code quality).
6 · SKILL.md · Updated Apr 3, 2026


Install manually

# Clone the repo
git clone https://github.com/aum08desai/hermes-research-agent.git

# Make sure the global Claude Code skills folder exists, then copy the skill into it
mkdir -p ~/.claude/skills
cp -r hermes-research-agent/skills/research/eval-and-ablation ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository

aum08desai/hermes-research-agent

6 stars

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT