paper2skill/paper2skill-evaluation-infrastructure/SKILL.md
Convert dataset and benchmark papers into evaluation infrastructure skills. For datasets: extracts collection protocol, annotation design, quality control. For benchmarks: extracts task definition, metric selection, leaderboard design. Use this skill when extracting skills from Category 2 (Evaluation Infrastructure) papers — ImageNet-style dataset papers, SWE-bench-style benchmark papers, or any paper whose primary contribution is evaluation methodology.
Install this skill globally with one command: `npx skillsauth add ADu2021/skillXiv paper2skill-evaluation-infrastructure`. Works with Claude Code, Cursor, and Windsurf.
Apply this skill when you encounter arXiv papers that:
Examples: COCO dataset design, GLUE benchmark construction, SWE-bench evaluation framework, SafetyBench methodology.
Do not use this skill for:
Extract how the raw data was gathered and organized.
**Collection Source:** Where does the raw data come from?
**Collection Scale:** How much data was collected? By whom?
**Collection Process:**
- Sampling strategy (random, stratified, targeted)
- Data preparation steps
- Format standardization
**Scale Considerations:** Storage, accessibility, versioning
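When a paper reports a stratified sampling strategy, it can help to record it in the extracted skill as a small reproducible routine. The following is a minimal sketch only; the function name `stratified_sample`, the `per_stratum` parameter, and the fixed seed are illustrative assumptions, not something prescribed by any particular dataset paper.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items from each stratum defined by `key`.

    Hypothetical helper: `key` maps an item to its stratum label
    (for example a class name or a source domain).
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for label in sorted(strata):  # sorted labels give a deterministic order
        group = strata[label]
        k = min(per_stratum, len(group))  # small strata contribute all items
        sample.extend(rng.sample(group, k))
    return sample


if __name__ == "__main__":
    pool = [{"id": i, "label": "cat"} for i in range(10)]
    pool += [{"id": i, "label": "dog"} for i in range(3)]
    picked = stratified_sample(pool, key=lambda x: x["label"], per_stratum=5)
    print(len(picked))  # 5 cats + 3 dogs = 8
```

Capping each stratum rather than sampling proportionally is one design choice among several; the paper being extracted may weight strata differently.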
Document the human annotation or labeling process.
**Annotation Schema:** What exactly is being annotated?
**Annotation Protocol:**
- Step-by-step instructions for annotators
- Definitions and edge cases
- Examples (prototypical and boundary cases)
**Quality Control:**
- Inter-annotator agreement metrics
- Disagreement resolution procedure
- Crowd-worker vetting/training
**Annotation Cost:** Time and resource estimates
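Inter-annotator agreement is often reported as Cohen's kappa for two annotators. As a hedged illustration of how such a number is computed, here is a small self-contained sketch; it assumes exactly two annotators and categorical labels, and the source paper may use a different statistic (for example Fleiss' kappa or Krippendorff's alpha).

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    annotator_1 = ["y", "y", "n", "n"]
    annotator_2 = ["y", "n", "n", "n"]
    print(cohen_kappa(annotator_1, annotator_2))  # 0.5
```

In the example, observed agreement is 0.75 and chance agreement is 0.5, so kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.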
Extract data structure and usage specifications.
**Data Format:** JSON/CSV/HDF5 structure
**Splits:** Train/val/test split rationale and sizes
**Metadata:** What additional information is provided?
**Access & Licensing:** How is the dataset distributed?
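One common way to make train/val/test splits reproducible across dataset versions is to derive the split from a hash of a stable example ID. The sketch below illustrates that idea under stated assumptions: the function name `assign_split`, the default fractions, and the use of string IDs are hypothetical, and a given paper may instead split by time, by source, or by random shuffle.

```python
import hashlib


def assign_split(example_id, val_fraction=0.1, test_fraction=0.1):
    """Map a stable string ID to 'train', 'val', or 'test' deterministically."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a pseudo-uniform value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"


if __name__ == "__main__":
    counts = {"train": 0, "val": 0, "test": 0}
    for i in range(1000):
        counts[assign_split(f"example-{i}")] += 1
    print(counts)  # roughly 80/10/10, exact counts depend on the hash
```

Because the assignment depends only on the ID, adding new examples later does not move existing examples between splits.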
Extract what the benchmark actually measures.
**Task Definition:** What is the benchmark asking models to do?
**Input/Output Spec:**
- Input format and constraints
- Output format and constraints
- Assumptions about model architecture or training
**Task Motivation:** Why is this task important?
**Gap from Real World:** What simplifications are made?
Document metric selection and interpretation.
**Primary Metrics:** [Metric name, formula, range]
**Why These Metrics:** What capability does each metric isolate?
**Metric Limitations:** What does each metric NOT measure?
**Secondary Metrics:** [How they provide additional insight]
**Baseline Selection:** What baselines are included and why?
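To make the "metric name, formula, range" entry concrete, the sketch below implements two widely used text metrics, exact match and token-level F1, both with range [0, 1]. These are illustrative choices rather than metrics required by this skill, and the whitespace tokenization and lowercasing are simplifying assumptions; the benchmark being extracted may normalize text differently.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match(" Paris", "paris"))        # 1.0
    print(token_f1("the cat sat", "the cat"))    # 0.8, up to float rounding
```

The pair also illustrates a metric limitation worth recording: exact match gives no credit for a partially correct answer, while token F1 ignores word order.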
Extract what the benchmark reveals.
**What It Measures:** Core capabilities tested
**What People Think It Measures:** Common misconceptions
**Failure Modes:** What types of errors does it NOT catch?
**Correlation with Real Performance:** If studied, how well does benchmark performance predict real-world success?
**Known Saturation:** Is the benchmark being saturated? What's next?
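When a paper studies how well benchmark scores predict real-world outcomes, the result is often summarized as a rank correlation. The following is a rough sketch of Spearman's rank correlation in plain Python, assuming one benchmark score and one real-world measurement per system; the function names are hypothetical, and the paper may report a different statistic.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    benchmark_scores = [0.61, 0.74, 0.82]   # made-up illustrative values
    real_world_scores = [0.40, 0.55, 0.70]  # made-up illustrative values
    print(spearman(benchmark_scores, real_world_scores))
```

A value near 1 means the benchmark orders systems the same way the real-world measure does; it says nothing about whether absolute score gaps are meaningful.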
For competitive benchmarks, extract evaluation infrastructure.
**Leaderboard Mechanics:** How is performance tracked?
**Submission Process:** Frequency, format, evaluation turnaround
**Evaluation Server:** How are answers checked?
**Cheating Prevention:** How is overfitting to the evaluation set prevented?
**Reporting Standards:** What metadata must submissions include?
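Submission format and reporting standards can be captured as a validation step that an evaluation server might run before scoring. The sketch below is an assumption-laden illustration: the submission layout (`metadata` and `predictions` keys) and the required metadata fields are hypothetical and should be replaced with whatever the source paper actually specifies.

```python
# Hypothetical required fields; substitute the benchmark's real reporting rules.
REQUIRED_METADATA = ("model_name", "training_data", "contact")


def validate_submission(submission, expected_ids):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []

    metadata = submission.get("metadata", {})
    for field in REQUIRED_METADATA:
        if not metadata.get(field):
            errors.append(f"missing metadata field: {field}")

    predictions = submission.get("predictions", {})
    missing = set(expected_ids) - set(predictions)
    extra = set(predictions) - set(expected_ids)
    if missing:
        errors.append(f"missing predictions for {len(missing)} example(s)")
    if extra:
        errors.append(f"unexpected example ids: {len(extra)}")
    return errors


if __name__ == "__main__":
    submission = {
        "metadata": {"model_name": "baseline", "training_data": "public only"},
        "predictions": {"ex-1": "A", "ex-3": "B"},
    }
    for problem in validate_submission(submission, ["ex-1", "ex-2"]):
        print(problem)
```

Rejecting incomplete submissions before scoring keeps leaderboard entries comparable, since every entry is then evaluated on the same set of examples.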
Generate a new SKILL.md with:
Frontmatter:
---
name: [kebab-case-dataset-or-benchmark-name]
title: [Dataset/Benchmark: {Name} Evaluation Infrastructure]
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: [verified arxiv link to source paper]
keywords: [evaluation, infrastructure, metric-selection, annotation-protocol, or data-curation]
description: Build evaluation infrastructure for {domain} by replicating {benchmark/dataset} methodology. Includes {annotation protocol OR task definition}, {quality control OR metric selection}, enabling {outcome} for {use case}.
---
For Dataset Skills, include:
For Benchmark Skills, include:
Length: 150-250 lines
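The frontmatter template and the length target can be checked mechanically once a skill file is generated. The following is a minimal sketch, not part of any skillXiv tooling: `check_skill_file` is a hypothetical helper, it parses only simple `key: value` frontmatter lines rather than full YAML, and it assumes the 150-250 line target counts the whole file including frontmatter.

```python
import re

KEBAB_CASE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
REQUIRED_KEYS = (
    "name", "title", "version", "engine",
    "license", "url", "keywords", "description",
)


def check_skill_file(text, min_lines=150, max_lines=250):
    """Return a list of problems found in a generated SKILL.md string."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ["frontmatter must start with '---' on the first line"]
    try:
        end = lines.index("---", 1)  # closing delimiter of the frontmatter
    except ValueError:
        return ["frontmatter is not closed with '---'"]

    fields = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip()

    problems = []
    for key in REQUIRED_KEYS:
        if not fields.get(key):
            problems.append(f"missing frontmatter key: {key}")
    name = fields.get("name", "")
    if name and not KEBAB_CASE.match(name):
        problems.append("name is not kebab-case")
    if not min_lines <= len(lines) <= max_lines:
        problems.append(
            f"length is {len(lines)} lines, expected {min_lines}-{max_lines}"
        )
    return problems
```

An empty list from `check_skill_file(open("SKILL.md").read())` indicates only that the structural checks pass; it does not verify that the `url` actually resolves to the source paper.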