paper2skill/paper2skill-systematic-empiricism/SKILL.md
Convert systematic empiricism papers into ranked practitioner checklists. Extracts implementation tricks, hyperparameter findings, and design choice ablations with conditions of applicability. Use this skill when extracting skills from Category 4 (Systematic Empiricism) papers — '37 PPO details'-style papers, hyperparameter studies, or ablation-heavy guides that systematize scattered knowledge.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add ADu2021/skillXiv paper2skill-systematic-empiricism
Use this extraction for papers that:
Value signal: These papers unlock immediate executable knowledge. The output is already a recipe.
Examples: "The 37 Implementation Details of PPO", "Regularization Matters in Policy Optimization", "An Empirical Study of Training End-to-End Vision-and-Language Transformers"
Skip this category if:
Start here: What problem does this paper solve for practitioners?
Pain point: [E.g., "PPO implementations vary wildly; unclear which tricks are essential vs. nice-to-have"]
Community impact: [E.g., "Reproducibility issues, wasted compute on low-impact tricks"]
Paper's claim: [E.g., "Systematic ablation identifies the 10 high-impact tricks that matter for convergence"]
Extract the quantitative impact ranking. Order by effect size (largest first).
High-impact tricks (>3% improvement):
- Trick A: +X% when [condition]
  - Complexity: [trivial/moderate/high]
  - Applicable to: [when does it help]
Medium-impact tricks (1-3% improvement):
- Trick B: +Y% when [condition]
Low-impact tricks (<1% improvement):
- Trick C: +Z% but conditional on [specific setup]
- Trick D: No clear benefit, may hurt in [specific cases]
Surprising findings:
- Conventional wisdom was wrong about: [what assumption failed]
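The ranking step above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the `Trick` class, its field names, and the decision to treat exactly 3% as medium-impact are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Trick:
    # Hypothetical record for one ablated trick; field names are illustrative.
    name: str
    effect_pct: float  # reported improvement, in percent
    condition: str = ""


def impact_tier(effect_pct: float) -> str:
    """Map an effect size to the tiers above: >3% high, 1-3% medium, <1% low."""
    if effect_pct > 3.0:
        return "high"
    if effect_pct >= 1.0:
        return "medium"
    return "low"  # includes tricks with no benefit or a negative effect


def rank_tricks(tricks):
    """Sort by effect size (largest first) and group the result by tier."""
    tiers = {"high": [], "medium": [], "low": []}
    for trick in sorted(tricks, key=lambda t: t.effect_pct, reverse=True):
        tiers[impact_tier(trick.effect_pct)].append(trick)
    return tiers
```

Calling `rank_tricks` on the tricks pulled from a paper's ablation tables yields the three lists in the order they should appear in the ranking.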
Create a ranked checklist practitioners should follow first-to-last:
Checklist (prioritized by impact and cost):
☐ Step 1: [High-impact, low-cost trick] — Essential, do this first
  Condition: Only if [dataset/model/task property]
☐ Step 2: [High-impact, moderate-cost trick]
  Condition: Only if [additional property]
☐ Skip: [Low-impact trick] — Not worth it unless [very specific case]
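One possible way to derive the checklist order is sketched below. It assumes each trick is a plain dict with hypothetical keys `name`, `effect_pct`, `complexity`, and an optional `condition`. The 1% skip threshold and the tie-breaking rule (high-impact first, then cheaper first) are choices made for this example, not prescribed by any paper.

```python
COST_ORDER = {"trivial": 0, "moderate": 1, "high": 2}


def build_checklist(tricks, skip_below_pct=1.0):
    """Return checklist lines: high-impact, low-cost tricks first; low-impact ones as Skip."""

    def priority(trick):
        tier = 0 if trick["effect_pct"] > 3.0 else 1  # high-impact before medium
        return (tier, COST_ORDER[trick["complexity"]], -trick["effect_pct"])

    keep = sorted((t for t in tricks if t["effect_pct"] >= skip_below_pct), key=priority)
    skip = [t for t in tricks if t["effect_pct"] < skip_below_pct]

    lines = []
    for step, trick in enumerate(keep, start=1):
        lines.append(
            f"☐ Step {step}: {trick['name']} "
            f"(+{trick['effect_pct']}%, {trick['complexity']} cost)"
        )
        if trick.get("condition"):
            lines.append(f"  Condition: Only if {trick['condition']}")
    for trick in skip:
        lines.append(f"☐ Skip: {trick['name']}")
    return lines
```

Printing `"\n".join(build_checklist(tricks))` gives a draft that can then be edited by hand, for example to add the "unless [very specific case]" caveat on skipped items.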
For each trick, extract when it helps and when it doesn't:
Trick X works best when:
- Model scale: [small/medium/large]
- Dataset size: [tiny/standard/large]
- Task type: [specific domains or task properties]
- Optimization regime: [early/late training, learning rate magnitude]
Trick X fails or becomes negative when:
- [Specific condition that breaks it]
- [Data regime where it hurt performance]
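The conditions of applicability can also be recorded in a form that is checkable against a practitioner's setup. The sketch below is an assumption-laden illustration: the `Applicability` class, its fields, and the `setup` dict keys are hypothetical names, and an empty set is taken to mean the paper reported no restriction on that axis.

```python
from dataclasses import dataclass, field


@dataclass
class Applicability:
    # Hypothetical container; each set lists the regimes in which the trick was shown to help.
    model_scales: set = field(default_factory=set)
    dataset_sizes: set = field(default_factory=set)
    task_types: set = field(default_factory=set)


def applies(cond, setup):
    """Return (ok, reasons) for a setup dict such as {"model_scale": "small", ...}."""
    reasons = []
    checks = [
        ("model_scale", cond.model_scales),
        ("dataset_size", cond.dataset_sizes),
        ("task_type", cond.task_types),
    ]
    for key, allowed in checks:
        # An empty set means "no restriction reported", so the check is skipped.
        if allowed and setup.get(key) not in allowed:
            reasons.append(f"{key}={setup.get(key)!r} is outside {sorted(allowed)}")
    return (not reasons, reasons)
```

The returned reasons can be copied into the "fails or becomes negative when" list so that the skill explains why a trick was excluded.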
For tricks that need code, show the minimal implementation:
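# 1-2 sentence explanation of what this does
def apply_trick_x(model, hyperparams):
    # Minimal implementation showing exactly what changes:
    # copy the incoming hyperparameters, then override only what the trick touches
    modified_config = dict(hyperparams)
    return modified_config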
Store larger implementations in scripts/ folder.
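As an illustration of how small such a snippet can be, here is a sketch of per-minibatch advantage normalization, one of the details discussed in "The 37 Implementation Details of PPO". The function name, the `eps` default, and the use of the population standard deviation are assumptions for this example. Reference implementations typically operate on tensors and may use a different standard-deviation convention.

```python
from statistics import fmean, pstdev


# Rescale advantages within a minibatch to zero mean and unit variance,
# which keeps the policy-gradient magnitude comparable across batches.
def normalize_advantages(advantages, eps=1e-8):
    mean = fmean(advantages)
    std = pstdev(advantages, mu=mean)  # population std; an assumption of this sketch
    # eps avoids division by zero when all advantages are identical
    return [(a - mean) / (std + eps) for a in advantages]
```

A snippet of this size belongs inline in the skill. Anything that needs a training loop around it is better placed in the scripts/ folder.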
Generate a SKILL.md that converts the findings into a practitioner resource:
---
name: [category-derived-name]
title: [Paper title — converted to action-oriented title]
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: [arXiv HTML link]
keywords: [domain, trick-type, impact-level, conditions-tag, community-need]
description: |
  [Outcome-focused]: Practitioners can [achieve X] by [applying Y ranked tricks] under [conditions Z].
  Trigger: When [specific implementation challenge], reference this ranked checklist.
---
## What This Skill Does
[High-impact outcome]: Using the ranked tricks from [paper] can improve [metric] by [cumulative %] while reducing [cost/complexity].
[Checklist preview]: Start with [top 3 high-impact tricks], then optionally add [medium-impact tricks] if [conditions].
## Ranked Trick Checklist
[Full checklist with conditions for each]
## When to Use
- Optimizing [specific model/task] and want to prioritize implementation effort
- Reproducing baselines with confidence that you're using high-impact changes
- Deciding between competing design choices: reference the ablation results
## When NOT to Use
- If your model/task has radically different properties (e.g., paper tested on vision, you're doing language)
- When exploring novel architectures (tricks are typically tuned for specific model families)
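If many skills are generated in a batch, the front matter of the template above could be filled in programmatically. The helper below is a hypothetical sketch: `render_front_matter` and its parameters are not part of any existing tool, the `engine` field is omitted, and values are written verbatim, so a title containing a colon would need quoting before the result is valid YAML.

```python
def render_front_matter(name, title, url, keywords, description,
                        version="0.0.2", license_id="MIT"):
    """Build the front-matter block as a string, using a block scalar for the description."""
    lines = [
        "---",
        f"name: {name}",
        f"title: {title}",
        f"version: {version}",
        f"license: {license_id}",
        f"url: {url}",
        "keywords: [" + ", ".join(keywords) + "]",
        "description: |",
    ]
    # Block-scalar content must be indented relative to the `description` key.
    lines += ["  " + line for line in description.splitlines()]
    lines.append("---")
    return "\n".join(lines)
```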
For extraction success:
Common pitfalls to avoid: