skills/icml-reviewer/SKILL.md
Paper reviewer that evaluates machine learning research projects following official ICML reviewer guidelines. Provides comprehensive reviews with actionable feedback across all key dimensions: claims/evidence, relation to prior work, originality, significance, clarity, and reproducibility. Also provides formative feedback on incomplete drafts, proposals, and research code repositories. MANDATORY TRIGGERS: review paper, ICML review, paper review, evaluate paper, research paper feedback, ML paper review, conference review, academic review, paper critique, NeurIPS review, ICLR review, project proposal, research proposal, paper draft, early feedback, incomplete paper, work in progress, WIP review, review repo, review codebase, research project review
Install this skill globally with one command: `npx skillsauth add sundial-org/skills icml-reviewer`. Works with Claude Code, Cursor, and Windsurf.
Enables rigorous review of ML research papers following official ICML guidelines.
Determine input type:
For complete papers, extract: title, abstract, main claims, methodology, experiments, results. Identify paper type: theoretical, methodological, algorithmic, empirical, bridge paper, or application-driven.
For code repositories, first explore: read README, scan code structure, find experiment scripts/results, identify the research question and what's implemented.
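The first pass over a repository can be sketched as a small inventory script. This is a minimal illustration only: the function name `survey_repo` and the filename heuristics (`EXPERIMENT_HINTS`, `RESULT_SUFFIXES`) are assumptions made for this example, not rules defined by the skill.

```python
from pathlib import Path

# Hypothetical heuristics chosen for illustration; adjust to the repository.
README_NAMES = {"readme", "readme.md", "readme.rst", "readme.txt"}
EXPERIMENT_HINTS = ("train", "eval", "experiment", "benchmark")
RESULT_SUFFIXES = {".csv", ".json", ".log"}


def survey_repo(root):
    """Return a coarse inventory of READMEs, experiment scripts, and result files."""
    root = Path(root)
    inventory = {"readme": [], "experiment_scripts": [], "results": []}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Skip directories and anything inside version-control metadata.
        if not path.is_file() or ".git" in rel.parts:
            continue
        name = path.name.lower()
        suffix = path.suffix.lower()
        if name in README_NAMES:
            inventory["readme"].append(rel.as_posix())
        elif suffix in {".py", ".sh"} and any(h in name for h in EXPERIMENT_HINTS):
            inventory["experiment_scripts"].append(rel.as_posix())
        elif suffix in RESULT_SUFFIXES:
            inventory["results"].append(rel.as_posix())
    return inventory
```

An inventory like this only tells you where to start reading. Identifying the research question and what is actually implemented still requires opening the files.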
This step applies to ALL input types. Grounding in reality is essential for any meaningful feedback.
Critical mindset:
Then proceed to mode-specific evaluation.
Evaluate across 7 dimensions (see references/evaluation-criteria.md). Default to skepticism—require strong evidence to score highly.
| Dimension | Key Questions (Answer with Literature Evidence) |
|-----------|---------------|
| Originality | Is this truly novel given recent work X, Y, Z? What specific aspects are incremental vs. novel? |
| Importance | Why does this problem matter? What's the real-world impact? Who will care? |
| Claims Support | Do experiments actually prove the claims? What alternative explanations exist? |
| Experimental Soundness | Are baselines from 2023+? Are comparisons fair? What's missing? |
| Clarity | Can I reproduce this from the paper? Are claims precisely stated? |
| Community Value | Will this change how people work? Or just add noise? |
| Prior Work Context | Are comparisons accurate? What recent work (last 2 years) is missing? |
Evaluation mindset:
Before writing the review, explicitly verify:
This step is not optional. Your review must reference specific findings from your literature search.
Follow the ICML review form (see references/review-template.md):
Your primary duty is to the research community—publishing weak papers dilutes the literature.
"Review the papers of others as you would wish your own to be reviewed"—with rigor, honesty, and specific feedback grounded in the literature.
Bad: "The experiments are weak"

Good: "Experiments compare only against [X] from 2019, but recent baselines [Y] (2024) and [Z] (2024) should be included."
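The kind of specific feedback shown in the "Good" example can be grounded in a simple recency check over the baselines a paper compares against. The sketch below is illustrative: `stale_baselines` is a hypothetical helper, the baseline names are placeholders, and the default cutoff of 2023 is an assumption borrowed from the "Are baselines from 2023+?" question in the dimension table.

```python
def stale_baselines(baselines, cutoff_year=2023):
    """Return names of baselines published before cutoff_year, oldest first.

    `baselines` maps a baseline name to its publication year.
    """
    old = [(year, name) for name, year in baselines.items() if year < cutoff_year]
    return [name for _, name in sorted(old)]


# Placeholder names and years; a real review would take these from the paper.
compared = {"Method X": 2019, "Method Y": 2024, "Method Z": 2021}
flagged = stale_baselines(compared)
```

A flagged baseline is a prompt for a literature search, not a verdict: an older method can still be the strongest available comparison in some areas.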
Originality may arise from: creative combinations, new domains, removing restrictive assumptions, novel datasets, new problem formulations.
But: Most claimed novelty is actually incremental. Verify against literature before accepting novelty claims.
Use this reference frame:
If you find yourself giving mostly 4s and 5s, you're likely being too generous. Re-calibrate against what the literature shows is standard.
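One way to make this self-check concrete is to measure the share of high scores across a batch of reviews. This is a minimal sketch: `generosity_check` is a hypothetical name, and reading "mostly" as "more than half" is an assumption made here, not a threshold stated by the guidelines.

```python
def generosity_check(scores, high=4, threshold=0.5):
    """Return (too_generous, share_of_high_scores) for a list of 1-5 scores.

    `threshold=0.5` encodes the assumption that "mostly" means more than half.
    """
    if not scores:
        raise ValueError("need at least one score")
    if any(s < 1 or s > 5 for s in scores):
        raise ValueError("scores must be on the 1-5 scale")
    share = sum(s >= high for s in scores) / len(scores)
    return share > threshold, share


too_generous, share = generosity_check([4, 5, 4, 3, 2])
# Here 3 of 5 scores are 4 or above, so the check signals a need to re-calibrate.
```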
For application-driven ML: methods should fit real-world constraints, non-standard datasets are acceptable if documented, and comparisons should include domain baselines.
Overall (1-5): Use the full range. Most papers should be 2-3.
Red flags that should lower scores:
Confidence (1-5): 5=Expert/certain, 4=Confident, 3=Fairly confident, 2=Uncertain, 1=Not in area
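If scores are recorded programmatically, the two scales above might be validated and rendered as follows. The labels come from the confidence scale in this document; the function name `format_scores` and the output layout are assumptions for illustration.

```python
# Labels taken from the confidence scale described above.
CONFIDENCE_LABELS = {
    5: "Expert/certain",
    4: "Confident",
    3: "Fairly confident",
    2: "Uncertain",
    1: "Not in area",
}


def format_scores(overall, confidence):
    """Validate both 1-5 scores and return a one-line summary string."""
    for name, value in (("overall", overall), ("confidence", confidence)):
        if value not in range(1, 6):
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    label = CONFIDENCE_LABELS[confidence]
    return f"Overall: {overall}/5 | Confidence: {confidence}/5 ({label})"
```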
Use this mode for incomplete drafts, research proposals, or code repositories. Focus shifts from "accept/reject evaluation" to "constructive guidance on how to make this publishable."
After completing Steps 1-2 (input analysis and prior work grounding), proceed here.
Use the Early-Stage Feedback Template (see references/review-template.md). No numerical scores—focus on constructive guidance.
For code repositories, additionally address:
- references/evaluation-criteria.md - Detailed criteria for each dimension
- references/review-template.md - Full template with examples
- references/common-issues.md - Common paper issues to identify