skills/acegrpo-adaptive-curriculum-group/SKILL.md
Adaptive curriculum-driven iterative optimization for autonomous ML engineering tasks. Uses Evolving Data Buffers and Learnability Potential sampling from the AceGRPO paper to structure multi-step agent workflows that avoid behavioral stagnation. Triggers: 'optimize ML pipeline iteratively', 'adaptive curriculum for code tasks', 'iterative agent optimization', 'prioritize learning tasks', 'evolving task buffer', 'curriculum-based code improvement'.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ndpvt-web/arxiv-claude-skills acegrpo-adaptive-curriculum-group
This skill enables Claude to structure autonomous, multi-step ML engineering workflows using the adaptive curriculum and group-relative optimization principles from the AceGRPO paper. Rather than attempting every sub-task with equal priority, Claude applies Learnability Potential scoring to dynamically rank candidate improvements, maintains an evolving buffer of past attempts to avoid repeating failures, and uses group-relative comparisons to select the most promising next action. The result is sustained iterative optimization that avoids the "behavioral stagnation" trap where agents repeat ineffective strategies.
The Core Problem. Standard LLM-based agents tackle ML engineering tasks by generating code, running it, observing results, and trying again. But without a principled way to select which task or sub-problem to focus on next, agents waste cycles on tasks that are too easy (no learning signal) or too hard (no progress possible). This is the behavioral stagnation problem: the agent keeps trying the same class of actions because its decision policy never updates.
AceGRPO's Solution: Two Interlocking Mechanisms. First, the Evolving Data Buffer captures every execution trace -- successful or failed -- and repurposes them as structured training signals. Failed attempts are not discarded; they become data about what doesn't work and under what conditions. The buffer evolves by continuously incorporating new traces while deprioritizing stale or uninformative ones. Second, Adaptive Sampling via Learnability Potential (LP) scores each candidate task by combining difficulty (is this within reach?) with improvability (can the agent still get better at this?). The LP function filters to tasks within one standard deviation of the agent's current difficulty frontier and ranks by improvability. This creates a curriculum that automatically shifts as the agent's capabilities change.
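The LP ranking described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes a binary band filter, a 0-1 difficulty scale, and hypothetical names (`Task`, `frontier`, `spread`) that do not come from the source.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    difficulty: float     # estimated difficulty on a 0-1 scale (assumed scale)
    improvability: float  # estimated remaining headroom for this task


def in_difficulty_band(task, frontier, spread):
    """Return 1.0 if the task lies within one `spread` of the frontier, else 0.0."""
    return 1.0 if abs(task.difficulty - frontier) <= spread else 0.0


def learnability_potential(task, frontier, spread):
    """LP = band filter * improvability, as stated in the text."""
    return in_difficulty_band(task, frontier, spread) * task.improvability


def rank_by_lp(tasks, frontier, spread):
    """Return (lp, task) pairs sorted by LP descending; ties keep input order."""
    scored = [(learnability_potential(t, frontier, spread), t) for t in tasks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Illustrative values only.
tasks = [
    Task("data_cleaning", difficulty=0.30, improvability=0.10),
    Task("hyperparameter_tuning", difficulty=0.45, improvability=0.16),
    Task("ensemble", difficulty=0.80, improvability=0.30),
]
for lp, task in rank_by_lp(tasks, frontier=0.4, spread=0.15):
    print(f"{task.name}: LP={lp:.2f}")
```

The out-of-band task scores zero no matter how much headroom it has, which is what keeps the curriculum from chasing tasks that are currently out of reach.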
Group Relative Policy Optimization (GRPO). Instead of scoring each candidate action against an absolute reward baseline, GRPO samples multiple candidate trajectories for each task and computes advantages relative to the group: A(trajectory) = R(trajectory) - mean(R(group)). Because the baseline is the group's own mean, the advantage depends only on how a candidate compares with its siblings on the same task. This removes the task-level offset that makes absolute scores incomparable across heterogeneous ML tasks (say, a classification and a regression task) and keeps the selection signal stable and informative.
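The advantage formula above is simple enough to sketch directly. The snippet below is an illustration under stated assumptions: the function names and the `plateau_eps` threshold are hypothetical, and the scores are the four candidate results from Example 1.

```python
from statistics import mean


def group_relative_advantages(scores):
    """Map each candidate name to its score minus the group mean."""
    baseline = mean(scores.values())
    return {name: score - baseline for name, score in scores.items()}


def select_best(scores, plateau_eps=1e-3):
    """Return (best_name, advantages); best_name is None on a plateau.

    `plateau_eps` is an assumed cutoff for "all advantages near zero".
    """
    advantages = group_relative_advantages(scores)
    best = max(advantages, key=advantages.get)
    if advantages[best] <= plateau_eps:
        return None, advantages
    return best, advantages


scores = {"A": 0.74, "B": 0.79, "C": 0.73, "D": 0.76}
best, advantages = select_best(scores)
for name, adv in advantages.items():
    print(f"{name}: {adv:+.3f}")
print("selected:", best)
```

By construction the advantages sum to zero within a group, so a positive value always means "better than the average sibling" rather than "good in absolute terms".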
1. Decompose the ML engineering goal into a ranked task list. Parse the user's objective (e.g., "maximize F1 on this dataset") into discrete sub-tasks: data preprocessing, feature engineering, model selection, hyperparameter tuning, ensemble construction, submission formatting. Write each as an actionable item with a measurable outcome.
2. Initialize the Evolving Data Buffer. Create a structured log (JSON or markdown) to track every attempted action, its code, the execution result (metrics, errors, runtime), and a difficulty tag (easy/medium/hard based on whether similar approaches have succeeded before). This buffer persists across iterations.
3. Score each candidate task with Learnability Potential. For each pending sub-task, estimate two values: difficulty (is the task within reach, i.e., inside the band around the agent's current difficulty frontier?) and improvability (how much can the agent still improve on it?). Compute LP = in_difficulty_band(task) * improvability(task) and rank tasks by LP descending.
4. Select the highest-LP task and generate a group of candidate solutions. For the top-ranked task, produce 3-5 distinct candidate approaches (e.g., different model architectures, different feature sets, different hyperparameter ranges). These form the "group" for relative comparison.
5. Execute all candidates and record traces. Run each candidate, capturing full execution traces: code, stdout/stderr, metrics, wall time. Append all traces to the Evolving Data Buffer regardless of success or failure.
6. Compute group-relative advantages. For each candidate's result, calculate advantage = score - mean(group_scores). The candidate with the highest positive advantage is selected. If all advantages are near zero, the task may be at a plateau -- deprioritize it.
7. Update the buffer and recalculate LP scores. With new trace data, recalculate difficulty estimates (tasks with more failures become harder) and improvability (tasks where the best score just improved have reduced improvability). Re-rank all pending tasks.
8. Iterate: pick the next highest-LP task. Repeat steps 4-7 until the user's target metric is met, a budget is exhausted, or all remaining tasks have LP below a threshold (indicating diminishing returns).
9. Consolidate results. Merge the best-performing components from across all iterations into a final solution. Report the full optimization trajectory: which tasks were attempted, in what order, and what each contributed.
10. Archive the buffer for future reuse. Save the final Evolving Data Buffer so that future optimization sessions on similar problems can bootstrap from prior experience rather than starting cold.
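One possible shape for the Evolving Data Buffer used in the workflow above is sketched below. Everything here is an assumption made for illustration: the `Trace` and `EvolvingBuffer` names, the choice of failure rate as a difficulty proxy, and "target minus best score so far" as the improvability estimate (the same arithmetic Example 1 uses). The JSON round trip corresponds to the archive step.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Trace:
    task: str
    approach: str
    status: str                    # "success" or "failure"
    score: Optional[float] = None  # None when the run produced no metric


@dataclass
class EvolvingBuffer:
    traces: List[Trace] = field(default_factory=list)

    def record(self, trace: Trace) -> None:
        # Keep every trace, failed runs included.
        self.traces.append(trace)

    def difficulty(self, task: str) -> Optional[float]:
        # Assumed proxy: share of failed attempts on this task.
        runs = [t for t in self.traces if t.task == task]
        if not runs:
            return None
        return sum(t.status == "failure" for t in runs) / len(runs)

    def improvability(self, task: str, target: float) -> float:
        # Remaining headroom: target minus the best score recorded for the task.
        scores = [t.score for t in self.traces
                  if t.task == task and t.score is not None]
        best = max(scores, default=0.0)
        return max(0.0, target - best)

    def dumps(self) -> str:
        # Serialize for archiving between sessions.
        return json.dumps([asdict(t) for t in self.traces], indent=2)

    @classmethod
    def loads(cls, text: str) -> "EvolvingBuffer":
        return cls([Trace(**row) for row in json.loads(text)])


buf = EvolvingBuffer()
buf.record(Trace("feature_engineering", "one_hot", "success", 0.74))
buf.record(Trace("feature_engineering", "target_encoding", "success", 0.79))
buf.record(Trace("feature_engineering", "pca", "failure"))
print(round(buf.difficulty("feature_engineering"), 3))
print(round(buf.improvability("feature_engineering", target=0.88), 3))
```

Recomputing difficulty and improvability from the raw traces on every iteration, rather than storing them, keeps the estimates consistent with whatever the buffer currently contains.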
Example 1: Iterative Kaggle Competition Optimization
User: "I have a tabular classification dataset (train.csv, test.csv). Help me
iteratively improve my submission score on this Kaggle competition. Current
baseline with logistic regression gets 0.72 AUC. Target: 0.88+."
Approach:
1. Decompose into sub-tasks:
- Feature engineering (missing values, encoding, interactions)
- Model selection (GBM, random forest, neural net, SVM)
- Hyperparameter tuning for best model
- Ensemble construction
- Submission formatting
2. Initialize buffer:
{
"traces": [
{"id": 1, "task": "baseline", "approach": "logistic_regression",
"score": 0.72, "difficulty": "easy", "status": "success"}
],
"mean_difficulty": 0.3, "std_difficulty": 0.2
}
3. Score LP for each sub-task:
- Feature engineering: difficulty=0.4 (in band), improvability=0.88-0.72=0.16 -> LP=0.16
- Model selection: difficulty=0.3 (in band), improvability=0.16 -> LP=0.16
- Hyperparameter tuning: difficulty=0.5 (in band), improvability=0.10 -> LP=0.10
- Ensemble: difficulty=0.7 (out of band), improvability=0.05 -> LP=0.0
4. Tie between feature engineering and model selection. Pick feature engineering
(higher expected impact on downstream tasks).
5. Generate group of 4 candidate feature sets:
- Candidate A: impute + one-hot encoding
- Candidate B: impute + target encoding + interaction features
- Candidate C: impute + PCA dimensionality reduction
- Candidate D: impute + frequency encoding + polynomial features
6. Execute all four, record scores:
A: 0.74, B: 0.79, C: 0.73, D: 0.76
Group mean: 0.755
Advantages: A=-0.015, B=+0.035, C=-0.025, D=+0.005
-> Select Candidate B (highest advantage)
7. Update buffer, recalculate LP. Feature engineering improvability drops
(0.88-0.79=0.09). Model selection now has highest LP.
8. Continue iterations: model selection -> hyperparameter tuning -> ensemble.
Output after 4 iterations:
Round 1: Feature engineering (target encoding) -> 0.79
Round 2: Model selection (LightGBM) -> 0.84
Round 3: Hyperparameter tuning (Optuna 50 trials) -> 0.87
Round 4: Ensemble (LightGBM + CatBoost blend) -> 0.89 [TARGET MET]
Example 2: Debugging a Failing ML Pipeline
User: "My training script crashes after epoch 3 with OOM errors on some runs
but not others. Help me systematically fix this."
Approach:
1. Decompose into diagnostic sub-tasks:
- Memory profiling (identify peak usage)
- Batch size analysis (which sizes trigger OOM)
- Gradient accumulation as workaround
- Model architecture memory audit
- Data loader memory leak check
2. Initialize buffer with the crash trace as first entry.
3. Score LP:
- Memory profiling: difficulty=easy, improvability=high (no data yet) -> LP=high
- Batch size: difficulty=easy, improvability=medium -> LP=medium
- Gradient accumulation: difficulty=medium, improvability=low (workaround) -> LP=low
4. Start with memory profiling. Generate 3 candidate approaches:
A: torch.cuda.memory_summary() at each epoch
B: PyTorch memory snapshot with torch.cuda.memory._record_memory_history()
C: nvidia-smi polling script every 5 seconds
5. Execute group, compare informativeness (reward = diagnostic clarity):
A: shows 12GB spike at epoch 3 in attention layer -> advantage=+0.4
B: pinpoints exact tensor allocation -> advantage=+0.5
C: shows GPU utilization but not per-tensor -> advantage=-0.9
-> Select B
6. With root cause identified (attention cache not freed between epochs),
update buffer. "Model architecture memory audit" now has highest LP
because we know exactly where to look.
7. Generate fix candidates:
A: Add torch.cuda.empty_cache() between epochs
B: Use gradient checkpointing in attention layers
C: Reduce attention head dimension
Execute -> B resolves OOM with minimal accuracy loss.
Output:
Root cause: Attention layer KV-cache retained across epochs
Fix: Gradient checkpointing (2% slower training, 40% less peak memory)
Verified: 10 consecutive runs without OOM
Example 3: Curriculum-Ordered Feature Development
User: "I need to add 6 features to my ML service. Some are harder than
others. Help me order them for maximum learning and momentum."
Approach:
1. List features with estimated difficulty:
- REST API endpoint for predictions (easy)
- Input validation with schema (easy)
- Model versioning with rollback (medium)
- A/B testing framework (hard)
- Real-time monitoring dashboard (medium)
- Auto-retraining pipeline (hard)
2. Score LP for each (improvability = business value * technical novelty):
- REST API: LP = 1.0 * 0.3 = 0.3 (easy, foundational, but low novelty)
- Input validation: LP = 1.0 * 0.4 = 0.4
- Model versioning: LP = 0.9 * 0.7 = 0.63
- A/B testing: LP = 0.0 * 0.9 = 0.0 (out of difficulty band -- too hard now)
- Monitoring: LP = 0.9 * 0.6 = 0.54
- Auto-retraining: LP = 0.0 * 0.8 = 0.0 (out of band)
3. Recommended order (LP descending): Model versioning -> Monitoring
-> Input validation -> REST API -> (reassess: A/B testing and auto-retraining now in band)
-> A/B testing -> Auto-retraining
4. After each feature, update difficulty estimates. Completing model
versioning makes A/B testing easier (shared infrastructure), so its
difficulty drops into band at reassessment.
Output: Ordered backlog with LP justifications and dependency notes.
Paper: AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering (Cai et al., 2026). Key sections: Section 3 for the Evolving Data Buffer and Learnability Potential function; Section 4 for GRPO training details and the group-relative advantage formula; Section 5 for MLE-Bench results showing 100% valid submission rate and ablations demonstrating the 18% gain from LP-based sampling over uniform selection.