skills/autoresearch/SKILL.md
Use this skill when you want to run an autonomous ML research loop that iteratively modifies training code, runs timed experiments, evaluates a single metric, and keeps only improvements — repeating indefinitely until manually stopped. Activate when the user wants to autonomously discover better model architectures, hyperparameters, or training strategies overnight or over a defined session, with full cross-platform support (CUDA / Apple Silicon MPS / CPU / Windows).
Autonomously run an indefinite ML research loop: modify training code, run a timed experiment, measure one metric, keep improvements, discard regressions — repeat until manually stopped. Inspired by Karpathy's autoresearch and extended with cross-platform support (CUDA / Apple Silicon MPS / CPU / Windows) and user-configurable experiment parameters.
AutoResearch turns an AI agent into a tireless overnight researcher. Given a training codebase and a metric to minimize (or maximize), the agent continuously proposes code modifications, runs fixed-budget training runs, logs results, and builds a growing record of what works — without requiring human involvement between iterations.
The core loop:

    ┌──────────────────────────────────────────────────────────────────┐
    │                     AUTONOMOUS RESEARCH LOOP                     │
    │                                                                  │
    │  [1] Read current   →  [2] Propose       →  [3] Apply change     │
    │      codebase              modification         to train.py      │
    │        ↑                                          ↓              │
    │  [7] Keep / discard ←  [5] Evaluate      ←  [4] Commit &         │
    │      & iterate             metric               train (T min)    │
    │        ↑                     ↓                                   │
    │  [6] Log to ─────────────────┘                                   │
    │      results.tsv                                                 │
    └──────────────────────────────────────────────────────────────────┘

Key design principles (from Karpathy's autoresearch):
- The agent modifies a single file (train.py or its equivalent); data prep and evaluation are read-only.
- Every run is logged to results.tsv, a permanent, auditable record.

Before touching any code, collect all required parameters. Ask explicitly for any missing values.
| # | Parameter | Description | Default if omitted |
|---|-----------|-------------|-------------------|
| 1 | Training script path | Path to the file the agent will modify (e.g., train.py) | train.py in cwd |
| 2 | Read-only script path | Path to the file the agent must never modify (e.g., prepare.py) | prepare.py in cwd |
| 3 | Time budget per experiment | How long each training run is allowed to run | Ask — do not assume |
| 4 | Metric to optimize | Metric name and direction (minimize val_bpb, maximize val_acc, etc.) | Ask — do not assume |
| 5 | Max iterations | How many experiment cycles to run (or unlimited for indefinite) | unlimited |
| 6 | Allowed modification scope | What can be changed: architecture / hyperparameters / optimizer / all | all |
| 7 | Branch tag | Short identifier for the research branch (e.g., mar8, exp-attention) | Date-based (e.g., mar8) |
| 8 | Results file | Where to log results | results.tsv in cwd |
The original autoresearch fixed the training window at 5 minutes. This skill lets the user choose:
"How long should each training run be? (e.g., 3 minutes, 10 minutes, 30 minutes)"
Likewise, the user decides the metric and its direction, the maximum number of iterations, and the allowed modification scope.
Produce and confirm a Research Config before proceeding:
## AutoResearch Config
| Parameter | Value |
|------------------------|-------------------------------|
| Training script | train.py |
| Read-only script | prepare.py |
| Time budget / run | [USER SPECIFIED] |
| Metric | [METRIC NAME] ([min/max]) |
| Max iterations | [N or unlimited] |
| Modification scope | [architecture / hyper / all] |
| Branch | autoresearch/[TAG] |
| Results file | results.tsv |
| Platform | [detected — see Phase 1] |
Get explicit user confirmation before starting Phase 1.
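
The parameter table and the config template above could be captured in a small data structure. The following is a minimal sketch, not part of the original skill: the class name `ResearchConfig`, its field names, and the helper `default_tag` are hypothetical. The two values the table says to ask for (time budget and metric) are required, and the rest take the table's defaults.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

ALLOWED_SCOPES = {"architecture", "hyperparameters", "optimizer", "all"}


def default_tag(today: Optional[date] = None) -> str:
    """Date-based branch tag such as 'mar8' (hypothetical helper)."""
    d = today or date.today()
    return f"{d.strftime('%b').lower()}{d.day}"


@dataclass
class ResearchConfig:
    # Required: the table says to ask for these rather than assume.
    time_budget_minutes: float
    metric_name: str
    direction: str  # "min" or "max"
    # Optional: defaults taken from the parameter table.
    training_script: str = "train.py"
    read_only_script: str = "prepare.py"
    max_iterations: Optional[int] = None  # None means unlimited
    scope: str = "all"
    branch_tag: str = field(default_factory=default_tag)
    results_file: str = "results.tsv"

    def __post_init__(self) -> None:
        if self.direction not in ("min", "max"):
            raise ValueError("direction must be 'min' or 'max'")
        if self.time_budget_minutes <= 0:
            raise ValueError("time budget must be positive")
        if self.max_iterations is not None and self.max_iterations < 1:
            raise ValueError("max_iterations must be >= 1 or None")
        if self.scope not in ALLOWED_SCOPES:
            raise ValueError(f"scope must be one of {sorted(ALLOWED_SCOPES)}")

    @property
    def branch(self) -> str:
        """Research branch name in the autoresearch/[TAG] form."""
        return f"autoresearch/{self.branch_tag}"
```

Validating at construction time means a bad value surfaces during the confirmation step rather than partway through an overnight run.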
Detect the compute backend and configure the training environment accordingly.
Run this detection sequence before touching any training code:

    # Pseudo-detection logic (adapt to actual runtime)
    import torch

    if torch.cuda.is_available():
        platform = "CUDA"
        device = "cuda"
        notes = "FlashAttention-3 may be available; use torch.compile if supported"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        platform = "MPS (Apple Silicon)"
        device = "mps"
        notes = "Disable torch.compile; use SDPA instead of FlashAttention; lower batch size for Metal bounds; cast optimizer states explicitly"
    else:
        platform = "CPU"
        device = "cpu"
        notes = "Use smaller batch sizes; expect slower iteration; torch.compile may help"

Windows note: On Windows, MPS is not available. CUDA detection is the same as Linux. CPU fallback applies for systems without a supported GPU.
Apply these adjustments based on detected platform before the first run:
| Platform | Adjustments |
|----------|-------------|
| CUDA | Enable torch.compile if PyTorch >= 2.0. FlashAttention available if installed. Standard batch sizes apply. |
| MPS (Apple Silicon) | Disable torch.compile (unsupported paths). Replace FlashAttention with PyTorch native SDPA + manual sliding window causal masking. Lower device batch size (reduce by 2–4×). Explicitly cast optimizer states to float32 if Metal errors appear. |
| CPU | Leave torch.compile off by default (it is optional and may help on some setups). Use smaller batch sizes, with gradient accumulation to maintain the effective batch. Expect 5–10× slower runs; advise the user to choose shorter time budgets. |
Log the detected platform and applied adjustments in the research config.
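
The batch-size adjustments can be made concrete with a little arithmetic. This is a sketch under stated assumptions: the `ADJUSTMENTS` mapping and the `plan_batch` function are hypothetical names, the divisors are the low end of the ranges this document suggests (2× for MPS, 4× for CPU), and gradient accumulation is used on every platform to preserve the effective batch.

```python
from typing import Dict, Tuple

# Starting points keyed by the `device` string from platform detection.
# Values are assumptions drawn from this document's tables, not measured optima.
ADJUSTMENTS: Dict[str, dict] = {
    "cuda": {"torch_compile": True, "attention": "flash-or-sdpa", "batch_divisor": 1},
    "mps": {"torch_compile": False, "attention": "sdpa", "batch_divisor": 2},
    "cpu": {"torch_compile": False, "attention": "sdpa", "batch_divisor": 4},
}


def plan_batch(effective_batch: int, device: str) -> Tuple[int, int]:
    """Return (device_batch, grad_accum_steps) that preserve the effective batch."""
    divisor = ADJUSTMENTS[device]["batch_divisor"]
    device_batch = max(1, effective_batch // divisor)
    # Ceiling division: accumulate enough micro-batches to cover the full batch.
    accum_steps = -(-effective_batch // device_batch)
    return device_batch, accum_steps
```

For example, with an effective batch of 64 on CPU, `plan_batch(64, "cpu")` yields a device batch of 16 accumulated over 4 steps.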
Create the research branch from the main branch with git checkout -b autoresearch/<tag>. Then initialize results.tsv with a header row and the baseline entry:

    commit val_bpb mem_gb status description
    baseline [BASELINE_VALUE] [MEM] keep Baseline: no modifications

This loop runs indefinitely (or until the configured max iterations). Do NOT pause to ask the human between iterations. Work continuously.
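
The control flow described here (run until the iteration cap, or forever until interrupted) might be sketched as follows. The function `research_loop` and its callable parameters are hypothetical. `run_experiment` stands in for the propose, apply, train, and evaluate steps and returns `None` on a crash.

```python
import itertools
from typing import Callable, Optional


def research_loop(
    run_experiment: Callable[[int], Optional[float]],
    is_better: Callable[[float, float], bool],
    baseline: float,
    max_iterations: Optional[int] = None,
) -> float:
    """Run experiments until the cap (or until interrupted); return the best metric."""
    best = baseline
    if max_iterations is None:
        iterations = itertools.count(1)  # unlimited: ends only on manual stop
    else:
        iterations = range(1, max_iterations + 1)
    try:
        for i in iterations:
            metric = run_experiment(i)  # None signals a crashed run
            if metric is not None and is_better(metric, best):
                best = metric  # keep the change
            # Otherwise discard and continue without pausing for input.
    except KeyboardInterrupt:
        pass  # manual stop: fall through and report the best result so far
    return best
```

Passing the comparison in as `is_better` keeps the loop agnostic to whether the metric is minimized or maximized.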
At the beginning of each iteration:
- Read train.py (or equivalent) in full.
- Review results.tsv to understand what has been tried and what worked.

Propose one focused modification per iteration. Good candidates include:
- Architecture changes
- Optimizer changes
- Training dynamics

Hypothesis discipline:

- Describe each change in its commit message (e.g., exp: try ReLU² activation instead of GELU).

Run the training script and extract the metric:

    # Example: adapt to actual metric name and log format
    uv run train.py 2>&1 | tee run.log
    VAL_BPB=$(grep "val_bpb:" run.log | tail -1 | awk '{print $2}')

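
The same run-and-extract step could be done from Python, which makes a safety timeout easier to add. This is a rough sketch, not the skill's prescribed method: `parse_metric` and `run_training` are hypothetical names, and it assumes the log prints lines of the form `val_bpb: <number>`, as the shell example does. It also assumes the training script enforces its own time budget, so the timeout here is only a backstop and should be set somewhat above that budget.

```python
import re
import subprocess
from typing import List, Optional


def parse_metric(log_text: str, name: str = "val_bpb") -> Optional[float]:
    """Return the last '<name>: <number>' value in the log, or None if absent."""
    pattern = re.compile(rf"{re.escape(name)}:\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")
    matches = pattern.findall(log_text)
    return float(matches[-1]) if matches else None


def run_training(cmd: List[str], timeout_seconds: float, log_path: str = "run.log") -> str:
    """Run the training command with a hard timeout; save and return its output."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr, like `2>&1`
            text=True,
            timeout=timeout_seconds,
        )
        output = result.stdout
    except subprocess.TimeoutExpired as exc:
        output = exc.stdout or ""
        if isinstance(output, bytes):  # partial output may arrive as bytes
            output = output.decode(errors="replace")
    with open(log_path, "w", encoding="utf-8") as f:
        f.write(output)
    return output
```

A call such as `parse_metric(run_training(["uv", "run", "train.py"], timeout_seconds=360))` would then return the final metric, or `None` if the run crashed before printing one.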
Compare the extracted metric to the current best:
| Outcome | Action |
|---------|--------|
| Improved (metric better by any amount) | Keep the commit. Update current best. Mark keep in results. |
| No change or regression | Discard: git reset --hard HEAD~1. Mark discard in results. |
| Crash / error | Discard. Log the failure reason. Mark crash in results. Do not retry the same change. |
| Memory OOM | If on MPS: reduce device batch size by 2×, retry once. If still OOM: discard. If on CUDA: reduce batch or enable gradient checkpointing, retry once. |
Simplicity principle (from Karpathy's autoresearch): If two configurations achieve the same metric, prefer the one with fewer lines of code. A deletion that maintains performance is better than an addition.
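
The outcome table and the simplicity principle can be read as one decision function. The sketch below is one possible interpretation, not a rule the source states: `is_improvement` and `decide` are hypothetical names, and treating an exact tie with fewer lines of code (`lines_delta < 0`) as a keep is an assumption that reconciles the principle with the "no change" row.

```python
import math
from typing import Optional


def is_improvement(new: float, best: float, direction: str) -> bool:
    """True when `new` beats `best` for the given direction ('min' or 'max')."""
    if direction == "min":
        return new < best
    if direction == "max":
        return new > best
    raise ValueError("direction must be 'min' or 'max'")


def decide(new: Optional[float], best: float, direction: str, lines_delta: int = 0) -> str:
    """Map a run result to a results.tsv status: 'keep', 'discard', or 'crash'."""
    if new is None or math.isnan(new):
        return "crash"  # no usable metric was produced
    if is_improvement(new, best, direction):
        return "keep"
    if new == best and lines_delta < 0:
        return "keep"  # assumed tie-break: same metric with less code
    return "discard"
```

A `discard` result corresponds to the `git reset --hard HEAD~1` action in the table. The function only names the status and leaves the git operation to the caller.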
After every run (kept or discarded), append to results.tsv:
<7-char-commit> <metric-value> <mem_gb> <keep|discard|crash> <one-line description>
Example:
a3f12bc 0.9821 43.2 keep ReLU² activation: -0.0158 improvement
d9e44a1 1.0023 44.1 discard Deeper MLP: regression
c0011f2 — — crash OOM: doubled batch size without accumulation fix
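
Appending rows in this format might look like the sketch below. It uses Python's standard `csv` module with a tab delimiter. The names `format_row` and `append_result` are hypothetical, and the number formatting (four decimals for the metric, one for memory) is inferred from the example rows rather than specified by the source.

```python
import csv
import os
from typing import List, Optional

HEADER = ["commit", "val_bpb", "mem_gb", "status", "description"]
STATUSES = {"keep", "discard", "crash"}
MISSING = "\u2014"  # dash placeholder, as in the example crash row


def format_row(commit: str, metric: Optional[float], mem_gb: Optional[float],
               status: str, description: str) -> List[str]:
    """Build one results.tsv row; missing numbers become the dash placeholder."""
    if status not in STATUSES:
        raise ValueError(f"status must be one of {sorted(STATUSES)}")
    one_line = " ".join(description.split())  # collapse tabs and newlines
    return [
        commit[:7],
        MISSING if metric is None else f"{metric:.4f}",
        MISSING if mem_gb is None else f"{mem_gb:.1f}",
        status,
        one_line,
    ]


def append_result(path: str, *row_args) -> None:
    """Append a row, writing the header first if the file is new or empty."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        if new_file:
            writer.writerow(HEADER)
        writer.writerow(format_row(*row_args))
```

Opening the file in append mode means earlier rows are never rewritten, which fits the intent of a permanent record.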
Every 10 iterations (or when manually queried), emit a progress report:
## AutoResearch Progress — Iteration [N]
**Best so far:** [METRIC_VALUE] (iteration [K], commit [HASH])
**vs. Baseline:** [DELTA] ([+/-]%)
**Iterations:** [N completed] / [max or unlimited]
**Time elapsed:** [HH:MM]
**Recent results (last 5):**
| Iter | Commit | Metric | Status | Change |
|------|--------|--------|--------|--------|
| N | abc1234| 0.9821 | keep | ReLU² activation |
| N-1 | def5678| 1.0023 | discard| Deeper MLP |
| ... | | | | |
**Current direction:** [what the agent is exploring next]
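
The "Best so far" and "vs. Baseline" fields of this report could be computed along these lines. This is a sketch only: `summarize` is a hypothetical name, and it assumes the results have already been loaded as `(commit, metric, status)` tuples with `None` for a missing metric.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, Optional[float], str]  # (commit, metric, status)


def summarize(rows: List[Row], baseline: float, direction: str = "min") -> Optional[dict]:
    """Return the best kept run and its delta vs. baseline, or None if nothing was kept."""
    kept = [(commit, metric) for commit, metric, status in rows
            if status == "keep" and metric is not None]
    if not kept:
        return None
    pick = min if direction == "min" else max
    commit, best = pick(kept, key=lambda item: item[1])
    delta = best - baseline
    # Percentage change relative to the baseline; undefined for a zero baseline.
    pct = 100.0 * delta / baseline if baseline else None
    return {"commit": commit, "best": best, "delta": delta, "pct": pct}
```

Only rows marked `keep` are considered, so discarded and crashed runs never appear as the best result.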
The loop stops when the configured maximum number of iterations is reached or the user stops it manually.
When stopped, produce a Research Summary document (autoresearch-summary.md):
# AutoResearch Summary
**Research session:** autoresearch/<tag>
**Platform:** [CUDA / MPS / CPU]
**Date:** [date range]
**Total iterations:** [N]
**Time budget / run:** [T minutes]
## Results
| Baseline metric | Best metric | Improvement | Best commit |
|-----------------|-------------|-------------|-------------|
| [BASELINE] | [BEST] | [DELTA] | [HASH] |
## Top Improvements (kept commits)
| Rank | Commit | Metric | Description |
|------|--------|--------|-------------|
| 1 | abc123 | 0.9612 | [description] |
| 2 | def456 | 0.9744 | [description] |
| ... | | | |
## What Didn't Work
| Category | Description | Outcome |
|----------|-------------|---------|
| [category] | [description] | [why it failed] |
## Recommended Next Steps
- [3–5 actionable recommendations based on the session findings]
After summarizing:
| Feature | CUDA (Linux/Windows) | MPS (macOS) | CPU (any OS) |
|---------|---------------------|-------------|--------------|
| torch.compile | ✅ Enabled | ❌ Disabled | ✅ Optional |
| FlashAttention | ✅ If installed | ❌ Use SDPA | ❌ Use SDPA |
| Sliding window attn | ✅ via FA or SDPA | ✅ SDPA + manual mask | ✅ SDPA + manual mask |
| Mixed precision (bf16) | ✅ | ✅ (M2+) | ❌ Use fp32 |
| torch.compile modes | reduce-overhead | Disabled | reduce-overhead |
| Batch size guidance | Full config | Reduce 2–4× | Reduce 4–8× |
| Optimizer state casting | Not needed | fp32 explicit cast | Not needed |
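
The "SDPA + manual mask" entries refer to building a sliding-window causal mask by hand. A minimal sketch of the mask logic in plain Python follows. The function name and the convention that `window` counts the current token are assumptions. Converted to a boolean tensor, a mask like this can be passed as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`, where `True` marks positions that take part in attention.

```python
from typing import List


def sliding_window_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[i][j] is True when query i may attend to key j.

    Causal: j <= i. Sliding window: only the most recent `window` keys,
    counting the current position itself.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Assumed usage with PyTorch (not executed here):
#   mask = torch.tensor(sliding_window_causal_mask(T, W), dtype=torch.bool)
#   out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
# Leave is_causal at its default when an explicit attn_mask is supplied.
```

Every row contains at least the diagonal entry, so no query is left without a key to attend to.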
The agent may only modify the training script (train.py or equivalent). It must never modify:
- The read-only script (prepare.py or equivalent)

The agent must never install new packages mid-session (all dependencies must already be available).
AutoResearch is designed to plug directly into the research pipeline as a specialized experiment execution layer:
- lead-researcher: A hypothesis about a training approach to explore (e.g., "try hierarchical attention with reduced context window").
- experiment-design: A structured experiment plan specifying what to vary, what to hold fixed, and what metric to optimize.
- principal-scientist: A results.tsv and autoresearch-summary.md with the best configuration found and all supporting evidence.
- research-writing: A structured table of experiments suitable for the methodology and results sections of a paper.

When operating within the research team, the research config (Phase 0) is pre-populated by the lead-researcher or experiment-design teammates rather than entered manually by the user.
| User intent | Entry point |
|-------------|-------------|
| "Run autoresearch overnight on my training script" | Phase 0 (full config) → Phase 1 → Phase 2 loop |
| "I have a hypothesis — test it and keep going" | Phase 0 (pass hypothesis as first iteration direction) → loop |
| "Run this for exactly 20 iterations and report" | Phase 0 (max_iterations=20) → loop → Phase 3 |
| "Continue a previous session" | Phase 1.3 (skip branch creation, re-read existing results.tsv) → Phase 2 |
| "Analyze what happened in last night's session" | Phase 3 only (summarize existing results.tsv) |
| Phase | Artifact |
|-------|----------|
| 0 | Research Config (confirmed by user) |
| 1 | Platform detection log, baseline entry in results.tsv |
| 2 | Growing results.tsv + per-iteration progress reports |
| 3 | autoresearch-summary.md with ranked improvements and recommendations |