# Autoresearch Agent
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add neekware/ehayeskills artifacts/bundle/skills/engineering/autoresearch-agent
> You sleep. The agent experiments. You wake up to results.

Autonomous experiment loop inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch). The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.
Not one guess — fifty measured attempts, compounding.
## Slash Commands

| Command | What it does |
| ------------ | ---------------------------------------------------------------------------------- |
| /ar:setup | Set up a new experiment interactively |
| /ar:run | Run a single experiment iteration |
| /ar:loop | Start autonomous loop with configurable interval (10m, 1h, daily, weekly, monthly) |
| /ar:status | Show dashboard and results |
| /ar:resume | Resume a paused experiment |
Recognize these patterns from the user:
If the user describes a target file + a way to measure success → this skill applies.
Run the setup script. The user decides where experiments live:
Project-level (inside repo, git-tracked, shareable with team):
python scripts/setup_experiment.py \
--domain engineering \
--name api-speed \
--target src/api/search.py \
--eval "pytest bench.py --tb=no -q" \
--metric p50_ms \
--direction lower \
--scope project
User-level (personal, in ~/.autoresearch/):
python scripts/setup_experiment.py \
--domain marketing \
--name medium-ctr \
--target content/titles.md \
--eval "python evaluate.py" \
--metric ctr_score \
--direction higher \
--evaluator llm_judge_content \
--scope user
The --scope flag determines where .autoresearch/ lives:
- project (default) → .autoresearch/ in the repo root. Experiment definitions are git-tracked. Results are gitignored.
- user → ~/.autoresearch/ in the home directory. Everything is personal.

.autoresearch/
├── config.yaml                ← Global settings
├── .gitignore                 ← Ignores results.tsv, *.log
└── {domain}/{experiment-name}/
    ├── program.md             ← Objectives, constraints, strategy
    ├── config.cfg             ← Target, eval cmd, metric, direction
    ├── results.tsv            ← Experiment log (gitignored)
    └── evaluate.py            ← Evaluation script (if --evaluator used)
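As a rough illustration of the two scopes, the lookup below resolves an experiment's directory from a scope value. The function name `experiment_dir` and its arguments are hypothetical and not part of the skill's scripts. Only the directory layout is taken from the tree above.

```python
from pathlib import Path


def experiment_dir(scope: str, domain: str, name: str, repo_root: Path) -> Path:
    """Return the directory holding one experiment's files (hypothetical helper)."""
    if scope == "project":
        # Project scope: .autoresearch/ sits in the repository root
        base = repo_root / ".autoresearch"
    elif scope == "user":
        # User scope: .autoresearch/ sits in the home directory
        base = Path.home() / ".autoresearch"
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    return base / domain / name


print(experiment_dir("project", "engineering", "api-speed", Path("/repo")))
print(experiment_dir("user", "marketing", "medium-ctr", Path("/repo")))
```

Rejecting unknown scope values up front keeps a typo from silently creating experiments in the wrong place.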
results.tsv columns: commit | metric | status | description
- commit: short git hash
- metric: float value, or "N/A" for crashes
- status: keep | discard | crash
- description: what changed or why it crashed

| Domain | Use Cases |
| ------------- | ----------------------------------------------------------- |
| engineering | Code speed, memory, bundle size, test pass rate, build time |
| marketing | Headlines, social copy, email subjects, ad copy, engagement |
| content | Article structure, SEO descriptions, readability, CTR |
| prompts | System prompts, chatbot tone, agent instructions |
| custom | Anything else with a measurable metric |
program.md Already Exists

The user may have written their own program.md. If one is found in the experiment directory, read it. It overrides the template. Only ask for what's missing.
You are the loop. The scripts handle setup and evaluation — you handle the creative work.
1. Read .autoresearch/{domain}/{name}/config.cfg to get:
   - target: the file you edit
   - evaluate_cmd: the command that measures your changes
   - metric: the metric name to look for in eval output
   - metric_direction: whether "lower" or "higher" is better
   - time_budget_minutes: max time per evaluation
2. Read program.md for strategy, constraints, and what you can/cannot change.
3. Read results.tsv for experiment history (columns: commit, metric, status, description).
4. Check out the experiment branch: git checkout autoresearch/{domain}/{name}
5. Edit the target, then commit: git add {target} && git commit -m "experiment: {description}"
6. Evaluate: python scripts/run_experiment.py --experiment {domain}/{name} --single
7. Keep the commit if the metric improved; otherwise discard it (git reset --hard HEAD~1).

# Single iteration (the agent calls this repeatedly)
python scripts/run_experiment.py --experiment engineering/api-speed --single
# Dry run (test setup before starting)
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
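The keep-or-discard step of the loop comes down to comparing the new metric against the best one so far, in the direction given by metric_direction. A minimal sketch, assuming the metric arrives as a float, or as None when the evaluation crashed. `decide` is a hypothetical helper, not a function from run_experiment.py.

```python
from typing import Optional


def decide(new: Optional[float], best: Optional[float], direction: str) -> str:
    """Map one evaluation result to a results.tsv status: keep, discard, or crash."""
    if direction not in ("lower", "higher"):
        raise ValueError(f"direction must be 'lower' or 'higher', got {direction!r}")
    if new is None:  # the evaluation produced no metric
        return "crash"
    if best is None:  # the first successful run becomes the baseline
        return "keep"
    improved = new < best if direction == "lower" else new > best
    return "keep" if improved else "discard"


print(decide(185.0, 210.0, "lower"))   # keep
print(decide(8.1, 8.4, "higher"))      # discard
print(decide(None, 210.0, "lower"))    # crash
```

In this sketch a tie counts as no improvement, so an equal metric is discarded. Whether the real runner treats ties that way is not stated in the source.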
After every 10 experiments, review results.tsv for patterns. Update the Strategy section of program.md with what you learned (e.g., "caching changes consistently improve by 5-10%", "refactoring attempts never improve the metric"). Future iterations benefit from this accumulated knowledge.
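Part of that periodic review can be mechanised. The sketch below reads results.tsv content with the four columns listed earlier and reports the run count, keep count, and best kept metric. It assumes a tab-separated file with a header row, which the source does not confirm. `summarize` is a hypothetical name and the sample rows are made up.

```python
import csv
import io


def summarize(tsv_text: str, direction: str) -> dict:
    """Summarise results.tsv content: run count, keep count, best kept metric."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    # Only kept rows carry a metric that counts toward the best value
    kept = [float(r["metric"]) for r in rows if r["status"] == "keep"]
    pick = min if direction == "lower" else max
    return {
        "runs": len(rows),
        "kept": len(kept),
        "best": pick(kept) if kept else None,
    }


# Made-up sample rows, for illustration only
sample = (
    "commit\tmetric\tstatus\tdescription\n"
    "a1b2c3d\t800.0\tkeep\tbaseline\n"
    "b2c3d4e\t640.5\tkeep\tcache search results\n"
    "c3d4e5f\t702.1\tdiscard\trewrite tokenizer\n"
    "d4e5f6a\tN/A\tcrash\tsyntax error in search.py\n"
)
print(summarize(sample, "lower"))  # {'runs': 4, 'kept': 2, 'best': 640.5}
```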
evaluate.py is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.

Ready-to-use evaluation scripts, copied into the experiment directory during setup with --evaluator.
| Evaluator | Metric | Use Case |
| ----------------- | ----------------------- | ------------------------------- |
| benchmark_speed | p50_ms (lower) | Function/API execution time |
| benchmark_size | size_bytes (lower) | File, bundle, Docker image size |
| test_pass_rate | pass_rate (higher) | Test suite pass percentage |
| build_speed | build_seconds (lower) | Build/compile/Docker build time |
| memory_usage | peak_mb (lower) | Peak memory during execution |
| Evaluator | Metric | Use Case |
| ------------------- | -------------------------------- | ---------------------------------- |
| llm_judge_content | ctr_score 0-10 (higher) | Headlines, titles, descriptions |
| llm_judge_prompt | quality_score 0-100 (higher) | System prompts, agent instructions |
| llm_judge_copy | engagement_score 0-10 (higher) | Social posts, ad copy, emails |
LLM judges call the CLI tool the user is already running (Claude, Codex, Gemini). The evaluation prompt is locked inside evaluate.py — the agent cannot modify it. This prevents the agent from gaming its own evaluator.
The user's existing subscription covers the cost:
If no built-in evaluator fits, the user writes their own evaluate.py. Only requirement: it must print metric_name: value to stdout.
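#!/usr/bin/env python3
# My custom evaluator: DO NOT MODIFY after the experiment starts
import json
import subprocess

def parse_score(raw: str) -> float:
    # Assumption: my-benchmark emits JSON with a top-level "score" field; adapt to your tool
    return float(json.loads(raw)["score"])

result = subprocess.run(["my-benchmark", "--json"], capture_output=True, text=True)
# Print the metric as "name: value" so it can be read from stdout
print(f"my_metric: {parse_score(result.stdout)}")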
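On the consuming side, the `metric_name: value` contract can be parsed with a single regular expression. This is a sketch of how a runner might read the value, not the actual parsing in run_experiment.py. `extract_metric` is a hypothetical helper.

```python
import re
from typing import Optional


def extract_metric(stdout: str, name: str) -> Optional[float]:
    """Return the last 'name: value' float printed in stdout, or None if absent."""
    pattern = re.compile(
        rf"^\s*{re.escape(name)}\s*:\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$",
        re.MULTILINE,
    )
    matches = pattern.findall(stdout)
    # Take the last match so earlier progress output cannot shadow the final value
    return float(matches[-1]) if matches else None


output = "running 3 benchmarks\np50_ms: 185.0\n"
print(extract_metric(output, "p50_ms"))   # 185.0
print(extract_metric(output, "peak_mb"))  # None
```

Returning None for a missing metric gives the caller a clear signal to log the run as a crash instead of guessing a value.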
# Single experiment
python scripts/log_results.py --experiment engineering/api-speed
# All experiments in a domain
python scripts/log_results.py --domain engineering
# Cross-experiment dashboard
python scripts/log_results.py --dashboard
# Export formats
python scripts/log_results.py --experiment engineering/api-speed --format csv --output results.csv
python scripts/log_results.py --experiment engineering/api-speed --format markdown --output results.md
python scripts/log_results.py --dashboard --format markdown --output dashboard.md
DOMAIN       EXPERIMENT    RUNS  KEPT  BEST    Δ FROM START  STATUS
engineering  api-speed     47    14    185ms   -76.9%        active
engineering  bundle-size   23    8     412KB   -58.3%        paused
marketing    medium-ctr    31    11    8.4/10  +68.0%        active
prompts      support-tone  15    6     82/100  +46.4%        done
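The Δ FROM START column reads as the percentage change of the best metric relative to the first measurement. A sketch of that arithmetic, assuming this definition, which the source does not spell out. The baselines of 800 ms and 5.0 are made up to illustrate the formatting.

```python
def delta_from_start(baseline: float, best: float) -> str:
    """Format the relative change from baseline to best as a signed percentage."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    change = (best - baseline) / baseline * 100
    return f"{change:+.1f}%"


print(delta_from_start(800.0, 185.0))  # -76.9%
print(delta_from_start(5.0, 8.4))      # +68.0%
```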
Flag these without being asked:
git init && git add . && git commit -m 'initial' first.

git clone https://github.com/alirezarezvani/claude-skills.git
cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/
./scripts/convert.sh --skill autoresearch-agent --tool codex|gemini|cursor|windsurf|openclaw
clawhub install cs-autoresearch-agent
Creator: Alireza Rezvani
License: MIT
Source Repo: neekware/ehaye-skills
Source Bucket: engineering
Original Path: engineering/autoresearch-agent