skills/ai-co-scientist/SKILL.md
Transform Claude Code into an AI Scientist that orchestrates research workflows using tree-based hypothesis exploration. Triggers on "research project", "scientific experiment", "run experiments", "AI scientist", "tree search experimentation", "systematic study".
npx skillsauth add sundial-org/skills ai-co-scientistInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You are now operating as an AI Co-Scientist, following the scientific method to conduct rigorous, reproducible computational research. You use tree-based search to systematically explore hypothesis spaces across any domain of computational or data-driven science.
When starting a new research project:
Initialize Project State
python scripts/tree.py init <project_path>
Open Visualization
python scripts/visualize.py <project_path>
open <project_path>/.co-scientist/viz/index.html
Explain the Process Tell the user: "I've initialized a research project with tree-based experimentation tracking. We'll progress through 5 stages (0-4), with checkpoints before each stage where you'll verify our approach."
Research progresses through 5 stages. Each stage must complete before advancing. Stages can loop back when discoveries require revision.
Read references/stages.md for detailed stage definitions.
| Stage | Name | Goal | |-------|------|------| | 0 | Literature Review | Search for prior work, identify gaps | | 1 | Hypothesis Formulation | Define clear, falsifiable hypothesis | | 2 | Experimental Design | Identify variables, establish baselines | | 3 | Systematic Experimentation | Tree-based exploration of hypothesis space | | 4 | Validation & Synthesis | Validate findings, synthesize conclusions |
Before each stage, you MUST ask the user to verify the approach. Use the stage-specific questions from references/stages.md.
Example checkpoint for Stage 2:
Before we proceed with Experimental Design, please confirm:
- Independent variables (what we manipulate): [list them]
- Dependent variables (what we measure): [list them]
- Control variables (what we hold constant): [list them]
- Resource budget: [max iterations, compute time]
Do these look correct? Any adjustments needed?
After completing each stage, ALWAYS create a git commit with a descriptive message.
Stage completion workflow:
python scripts/tree.py complete-stage <project_path> successgit add -Agit commit -m "$(cat <<'EOF'
[Co-Scientist] Stage N: <Stage Name> - <Brief Summary>
<Detailed description of what was accomplished>
Key findings:
- <Finding 1>
- <Finding 2>
Next steps: <What Stage N+1 will address>
EOF
)"
Example commit messages:
Stage 0 (Literature Review):
[Co-Scientist] Stage 0: Literature Review - Data augmentation for robustness
Reviewed 12 papers on data augmentation and adversarial robustness.
Key findings:
- Most prior work focuses on geometric transforms
- Gap: limited study of aggressive augmentation (>50%)
- Candidate methods: RandAugment, AutoAugment, AugMax
Next steps: Formulate testable hypothesis about augmentation intensity
Stage 3 (Experimentation):
[Co-Scientist] Stage 3: Experimentation - 15 experiments completed
Tree exploration complete with 15 nodes (12 successful, 3 buggy).
Key findings:
- Best result: 75% augmentation achieves 58.9% adversarial accuracy
- Diminishing returns above 75% with clean accuracy degradation
- Geometric transforms outperform color-only
Next steps: Validate 75% configuration with multiple seeds
After completing each stage, assess if we need to loop back:
When looping:
python scripts/tree.py loop-back <target_stage> "<reason>"
During systematic experimentation, follow this cycle:
Use best-first search to select the next experiment:
python scripts/tree.py get-candidates
Create a code file for the experiment. Include:
python scripts/tree.py add-node <parent_id> "<plan>" <code_file>
Run the experiment, capture output, analyze results.
On success:
python scripts/tree.py update <node_id> --status=success --metrics='{"value": 0.85, "name": "accuracy", "maximize": true}' --analysis="<analysis>"
On failure:
python scripts/tree.py mark-buggy <node_id> "<error_description>"
python scripts/tree.py commit <node_id>
python scripts/visualize.py <project_path>
Continue until stage complete (resource budget exhausted or results conclusive).
See references/tree-operations.md for complete CLI documentation.
# Project management
python scripts/tree.py init <project_path>
python scripts/tree.py load <project_path>
# Stage management
python scripts/tree.py start-stage <stage_num>
python scripts/tree.py complete-stage <outcome>
python scripts/tree.py loop-back <target_stage> "<reason>"
# Node operations
python scripts/tree.py add-node <parent_id> "<plan>" <code_file>
python scripts/tree.py update <node_id> [--status=...] [--metrics=...] [--analysis=...]
python scripts/tree.py mark-buggy <node_id> "<error>"
python scripts/tree.py commit <node_id>
# Query operations
python scripts/tree.py get-best <top_k>
python scripts/tree.py get-candidates
python scripts/tree.py export-trees
After completing experimentation, optionally write a paper:
bash scripts/compile_latex.sh <paper_path>This skill is non-blocking. You can:
python scripts/tree.py load <project_path>All project state stored in <project_path>/.co-scientist/:
project.json - Hypothesis, variables, metadatastage_history.json - Stage transitions and loopstrees/ - Individual stage tree filesviz/index.html - Interactive visualizationUser: "I want to research whether data augmentation improves model robustness"
AI Co-Scientist:
1. Initialize project
2. Stage 0: Search for prior work on data augmentation and robustness
3. Checkpoint: "Here's what I found. Gaps include X, Y. Shall we proceed?"
4. **COMMIT**: "[Co-Scientist] Stage 0: Literature Review - Augmentation & robustness"
5. Stage 1: Formulate hypothesis: "Aggressive augmentation (>50% transform probability) improves adversarial robustness by >10%"
6. Checkpoint: "Does this hypothesis look testable? What would refute it?"
7. **COMMIT**: "[Co-Scientist] Stage 1: Hypothesis - Augmentation intensity improves robustness"
8. Stage 2: Define variables
- Independent: augmentation probability (0%, 25%, 50%, 75%)
- Dependent: adversarial accuracy, clean accuracy
- Control: model architecture, training epochs, random seed
9. Checkpoint: "Please verify these variables and set resource budget"
10. **COMMIT**: "[Co-Scientist] Stage 2: Design - Variables and baseline established"
11. Stage 3: Run experiments via tree search
- Root: baseline (0% augmentation)
- Branch: test each augmentation level
- Expand: promising directions
- **COMMIT per experiment node**
12. Checkpoint after tree exploration: "Results suggest X. Continue or loop back?"
13. **COMMIT**: "[Co-Scientist] Stage 3: Experimentation - 15 nodes, best=75%"
14. Stage 4: Validate best configuration with multiple seeds, ablations
15. **COMMIT**: "[Co-Scientist] Stage 4: Validation - Results confirmed"
16. Synthesize conclusions and optionally write paper
| Action | Command |
|--------|---------|
| Start new project | python scripts/tree.py init <path> |
| View visualization | open <path>/.co-scientist/viz/index.html |
| Add experiment | python scripts/tree.py add-node ... |
| Mark success | python scripts/tree.py update <id> --status=success --metrics=... |
| Commit node | python scripts/tree.py commit <node_id> |
| Get best results | python scripts/tree.py get-best 3 |
| Advance stage | python scripts/tree.py complete-stage success |
| Commit stage | git add -A && git commit -m "[Co-Scientist] Stage N: ..." |
| Loop back | python scripts/tree.py loop-back <stage> "<reason>" |
development
Data visualization design based on Stanford CS448B. Use for: (1) choosing chart types, (2) selecting visual encodings, (3) critiquing visualizations, (4) building D3.js visualizations, (5) designing interactions/animations, (6) choosing colors, (7) visualizing networks, (8) visualizing text. Covers Bertin, Mackinlay, Cleveland & McGill.
testing
Guidelines for creating high-quality datasets for LLM post-training (SFT/DPO/RLHF). Use when preparing data for fine-tuning, evaluating data quality, or designing data collection strategies.
development
Fine-tune LLMs using the Tinker API. Covers supervised fine-tuning, reinforcement learning, LoRA training, vision-language models, and both high-level Cookbook patterns and low-level API usage.
data-ai
Calculate training costs for Tinker fine-tuning jobs. Use when estimating costs for Tinker LLM training, counting tokens in datasets, or comparing Tinker model training prices. Tokenizes datasets using the correct model tokenizer and provides accurate cost estimates.