skills/experiment-sandbox/SKILL.md
Experiment sandbox execution for Research Loop. Supports four modes: local (venv), Docker (isolated containers), SSH remote (GPU compute on servers), and Colab (Google Drive bridge). Provides experiment harness templates, code validation, metric collection, deterministic seeding, and compute budget enforcement. Use before running experiments generated by the paper-pipeline.
npx skillsauth add moralespanitz/research-loop experiment-sandbox
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Sandboxed experiment execution for research pipelines. Runs generated experiment code in isolated environments with resource limits, metric collection, and reproducibility guarantees.
| Mode | When to Use | Isolation | GPU Support |
|------|-------------|-----------|-------------|
| local (venv) | Quick experiments, development, no sensitive data | Process-level | Limited |
| docker | Reproducible environments, dependency isolation | Container | Yes (with nvidia-docker) |
| ssh_remote | GPU compute on lab servers, cloud instances | Remote process | Yes |
| colab | Google Colab via Drive bridge | Remote session | Yes (T4/V100) |
Default mode: local. Switch to docker or ssh_remote when GPU or
strict reproducibility is required.
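The selection rule above can be sketched as a small helper. This is only an illustration: the function name `choose_mode` and its parameters are hypothetical, and the preference for `ssh_remote` over `docker` when a remote host is configured is an assumption, not something the skill specifies. Colab is left out because the table gives no rule for choosing it automatically.

```python
def choose_mode(needs_gpu: bool = False,
                strict_repro: bool = False,
                has_remote_host: bool = False) -> str:
    """Pick a sandbox mode name from the table (hypothetical helper)."""
    if needs_gpu:
        # Both ssh_remote and docker list GPU support; preferring the
        # remote host when one is configured is an assumption.
        return "ssh_remote" if has_remote_host else "docker"
    if strict_repro:
        # Container isolation is the table's answer for reproducibility.
        return "docker"
    # Default mode per the text above.
    return "local"


print(choose_mode())
print(choose_mode(needs_gpu=True, has_remote_host=True))
```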
python3 -m venv .venv
source .venv/bin/activate
pip install numpy
# Build the experiment image
docker build -t research-loop/experiment:latest -f - <<'DOCKERFILE'
FROM python:3.11-slim
RUN pip install numpy
WORKDIR /experiment
DOCKERFILE
# For GPU support (Linux only):
# docker build -t research-loop/experiment:latest -f Dockerfile.gpu .
# SSH config
Host research-loop-worker
HostName gpu-server.lab.edu
User researcher
IdentityFile ~/.ssh/id_rsa
# Verify connection
ssh research-loop-worker "python3 --version"
# Mount Google Drive
# Create folder: ~/Library/CloudStorage/GoogleDrive-*/My Drive/research-loop
# In Colab: mount Drive, set WORKDIR, run the experiment
Every experiment must follow this structure for reliable metric collection:
import numpy as np
import json, time, sys

def main():
    # === SETUP ===
    np.random.seed(42)  # deterministic
    start_time = time.time()
    time_budget = 300  # seconds, from config

    # === TIME ESTIMATE ===
    # Run one condition to estimate
    # Print: TIME_ESTIMATE: Xs

    # === MAIN LOOP ===
    results = {}
    for condition in conditions:
        # Time guard
        elapsed = time.time() - start_time
        if elapsed > 0.8 * time_budget:
            print(f"TIME_GUARD: {elapsed:.1f}s, saving partial results")
            break

        # Run experiment
        metrics = run_condition(condition)

        # Log metrics
        print(f"condition: {condition}")
        for key, val in metrics.items():
            print(f"{key}: {val}")
        results[condition] = metrics

    # === SAVE ===
    with open("results.json", "w") as f:
        json.dump(results, f, indent=2, default=str)
    print(f"Total time: {time.time() - start_time:.1f}s")

if __name__ == "__main__":
    main()
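The TIME ESTIMATE step is left as comments in the template. One way it might be filled in is shown below: time a single pilot condition and extrapolate linearly to the full list. The names `estimate_total_time` and `demo_condition` are hypothetical, and linear scaling across conditions is an assumption that will not hold if conditions differ widely in cost.

```python
import time


def estimate_total_time(run_condition, conditions):
    """Time one pilot condition and extrapolate linearly to all of them."""
    if not conditions:
        return 0.0
    start = time.perf_counter()
    run_condition(conditions[0])  # pilot run; its result is discarded
    per_condition = time.perf_counter() - start
    return per_condition * len(conditions)


def demo_condition(lr):
    # Stand-in workload for illustration only.
    return {"loss": sum(i * lr for i in range(10_000))}


estimate = estimate_total_time(demo_condition, [0.1, 0.01, 0.001])
print(f"TIME_ESTIMATE: {estimate:.1f}s")
```

Printing the estimate before the main loop lets the caller compare it with the time budget and reduce the number of conditions before any long run starts.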
Mandatory outputs:
- stdout: metric lines as name: value (one per line)
- results.json: structured JSON with all metrics
- TIME_ESTIMATE: Xs line before main loop

Before any execution, validate the experiment code:
python3 -c "import py_compile; py_compile.compile('main.py', doraise=True)"
Reject code containing:
- import os or from os import (except os.path)
- import subprocess
- import shutil
- import socket
- eval(, exec(
- __import__ (dynamic imports)
- open( with write mode (w, a, wb) outside allowed output paths

python3 -c "
import sys
allowed = {'numpy', 'math', 'random', 'statistics', 'json', 'time', 'collections',
           'itertools', 'functools', 'typing', 'dataclasses'}
for mod in sys.modules:
    if mod not in allowed and mod.startswith('experiment_'):
        print(f'WARNING: unknown import {mod}')
"
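The rejection rules and the import allowlist can also be checked statically, without importing or running the experiment. The sketch below uses Python's `ast` module. It is not the skill's own validator: `find_violations` is a hypothetical name, all `os` imports are treated as forbidden (the `os.path` exception is not implemented), and the write-mode `open(` rule is not covered.

```python
import ast

FORBIDDEN_MODULES = {"os", "subprocess", "shutil", "socket"}
FORBIDDEN_CALLS = {"eval", "exec", "__import__"}
ALLOWED_MODULES = {"numpy", "math", "random", "statistics", "json", "time",
                   "collections", "itertools", "functools", "typing",
                   "dataclasses"}


def find_violations(source: str) -> list[str]:
    """Return human-readable problems found in the experiment source."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        # Collect module names from "import x" and "from x import y".
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for name in modules:
            root = name.split(".")[0]
            if root in FORBIDDEN_MODULES:
                problems.append(f"forbidden import: {name}")
            elif root not in ALLOWED_MODULES:
                problems.append(f"unknown import: {name}")
        # Direct calls such as eval(...), exec(...), __import__(...).
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in FORBIDDEN_CALLS):
            problems.append(f"forbidden call: {node.func.id}()")
    return problems


sample = "import numpy as np\nimport subprocess\nx = eval('1+1')\n"
print(find_violations(sample))
```

Because this inspects the syntax tree rather than raw text, a string or comment that merely mentions `import os` does not trigger a rejection. It still cannot catch obfuscated access such as `getattr(__builtins__, "eval")`, so it complements process or container isolation and does not replace it.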
Verify that np.random.seed() or random.seed() is called before any
random operations. If seeds are missing, the experiment is not reproducible.
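A rough static version of this seed check is sketched below. It compares source line numbers, which only approximates execution order, and it recognizes only the spellings `np.random.seed`, `numpy.random.seed`, and `random.seed`; other aliases, and seeded generators such as `np.random.default_rng(42)`, are not treated as seeding here. The function names are hypothetical.

```python
import ast

SEED_CALLS = {"np.random.seed", "numpy.random.seed", "random.seed"}
RANDOM_PREFIXES = ("np.random.", "numpy.random.", "random.")


def dotted_name(node):
    """Return e.g. 'np.random.seed' for an attribute chain, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def seed_precedes_random_use(source: str) -> bool:
    """True if a seed call appears on an earlier line than any random call."""
    seed_lines, use_lines = [], []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = dotted_name(node.func)
        if name is None:
            continue
        if name in SEED_CALLS:
            seed_lines.append(node.lineno)
        elif name.startswith(RANDOM_PREFIXES):
            use_lines.append(node.lineno)
    if not seed_lines:
        return False  # no seed at all: not reproducible
    return not use_lines or min(seed_lines) < min(use_lines)


good = "import numpy as np\nnp.random.seed(42)\nx = np.random.rand(3)\n"
bad = "import numpy as np\nx = np.random.rand(3)\nnp.random.seed(42)\n"
print(seed_precedes_random_use(good), seed_precedes_random_use(bad))
```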
If validation fails:
# Setup
RUN_DIR=/tmp/research-loop-experiment-$(uuidgen)
mkdir -p "$RUN_DIR" && cd "$RUN_DIR"
cp -r <experiment_dir>/* .
python3 -m venv .venv
source .venv/bin/activate
pip install numpy 2>&1 | tail -1
# Run with timeout
timeout 300 python3 main.py 2>&1 | tee run.log
# Check results
python3 -c "
import json
with open('results.json') as f:
    data = json.load(f)
print(f'Conditions: {len(data)}')
print(json.dumps(data, indent=2)[:500])
"
docker run --rm \
-v $(pwd)/<experiment_dir>:/experiment \
-w /experiment \
research-loop/experiment:latest \
timeout 300 python3 main.py 2>&1 | tee run.log
# Copy experiment to remote
rsync -avz <experiment_dir>/ research-loop-worker:~/experiments/current/
# Run remotely
ssh research-loop-worker \
"cd ~/experiments/current && timeout 300 python3 main.py" \
2>&1 | tee run.log
# Copy experiment to Drive
cp -r <experiment_dir>/* ~/Library/CloudStorage/GoogleDrive-*/My\ Drive/research-loop/experiments/current/
# In Colab:
# from google.colab import drive
# drive.mount('/content/drive')
# %cd /content/drive/MyDrive/research-loop/experiments/current
# !timeout 300 python3 main.py
After the experiment runs, collect and validate metrics:
import json

def collect_metrics(run_dir):
    """Parse stdout and results.json into structured metrics dict."""
    # Parse stdout for "name: value" lines
    metrics = {}
    with open(f"{run_dir}/run.log") as f:
        for line in f:
            line = line.strip()
            if ": " in line and not line.startswith("#"):
                parts = line.split(": ", 1)
                try:
                    metrics[parts[0]] = float(parts[1])
                except ValueError:
                    metrics[parts[0]] = parts[1]

    # Merge with results.json
    try:
        with open(f"{run_dir}/results.json") as f:
            rj = json.load(f)
        metrics["_results_json"] = rj
    except (FileNotFoundError, json.JSONDecodeError):
        pass

    return metrics
Validation checks:
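As a minimal sketch of what such checks might look like, the function below verifies only the mandatory outputs described earlier: a TIME_ESTIMATE line in run.log and a parseable, non-empty results.json with finite numeric values. The name `check_outputs` and the choice of checks are assumptions for illustration, not the skill's actual list.

```python
import json
import math
import re
import tempfile
from pathlib import Path


def check_outputs(run_dir: str) -> list[str]:
    """Return a list of problems found in a finished run directory."""
    problems = []
    log_path = Path(run_dir) / "run.log"
    results_path = Path(run_dir) / "results.json"

    log_text = log_path.read_text() if log_path.exists() else ""
    if not re.search(r"^TIME_ESTIMATE: ", log_text, flags=re.MULTILINE):
        problems.append("run.log has no TIME_ESTIMATE line")

    try:
        results = json.loads(results_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return problems + ["results.json is missing or not valid JSON"]

    if not isinstance(results, dict) or not results:
        return problems + ["results.json contains no conditions"]
    for condition, metrics in results.items():
        if not isinstance(metrics, dict):
            continue
        for name, value in metrics.items():
            # json.loads accepts NaN/Infinity, so check for them explicitly.
            if isinstance(value, float) and not math.isfinite(value):
                problems.append(f"{condition}/{name} is not finite: {value}")
    return problems


# Demo on a throwaway directory with well-formed outputs.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "run.log").write_text("TIME_ESTIMATE: 12.0s\nloss: 0.5\n")
    Path(tmp, "results.json").write_text('{"baseline": {"loss": 0.5}}')
    demo_problems = check_outputs(tmp)
print(demo_problems)
```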
Every experiment MUST have:
- np.random.seed(42) or equivalent

python3 --version > environment.txt
pip freeze >> environment.txt
git rev-parse HEAD

| Budget | Max Conditions | Max Seeds | Max Iterations/Condition |
|--------|----------------|-----------|--------------------------|
| < 120s | 1-2 | 1-2 | ≤ 1,000 |
| 120-300s | 2-5 | 3-5 | ≤ 5,000 |
| 300-900s | 5-10 | 5-10 | ≤ 20,000 |
| > 900s | 10-50 | 10-20 | No limit (but guard at 80%) |
Always print TIME_ESTIMATE: Xs based on a 1-condition pilot run before the
main experiment loop. Always implement a time guard that saves partial results.
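The budget tiers can be encoded as a lookup so that generated experiments are capped before they run. This sketch returns the upper end of each range from the table. The table does not say which tier the boundary values 120, 300, and 900 belong to, so the choices below are assumptions, and `Limits`, `limits_for_budget`, and `time_guard_deadline` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Limits(NamedTuple):
    max_conditions: int
    max_seeds: int
    max_iterations: Optional[int]  # None means no fixed cap


def limits_for_budget(budget_s: float) -> Limits:
    """Map a time budget in seconds to the upper limits of its tier."""
    if budget_s < 120:
        return Limits(2, 2, 1_000)
    if budget_s < 300:      # assumption: 120 belongs to the 120-300s tier
        return Limits(5, 5, 5_000)
    if budget_s <= 900:     # assumption: 300 and 900 belong to 300-900s
        return Limits(10, 10, 20_000)
    return Limits(50, 20, None)


def time_guard_deadline(budget_s: float, fraction: float = 0.8) -> float:
    """Elapsed seconds after which the loop should save partial results."""
    return fraction * budget_s


print(limits_for_budget(300))
print(time_guard_deadline(300))
```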