<SUBAGENT-STOP>
If you were dispatched as a subagent to run a specific experiment in a sandbox, skip this skill. Execute the task and return structured results immediately.
</SUBAGENT-STOP>

<HARD-GATE>
NEVER execute untrusted code outside a sandbox. If the experiment code contains os.system, subprocess, eval, exec, or network calls, REJECT it before execution. No exceptions.
</HARD-GATE>

Experiment Sandbox Skill

Sandboxed experiment execution for research pipelines. Runs generated experiment code in isolated environments with resource limits, metric collection, and reproducibility guarantees.

Sandbox Modes

| Mode | When to Use | Isolation | GPU Support |
|------|-------------|-----------|-------------|
| local (venv) | Quick experiments, development, no sensitive data | Process-level | Limited |
| docker | Reproducible environments, dependency isolation | Container | Yes (with nvidia-docker) |
| ssh_remote | GPU compute on lab servers, cloud instances | Remote process | Yes |
| colab | Google Colab via Drive bridge | Remote session | Yes (T4/V100) |

Default mode: local. Switch to docker or ssh_remote when GPU or strict reproducibility is required.

Prerequisites

Local Mode

python3 -m venv .venv
source .venv/bin/activate
pip install numpy

Docker Mode

# Build the experiment image
docker build -t research-loop/experiment:latest - <<'DOCKERFILE'
FROM python:3.11-slim
RUN pip install numpy
WORKDIR /experiment
DOCKERFILE

# For GPU support (Linux only):
# docker build -t research-loop/experiment:latest -f Dockerfile.gpu .

SSH Remote Mode

# Entry in ~/.ssh/config
Host research-loop-worker
    HostName gpu-server.lab.edu
    User researcher
    IdentityFile ~/.ssh/id_rsa

# Verify connection
ssh research-loop-worker "python3 --version"

Colab Mode

# Mount Google Drive
# Create folder: ~/Library/CloudStorage/GoogleDrive-*/My Drive/research-loop
# In Colab: mount Drive, set WORKDIR, run the experiment

Experiment Harness Template

Every experiment must follow this structure for reliable metric collection:

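import json
import time

import numpy as np

SEED = 42
TIME_BUDGET = 300  # seconds, from config

# Placeholders: replace with the experiment's own conditions and logic.
CONDITIONS = ["baseline"]


def run_condition(condition):
    """Run one condition and return a dict of metric name -> value."""
    return {"placeholder_metric": float(np.random.rand(100).mean())}


def save_results(results):
    with open("results.json", "w") as f:
        json.dump(results, f, indent=2, default=str)


def main():
    # === SETUP ===
    np.random.seed(SEED)  # deterministic
    start_time = time.time()

    # === TIME ESTIMATE ===
    # Run one condition as a pilot and extrapolate to the full loop
    pilot_start = time.time()
    run_condition(CONDITIONS[0])
    pilot_time = time.time() - pilot_start
    print(f"TIME_ESTIMATE: {pilot_time * len(CONDITIONS):.1f}s")
    np.random.seed(SEED)  # reseed so results do not depend on the pilot

    # === MAIN LOOP ===
    results = {}
    for condition in CONDITIONS:
        # Time guard
        elapsed = time.time() - start_time
        if elapsed > 0.8 * TIME_BUDGET:
            print(f"TIME_GUARD: {elapsed:.1f}s, saving partial results")
            break

        # Run experiment
        metrics = run_condition(condition)

        # Log metrics
        print(f"condition: {condition}")
        for key, val in metrics.items():
            print(f"{key}: {val}")
        results[condition] = metrics

        # Save after every condition so a crash keeps completed work
        save_results(results)

    # === SAVE ===
    save_results(results)
    print(f"Total time: {time.time() - start_time:.1f}s")


if __name__ == "__main__":
    main()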

Mandatory outputs:

stdout: metric lines as name: value (one per line)
results.json: structured JSON with all metrics
TIME_ESTIMATE: Xs line before main loop
Time guard at 80% of budget

Code Validation

Before any execution, validate the experiment code:

1. Syntax Check

python3 -c "import py_compile; py_compile.compile('main.py', doraise=True)"

2. Security Scan

Reject code containing:

import os or from os import (except os.path)
import subprocess
import shutil
import socket
eval(, exec(
__import__ (dynamic imports)
open( with write mode (w, a, wb) outside allowed output paths
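The reject list above can be checked on the parsed syntax tree instead of by string matching, which avoids false hits inside comments and string literals. The following is a minimal sketch using Python's standard `ast` module. The names `scan_source`, `BANNED_MODULES`, and `BANNED_CALLS` are illustrative, not part of this skill. It reads the `os.path` exception narrowly as `from os.path import ...` only, because `import os.path` still binds the whole `os` module. The write-mode `open(` rule is omitted because the allowed output paths are not specified here.

```python
import ast

# Illustrative reject lists mirroring the items above (names are assumptions).
BANNED_MODULES = {"os", "subprocess", "shutil", "socket"}
BANNED_CALLS = {"eval", "exec", "__import__"}


def scan_source(source):
    """Return a list of human-readable violations found in `source`."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import os.path` is rejected too: it binds the full os module.
                if alias.name.split(".")[0] in BANNED_MODULES:
                    violations.append(f"line {node.lineno}: import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            # Only `from os.path import ...` is tolerated.
            if module.split(".")[0] in BANNED_MODULES and module != "os.path":
                violations.append(f"line {node.lineno}: from {module} import ...")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BANNED_CALLS:
                violations.append(f"line {node.lineno}: call to {node.func.id}()")
    return violations
```

An empty list means the scan passed. A static scan of this kind is a first filter only and does not replace running the code inside the sandbox.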

3. Import Check

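python3 -c "
import ast
allowed = {'numpy', 'math', 'random', 'statistics', 'json', 'time', 'collections',
           'itertools', 'functools', 'typing', 'dataclasses'}
# Inspect the imports declared in main.py, not the modules loaded by this checker
with open('main.py') as f:
    tree = ast.parse(f.read())
for node in ast.walk(tree):
    if isinstance(node, ast.Import):
        names = [alias.name for alias in node.names]
    elif isinstance(node, ast.ImportFrom):
        names = [node.module or '']
    else:
        continue
    for name in names:
        if name.split('.')[0] not in allowed:
            print(f'WARNING: unknown import {name}')
"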

4. Determinism Check

Verify that np.random.seed() or random.seed() is called before any random operations. If seeds are missing, the experiment is not reproducible.
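One way to automate this check is a line-order heuristic over the syntax tree: find the first seed call and the first other call into a `random` namespace, then compare their line numbers. The sketch below is an approximation under stated assumptions. The function names are illustrative. It only recognises dotted calls such as `np.random.seed(...)` or `random.seed(...)`, it treats `np.random.default_rng(...)` as an ordinary random call, and it compares source line numbers rather than actual execution order.

```python
import ast


def _dotted_name(node):
    """Return a name like 'np.random.seed' for an attribute chain, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def check_seeding(source):
    """Heuristic: True if a seed call precedes every other random call."""
    seed_lines, random_lines = [], []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = _dotted_name(node.func)
        if name is None:
            continue
        parts = name.split(".")
        # Only consider calls made through a `random` namespace.
        if "random" not in parts[:-1]:
            continue
        if parts[-1] == "seed":
            seed_lines.append(node.lineno)
        else:
            random_lines.append(node.lineno)
    if not random_lines:
        return True  # nothing random to seed
    if not seed_lines:
        return False
    return min(seed_lines) < min(random_lines)
```

Code that uses no randomness passes vacuously. A failure is a prompt to inspect the code, not proof that the run is non-deterministic.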

Auto-Repair

If validation fails:

Parse the error message
Trace to root cause (not surface symptom)
Fix and re-validate (max 3 attempts)
If still failing after 3 attempts, log the failure and escalate
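The repair steps above amount to a bounded retry loop. The sketch below shows one possible shape for it. `validate` and `repair` are hypothetical hooks supplied by the caller and are not defined by this skill: `validate(source)` is assumed to return a list of error strings (empty when valid) and `repair(source, errors)` a new candidate source.

```python
def validate_with_repair(source, validate, repair, max_attempts=3):
    """Validate `source`, attempting up to `max_attempts` repairs.

    Returns (final_source, ok, log), where `log` records the errors seen
    before each repair attempt and a final "escalate" entry on failure.
    """
    log = []
    errors = validate(source)
    attempts = 0
    while errors and attempts < max_attempts:
        attempts += 1
        log.append({"attempt": attempts, "errors": list(errors)})
        source = repair(source, errors)
        errors = validate(source)
    if errors:
        # Still failing after the last attempt: record it for escalation.
        log.append({"attempt": "escalate", "errors": list(errors)})
    return source, not errors, log
```

Keeping the log separate from the return status lets the caller report every failed attempt when escalating.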

Running an Experiment

Local Mode

# Setup
RUN_DIR=$(mktemp -d /tmp/research-loop-experiment-XXXXXX)
cd "$RUN_DIR"
cp -r <experiment_dir>/* .
python3 -m venv .venv
source .venv/bin/activate
pip install numpy 2>&1 | tail -1

# Run with timeout
timeout 300 python3 main.py 2>&1 | tee run.log

# Check results
python3 -c "
import json
with open('results.json') as f:
    data = json.load(f)
print(f'Conditions: {len(data)}')
print(json.dumps(data, indent=2)[:500])
"

Docker Mode

docker run --rm \
  -v "$(pwd)/<experiment_dir>":/experiment \
  -w /experiment \
  research-loop/experiment:latest \
  timeout 300 python3 main.py 2>&1 | tee run.log

SSH Remote Mode

# Copy experiment to remote
rsync -avz <experiment_dir>/ research-loop-worker:~/experiments/current/

# Run remotely
ssh research-loop-worker \
  "cd ~/experiments/current && timeout 300 python3 main.py" \
  2>&1 | tee run.log

Colab Mode

# Copy experiment to Drive
cp -r <experiment_dir>/* ~/Library/CloudStorage/GoogleDrive-*/My\ Drive/research-loop/experiments/current/

# In Colab:
# from google.colab import drive
# drive.mount('/content/drive')
# %cd /content/drive/MyDrive/research-loop/experiments/current
# !timeout 300 python3 main.py

Metric Collection

After the experiment runs, collect and validate metrics:

import json

def collect_metrics(run_dir):
    """Parse stdout and results.json into structured metrics dict."""
    # Parse stdout for "name: value" lines
    metrics = {}
    with open(f"{run_dir}/run.log") as f:
        for line in f:
            line = line.strip()
            if ": " in line and not line.startswith("#"):
                parts = line.split(": ", 1)
                try:
                    metrics[parts[0]] = float(parts[1])
                except ValueError:
                    metrics[parts[0]] = parts[1]

    # Merge with results.json
    try:
        with open(f"{run_dir}/results.json") as f:
            rj = json.load(f)
            metrics["_results_json"] = rj
    except (FileNotFoundError, json.JSONDecodeError):
        pass

    return metrics
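Because every condition prints the same metric names, the flat dictionary returned by `collect_metrics` keeps only the last value logged for each name. Per-condition values remain available under `_results_json`. If they also need to be recovered from the log itself, the lines can be grouped under the most recent `condition:` line. This is a sketch: `group_by_condition` and `RESERVED` are illustrative names, and the reserved set assumes the harness prints exactly the three bookkeeping lines shown in the template.

```python
# Bookkeeping lines printed by the harness template, not metrics.
RESERVED = {"TIME_ESTIMATE", "TIME_GUARD", "Total time"}


def group_by_condition(lines):
    """Group "name: value" lines under the preceding "condition: ..." line."""
    grouped = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if ": " not in line or line.startswith("#"):
            continue
        name, value = line.split(": ", 1)
        if name in RESERVED:
            continue
        if name == "condition":
            current = value
            grouped[current] = {}
            continue
        if current is None:
            continue  # metric-like line before any condition was announced
        try:
            grouped[current][name] = float(value)
        except ValueError:
            grouped[current][name] = value
    return grouped
```

The function accepts any iterable of lines, so an open `run.log` file handle can be passed directly.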

Validation checks:

All declared metrics from the experiment plan are present
No NaN or Inf values in numeric metrics
TIME_ESTIMATE was accurate within 20%
Time guard did not fire prematurely (< 50% of budget)
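These four checks can be expressed as one function that returns a list of problems. The sketch below makes some assumptions that the text does not state: the 20% tolerance is taken relative to the actual run time, the caller supplies the estimate and actual time as plain seconds, and `guard_fired_at` is the elapsed time at which the guard fired (or `None` if it did not). All names are illustrative.

```python
import math


def validate_metrics(metrics, declared, time_estimate, actual_time,
                     time_budget, guard_fired_at=None):
    """Return a list of validation problems; an empty list means all passed."""
    problems = []
    # 1. Every declared metric is present.
    for name in declared:
        if name not in metrics:
            problems.append(f"missing metric: {name}")
    # 2. No NaN or Inf among numeric values.
    for name, value in metrics.items():
        if isinstance(value, float) and not math.isfinite(value):
            problems.append(f"non-finite metric: {name}={value}")
    # 3. Estimate within 20% of the actual run time (assumed reference point).
    if actual_time > 0:
        error = abs(time_estimate - actual_time) / actual_time
        if error > 0.20:
            problems.append(f"time estimate off by {error:.0%}")
    # 4. Guard did not fire before 50% of the budget had elapsed.
    if guard_fired_at is not None and guard_fired_at < 0.5 * time_budget:
        problems.append(f"time guard fired early at {guard_fired_at:.1f}s")
    return problems
```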

Reproducibility Requirements

Every experiment MUST have:

Deterministic seeds — np.random.seed(42) or equivalent
Full config capture — save config/params used for the run

Environment snapshot — Python version, package versions:

python3 --version > environment.txt
pip freeze >> environment.txt

Timestamp and duration — start time, end time, total seconds
Git commit hash (if in a repo) — git rev-parse HEAD
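The items above can be gathered into a single manifest written next to `results.json`. The following is a sketch, not a prescribed format: the function names and the manifest keys are assumptions. It is meant to run in the orchestrating pipeline rather than inside experiment code, since it calls `git` through `subprocess`, which the security scan rejects in experiments. Package versions are read with the standard `importlib.metadata` module as an in-process alternative to `pip freeze`.

```python
import platform
import subprocess
import time
from importlib import metadata


def git_commit():
    """Return the current commit hash, or None outside a repo or without git."""
    try:
        out = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()


def _iso_utc(timestamp):
    """Format a Unix timestamp as an ISO 8601 UTC string."""
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(timestamp))


def build_manifest(config, seed, start_time, end_time):
    """Collect the reproducibility fields for one run into a dict."""
    packages = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
    )
    return {
        "seed": seed,
        "config": config,
        "python_version": platform.python_version(),
        "packages": packages,
        "start_time": _iso_utc(start_time),
        "end_time": _iso_utc(end_time),
        "duration_seconds": end_time - start_time,
        "git_commit": git_commit(),
    }
```

The returned dictionary contains only JSON-serialisable values as long as `config` does, so it can be written with `json.dump`.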

Compute Budget Enforcement

| Budget | Max Conditions | Max Seeds | Max Iterations/Condition |
|--------|---------------|-----------|-------------------------|
| < 120s | 1-2 | 1-2 | ≤ 1,000 |
| 120-300s | 2-5 | 3-5 | ≤ 5,000 |
| 300-900s | 5-10 | 5-10 | ≤ 20,000 |
| > 900s | 10-50 | 10-20 | No limit (but guard at 80%) |

Always print TIME_ESTIMATE: Xs based on a 1-condition pilot run before the main experiment loop. Always implement a time guard that saves partial results.
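The budget table can be encoded as a lookup so that an experiment plan is checked against it before running. This is a sketch with illustrative names. The table's tiers share their boundary values (300s and 900s each appear in two rows), so the sketch assumes a boundary value belongs to the lower tier. It returns only the upper bound of each range.

```python
def budget_limits(budget_seconds):
    """Return the table's upper bounds for a time budget in seconds.

    `max_iterations` is None where the table sets no limit.
    """
    if budget_seconds < 120:
        return {"max_conditions": 2, "max_seeds": 2, "max_iterations": 1_000}
    if budget_seconds <= 300:
        return {"max_conditions": 5, "max_seeds": 5, "max_iterations": 5_000}
    if budget_seconds <= 900:
        return {"max_conditions": 10, "max_seeds": 10, "max_iterations": 20_000}
    return {"max_conditions": 50, "max_seeds": 20, "max_iterations": None}


def plan_fits_budget(n_conditions, n_seeds, n_iterations, budget_seconds):
    """Check a planned experiment size against the limits for its budget."""
    limits = budget_limits(budget_seconds)
    if n_conditions > limits["max_conditions"] or n_seeds > limits["max_seeds"]:
        return False
    max_iter = limits["max_iterations"]
    return max_iter is None or n_iterations <= max_iter
```

Passing this check does not replace the time guard, which still applies at 80% of the budget in every tier.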

Anti-patterns

Running without validation. Always validate code syntax and security before the first execution. Never skip this step.
Ignoring NaN/Inf. If the experiment produces NaN or Inf values, treat this as a bug. Trace to root cause and fix, do not suppress.
No partial results. If the experiment crashes at 90%, all work is lost. Always write intermediate results periodically.
Non-deterministic runs. Without a fixed seed, the experiment cannot be reproduced. Always set seeds explicitly.
Missing environment capture. Without knowing the Python version and package versions, results are not reproducible. Always run pip freeze.

Verification Checklist

[ ] Code passes syntax check
[ ] Security scan passes (no dangerous imports)
[ ] Deterministic seeds are set before any random operations
[ ] TIME_ESTIMATE line is present in the code
[ ] Time guard is implemented at 80% of budget
[ ] results.json contains all declared metrics
[ ] stdout contains "{metric_name}: {value}" lines
[ ] Environment is captured (python version, pip freeze)
[ ] Experiment ran to completion or saved partial results
