
spartanlabsxyz/simmer-autoresearch

Name: simmer-autoresearch
Author: spartanlabsxyz

mcp/skills/autoresearch/SKILL.md

npx skillsauth add spartanlabsxyz/simmer-sdk simmer-autoresearch


Simmer Autoresearch

Autonomous experiment loop for trading skill optimization: try ideas, keep what works, discard what doesn't, never stop.

Based on pi-autoresearch (MIT).

Tools

init_experiment — configure session (name, skill_slug, metric, unit, direction). Call again to re-initialize with a new baseline.
run_experiment — runs skill command, times it, captures output.
log_experiment — records result. Use keep ONLY when the primary metric improved vs the baseline. discard if worse or unchanged. Zero trades or metric=0 is no-signal — discard, never keep. crash if the skill failed. checks_failed if post-run validation failed. keep auto-commits via git; the others auto-revert. Always include secondary metrics dict. State the before→after comparison in description (e.g., "entry_threshold 0.05→0.03; $12→$18 pnl, keep"). Optionally include asi (Actionable Side Information) for structured diagnostics.
backtest_experiment — replay historical trades against new config without live execution. Fast config tuning (seconds vs hours). Requires trades with signal_data (SDK 0.9.17+).
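The before→after convention that log_experiment asks for can be generated mechanically. The sketch below is a hypothetical helper: `format_description` and its parameter names are illustrative and not part of the Simmer tools. It only builds a string in the same shape as the example above.

```python
def format_description(param: str, before, after,
                       metric_before: str, metric_after: str,
                       metric_name: str, status: str) -> str:
    """Build a before→after description string for a logged experiment.

    Hypothetical helper; the real log_experiment tool only needs the
    resulting text in its description field.
    """
    return (f"{param} {before}→{after}; "
            f"{metric_before}→{metric_after} {metric_name}, {status}")


if __name__ == "__main__":
    print(format_description("entry_threshold", 0.05, 0.03,
                             "$12", "$18", "pnl", "keep"))
```

Keeping the format fixed makes it easier to scan earlier entries when deciding what to try next.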

Setup

Pick a skill to optimize and a primary metric (usually P&L).
git checkout -b autoresearch/<skill>-<date>
Read the skill source code thoroughly — understand what it does before mutating.
Write autoresearch.md — session spec with goal, metrics, how to run, constraints.
Write autoresearch.sh — single command that runs the skill and outputs results.
Commit both files.
init_experiment → run baseline with run_experiment → log_experiment → start looping.

autoresearch.md template

# Autoresearch: <goal>

## Objective
<What we're optimizing and the workload.>

## Metrics
- **Primary**: <name> (<unit>, lower/higher is better)
- **Secondary**: <name>, <name>, ...

## How to Run
`./autoresearch.sh` — runs the skill for one cycle.

## Constraints
- Only modify files in <skill directory>
- Do not change SDK core code
- Sim venue only (no real money)

The Loop

Each iteration:

Form a hypothesis (what change might improve the metric?)
Mutate skill code or config
run_experiment — execute the skill
log_experiment — compare metric to baseline. Improved → keep. Worse or equal → discard. Crashed → crash. Include the before→after comparison in description.

Code mutations > config tuning. Structural changes (new data sources, different models, alternative strategies) find bigger wins than parameter tweaks.

Use backtest_experiment for fast config exploration before committing to live runs.
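The keep/discard decision in the loop can be sketched as a small pure function. This is an illustration under stated assumptions, not the tool's implementation: the function name, its arguments, and the `"higher"`/`"lower"` direction values are hypothetical, and the catastrophic-secondary-metric override is left to human judgment.

```python
from typing import Optional


def decide_status(baseline: float, metric: Optional[float], trades: int,
                  direction: str = "higher", crashed: bool = False,
                  checks_passed: bool = True) -> str:
    """Map one experiment result to a log_experiment status string."""
    if crashed or metric is None:
        return "crash"
    if not checks_passed:
        return "checks_failed"
    # Zero trades or a zero metric is a no-signal run: discard, never keep.
    if trades == 0 or metric == 0:
        return "discard"
    if direction == "higher":
        improved = metric > baseline
    else:
        improved = metric < baseline
    # An unchanged metric counts as "not improved" and is discarded.
    return "keep" if improved else "discard"
```

For example, `decide_status(12.0, 18.0, trades=6)` returns `"keep"`, while an unchanged metric or a run with zero trades returns `"discard"`.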

Rules

Primary metric is king. Improved → keep. Worse or equal → discard. Secondary metrics rarely override this — only discard a primary improvement if a secondary metric degraded catastrophically, and explain why in description.
No signal = discard, never keep. If the experiment produced 0 trades, metric=0, or no measurable signal, this is a degenerate run — the skill stopped doing the thing you're trying to optimize. Discard so the next iteration tries something different. If you see this twice in a row, stop the loop and investigate the skill itself before mutating further. A dead loop produces meaningless commits and burns runs.
State the comparison in description. Every log_experiment should make the before→after explicit (e.g., "reduced entry threshold 0.05→0.03; $12→$18 pnl, 4→6 trades, keep"). This is load-bearing for future iterations and for the dashboard to reason about your decisions.
Never skip the baseline run. The first experiment establishes the reference point.
Always log — even crashes. The data matters for confidence scoring.
Check confidence before trusting results. >=2x noise floor = likely real. <1x = within noise. 1-2x = marginal, re-run to confirm.
Don't chase noise. If confidence is low, the improvement may be random. Try a different approach instead of refining a noisy one.
Don't thrash. Repeatedly reverting the same idea? Try something structurally different.
Simpler is better. Removing code for equal perf = keep. Ugly complexity for tiny gain = probably discard.
Write ideas to autoresearch.ideas.md. Promising but deferred optimizations go here. Check it for experiment paths.

NEVER STOP. The user may be away for hours. Keep the loop running until interrupted.
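The confidence thresholds in the rules above can be written out as a classifier. The source does not say how the confidence score or the noise floor is computed, so this sketch assumes confidence is the absolute change in the primary metric divided by a noise floor supplied by the caller; the function name and return labels are hypothetical.

```python
def classify_confidence(baseline: float, metric: float,
                        noise_floor: float) -> str:
    """Classify an improvement against the noise floor.

    Assumption: confidence = |metric - baseline| / noise_floor.
    """
    if noise_floor <= 0:
        raise ValueError("noise_floor must be positive")
    ratio = abs(metric - baseline) / noise_floor
    if ratio >= 2.0:
        return "likely real"
    if ratio < 1.0:
        return "within noise"
    return "marginal"  # between 1x and 2x: re-run to confirm
```

Under this assumption, a move from 12 to 18 with a noise floor of 2 is 3x the floor and classified as likely real, while 12 to 13 is 0.5x and within noise.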

Crash Recovery

Baseline crash -> autoresearch pauses. The skill is misconfigured or the run command is wrong. Fix, then call init_experiment to start fresh.
3 consecutive crashes -> autoresearch pauses. Something is systematically broken. Investigate: read crash outputs, check git status, try running manually.
Context compaction -> re-read autoresearch.md and autoresearch.jsonl to restore context.
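The two pause conditions above amount to a small piece of state. The sketch below is a hypothetical tracker, not the server's code: it assumes the first recorded run is the baseline and that only the `crash` status counts toward the streak (whether `checks_failed` also counts is not stated in the source).

```python
class CrashTracker:
    """Track crashes and report when the loop should pause."""

    MAX_CONSECUTIVE = 3

    def __init__(self) -> None:
        self.consecutive = 0  # current streak of crashed runs
        self.runs = 0         # total runs recorded so far

    def record(self, status: str) -> bool:
        """Record one run's status; return True if the loop should pause."""
        is_baseline = self.runs == 0
        self.runs += 1
        if status == "crash":
            self.consecutive += 1
            # A baseline crash pauses at once; otherwise pause on the third in a row.
            return is_baseline or self.consecutive >= self.MAX_CONSECUTIVE
        self.consecutive = 0  # any non-crash result resets the streak
        return False
```

A non-crash result between crashes resets the streak, so only three crashes in a row trigger the pause.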

Configuration

Set via environment variables on the MCP server:

| Variable | Default | Purpose |
|----------|---------|---------|
| SIMMER_API_KEY | (required) | API key for dashboard sync and backtest |
| SIMMER_API_URL | https://api.simmer.markets | API base URL |
| AUTORESEARCH_MAX_EXPERIMENTS | 50 | Max experiments per session (0 = unlimited) |
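As an illustration of how these variables and defaults fit together, the sketch below reads them from an environment mapping. The variable names and defaults come from the table above; `AutoresearchConfig` and `load_config` are hypothetical names, and the real MCP server may parse its settings differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AutoresearchConfig:
    api_key: str
    api_url: str
    max_experiments: Optional[int]  # None means unlimited


def load_config(env: Mapping[str, str] = os.environ) -> AutoresearchConfig:
    """Read autoresearch settings, applying the documented defaults."""
    api_key = env.get("SIMMER_API_KEY", "")
    if not api_key:
        raise RuntimeError("SIMMER_API_KEY is required")
    api_url = env.get("SIMMER_API_URL", "https://api.simmer.markets")
    raw_max = int(env.get("AUTORESEARCH_MAX_EXPERIMENTS", "50"))
    # 0 is documented as "unlimited"; represent that as None.
    max_experiments = None if raw_max == 0 else raw_max
    return AutoresearchConfig(api_key, api_url, max_experiments)
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check, for example `load_config({"SIMMER_API_KEY": "k"}).max_experiments` evaluates to `50`.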


Set up and run autonomous experiment loops to optimize Simmer trading skills. Mutates skill code + config, measures P&L, keeps what works. Use when asked to "optimize a skill", "run autoresearch", or "improve my trading".

45 stars

development

Updated May 20, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add spartanlabsxyz/simmer-sdk simmer-autoresearch

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status |
|---------|-------------|--------|
| Trivy | Container and dependency vulnerability scanner | Clean (95%) |
| Semgrep | Static code analysis for vulnerabilities | Clean (95%) |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean (95%) |
| Snyk (dep) | Open source security scanning | Skipped (50%) |
| Socket.dev | Supply chain security analysis | Skipped (50%) |
| VirusTotal | Multi-engine malware detection | Skipped (50%) |
| CrowdStrike | Advanced threat intelligence | Skipped (50%) |
| OSV-Scanner | Open Source Vulnerability database check | Skipped (50%) |
| OWASP Dep-Check | | Skipped (50%) |

Last scanned: May 20, 2026, 5:38 AM (257.2s, 1 file scanned)


Related Skills

spartanlabsxyz/polymarket-worldcup-copytrader
Category: data-ai. Updated Jun 11, 2026.
Copy the top World Cup traders on Polymarket, auto-curated daily by Simmer. No wallet list to configure; the skill sources leaders via PolyNode's slippage-adjusted copy-PnL screen. Regular mode (daily rebalance). Free tier.

spartanlabsxyz/mcp/tests/fixtures
Category: tools. Updated Jun 10, 2026.
Fixture Instruction-Only Skill. This is a Tier-A instruction-only fixture used to verify that invoking an instruction-only skill returns its SKILL.md playbook instead of an error. UNIQUE_FIXTURE_MARKER_4815162342

spartanlabsxyz/polymarket-soccer-shock-ladder
Category: development. Updated Jun 8, 2026.
Fade sharp in-play price shocks on Polymarket soccer markets with a laddered limit-buy strategy (Roan's FIFA-quant framework). Pro skill. Currently scoped to 2026 World Cup markets. Simmer's server detects shocks in real time and emits pre-sized signals; this skill places the recovery ladder and manages the exit.

spartanlabsxyz/polymarket-dca-eval-trader
Category: development. Updated May 31, 2026.
Build and optionally execute a three-tranche Polymarket DCA plan with prop-firm-shaped evaluation envelope checks. Use when the user wants a Bubbles/Roya-style staged averaging template for one thesis, with paper mode by default and explicit live opt-in.

Install manually

# Clone the repo
git clone https://github.com/spartanlabsxyz/simmer-sdk.git

# Copy into Claude Code skills folder (global)
cp -r simmer-sdk/mcp/skills/autoresearch ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

spartanlabsxyz/simmer-sdk

45 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT