Hyperagent

Quick Start — Simple Examples

New to Hyperagent? Try these beginner-friendly tasks before the full setup.

1. Optimize a simple Python script to run faster

Say: "Use hyperagent to optimize this script for speed" and paste something like:

# slow_sort.py
def sort_numbers(nums):
    result = []
    while nums:
        smallest = min(nums)
        result.append(smallest)
        nums.remove(smallest)
    return result

Hyperagent will benchmark it, propose a faster implementation, and validate the improvement.

2. Improve a prompt to get better answers

Say: "Run hyperagent on this prompt and improve accuracy" with a prompt like:

Summarize this article in one sentence.

The meta-agent iterates on the prompt, measures quality, and keeps improvements that score higher.

3. Make a sorting function more efficient

Say: "Evolve this function with hyperagent" and paste any function. Hyperagent creates a benchmark, runs generations of improvements, and shows you the performance gain per generation.

4. Self-improve any script

Say: "Self-improve this agent/script" and point to any Python file. Hyperagent wraps it in an evaluation loop, proposes modifications, and tracks what works.

The simplest possible setup: create task.sh that prints METRIC score=0.5, then run python3 scripts/init_session.py. From there the loop is fully automated.

Self-referential self-improvement: a meta-agent that modifies a task-agent (and itself) to optimize any measurable objective.

Inspired by Facebook Research's Hyperagents paper (arXiv:2603.19461), which demonstrated that agents combining a task-solver and a self-modifying meta-level into a single editable program can achieve open-ended, compounding improvements that transfer across domains.

How It Works

A hyperagent is a system with two components in a single editable codebase:

Task Agent — solves the target task (benchmark, code generation, data processing, etc.)
Meta Agent — analyzes task performance history and proposes modifications to the task agent's code (and optionally its own code)

The key insight from the paper: when the meta-level modification procedure is itself editable, the system can improve not just task performance but also the mechanism that generates future improvements — enabling compounding, transferable gains.

Core Principles

Self-referential modification

The meta-agent can modify the task-agent's code AND its own strategy. Both live in the same editable workspace. This enables metacognitive self-improvement: improving how you improve.
Population-based exploration (archive)

Don't just keep the best variant — maintain an archive of all successful variants as stepping stones. Parent selection favors high performers with unexplored potential.
Empirical evaluation gates everything

No change is accepted without measurement. Every candidate is evaluated against the task benchmark with repeated trials.
Persistent memory and performance tracking

The system maintains a structured history of all experiments, hypotheses, and outcomes. Later generations build on earlier insights — no rediscovering dead ends.
Transfer across domains

Meta-level improvements (performance tracking, evaluation strategies, hypothesis generation patterns) are domain-agnostic and can be transferred to new tasks.

Available Scripts

scripts/common.py — shared utilities (archive management, metrics, reporting)
scripts/init_session.py — initialize a hyperagent session, scaffold the workspace
scripts/run_task.py — evaluate a task-agent variant and record metrics
scripts/log_variant.py — log evaluated record, decide disposition, update archive and reports
scripts/render_report.py — generate HTML report of the full evolutionary history
scripts/select_parent.py — select a parent from the archive for the next generation

Note: There is no generate_variant.py script — the meta-agent role (hypothesis generation and code modification) is performed by the LLM agent itself, not by a script.

All scripts are non-interactive, expose --help, emit structured JSON on stdout, and keep diagnostics on stderr.

Default Workflow

Initialize the session after defining the optimization target:

python3 scripts/init_session.py \
  --goal "Improve prompt accuracy on math benchmark" \
  --metric-name accuracy \
  --unit pct \
  --direction higher \
  --task-command ./task.sh \
  --checks-command ./checks.sh \
  --scope src/agent.py \
  --max-generations 50

Evaluate the baseline (generation 0):

python3 scripts/run_task.py \
  --id gen-000 \
  --hypothesis "Control: unmodified task agent" \
  --change-summary "No modifications" \
  --baseline \
  --output .hyperagent/gen-000.json

python3 scripts/log_variant.py --input .hyperagent/gen-000.json

Selection → Modification → Evaluation loop:

# Select a parent from the archive
python3 scripts/select_parent.py --output .hyperagent/parent.json

# Generate a variant (meta-agent proposes modifications)
# This is where YOU (the LLM agent) act as the meta-agent:
# - Read the parent's code and performance history
# - Hypothesize an improvement
# - Apply code modifications
# - Record what you changed and why

# Evaluate the variant
python3 scripts/run_task.py \
  --id gen-001 \
  --hypothesis "Add chain-of-thought prompting to improve reasoning" \
  --change-summary "Wrap task prompt in step-by-step reasoning template" \
  --parent gen-000 \
  --output .hyperagent/gen-001.json

python3 scripts/log_variant.py --input .hyperagent/gen-001.json

Render reports at any time:
```
python3 scripts/render_report.py
```

Up-Front Q&A

Before starting, gather or confirm:

Objective — what are we optimizing?
Primary metric — exact name, unit, direction (lower/higher)
Task command — the script that runs the task agent and emits METRIC name=value lines
Correctness gates — tests or checks that must pass for a variant to be kept
Scope — which files can the meta-agent modify?
Meta-scope — can the meta-agent modify its own strategy? (default: yes)
Generation budget — max generations before stopping
Minimum improvement threshold — default 1%

Workspace Setup

Prefer a dedicated worktree on a fresh branch:

git worktree add ../hyperagent-<goal>-<date> -b hyperagent/<goal>-<date>

Create:
- hyperagent.md — checked in, durable session brief with full evolutionary history
- task.sh — checked in, benchmark runner (emits METRIC name=value)
- checks.sh — checked in, correctness gates
- .hyperagent/ — local artifact directory, NOT checked in

Ensure artifacts stay untracked:

rg -qxF '.hyperagent/' .git/info/exclude || printf '\n.hyperagent/\n' >> .git/info/exclude

The Meta-Agent Role

You (the LLM) are the meta-agent. Your job each generation is:

Select parent — use scripts/select_parent.py or choose based on the archive
Analyze — read the parent's code, performance history, and past experiment outcomes
Hypothesize — propose a specific, testable modification with a causal theory for why it should help
Modify — apply code changes to the task agent (and optionally to your own strategy notes in hyperagent.md)
Evaluate — run scripts/run_task.py to measure the variant
Log — use scripts/log_variant.py to record the result and update the archive
Reflect — update hyperagent.md with what you learned

Meta-Level Self-Modification

The meta-agent can improve its own process by updating:

Strategy notes in hyperagent.md (hypothesis generation patterns, evaluation heuristics)
Memory entries in .hyperagent/memory.jsonl (qualitative insights, correction plans)
The task evaluation protocol (adding secondary metrics, changing trial counts)

These meta-improvements compound across generations and transfer to new tasks.

Required Files

`hyperagent.md`

The durable contract and evolutionary history. A fresh agent can resume from this.

# Hyperagent: <goal>

## Objective
<What is being optimized and why.>

## Configuration
- Primary metric:
- Unit:
- Direction:
- Minimum improvement: X%
- Task command:
- Correctness gates:
- Generation budget:

## Scope
- Task agent files:
- Meta-agent can self-modify: yes/no

## Archive
`.hyperagent/archive.jsonl`

## Lineage
<Tree showing parent→child relationships and which variants were kept>

## Meta-Strategy
<Current approach to hypothesis generation — updated as the meta-agent learns>

## What We've Learned
<Key wins, dead ends, transferable insights>

## Performance Tracking
<Best variant, improvement trajectory, current plateau status>

`task.sh`

Bash script that runs the task agent and emits METRIC name=value lines:

#!/bin/bash
set -euo pipefail
# Run the task agent
python3 src/agent.py --input data/test.json 2>/dev/null
# The agent script should emit: METRIC accuracy=0.85

Archive Structure

The archive (.hyperagent/archive.jsonl) stores every variant ever evaluated:

{
  "id": "gen-007",
  "generation": 7,
  "parent_id": "gen-003",
  "timestamp": "2026-03-27T20:00:00Z",
  "hypothesis": "Add few-shot examples to improve pattern recognition",
  "change_summary": "Inserted 3 domain-specific examples into the task prompt",
  "files_touched": ["src/agent.py"],
  "metric_name": "accuracy",
  "direction": "higher",
  "warmup_trials": [0.82, 0.83],
  "measured_trials": [0.85, 0.86, 0.84, 0.85, 0.87],
  "summary": {"median": 0.85, "mean": 0.854, "min": 0.84, "max": 0.87},
  "checks": "passed",
  "disposition": "keep",
  "children_count": 0,
  "meta_modifications": ["Updated strategy notes with few-shot pattern"],
  "reason": "Improved by 3.2% over parent gen-003 (0.824). Checks passed."
}

Parent Selection

Selection probability for a parent is proportional to:

Performance score (higher is better for archive diversity)
Inverse of children count (favor unexplored high-performers)

This balances exploitation (good variants) with exploration (understudied variants).

python3 scripts/select_parent.py
# Output: {"selected_parent": "gen-003", "score": 0.824, "children": 1, "reason": "High performer with few children"}

Decision Rules

keep — variant beats current best by ≥ threshold, checks pass
discard — variant is worse, equal, or improvement below threshold
checks_failed — metric improved but correctness gates failed
crash — variant could not be evaluated

Plateau Detection

Track improvement velocity. Stop or pivot when:

3+ consecutive generations with no improvement
Hypothesis diversity drops (recycling ideas)
Improvement velocity < 0.1% per generation over last 5

Loop Behavior

Run autonomously until:

Generation budget exhausted
Plateau detected (3 consecutive non-improvements)
All promising hypotheses explored
User interrupts

During the loop:

One hypothesis per generation
Record dead ends explicitly
Keep the worktree clean between variants (revert discarded changes)
Update hyperagent.md after every generation

Common Pitfalls

1. Meta-Agent Overfitting Its Own Strategy

Symptom: Meta-strategy becomes over-specialized to early successes Fix: Periodically review and broaden the strategy; try categorically different approaches

2. Archive Bloat

Symptom: Archive grows large, selection becomes slow Fix: Archive old generations after 50 variants; maintain a compact summary

3. Self-Modification Destabilizing the Loop

Symptom: Meta-agent modifies evaluation or logging in ways that break the loop Fix: Keep outer-loop scripts (init, run, log, select) immutable. Only modify task code and strategy notes.

4. Hypothesis Recycling

Symptom: Later generations retry earlier failed ideas Fix: Always read .hyperagent/memory.jsonl before proposing. Explicitly check against dead ends.

Transfer Protocol

To transfer meta-improvements to a new domain:

Extract meta-strategy from hyperagent.md "What We've Learned" section
Copy .hyperagent/memory.jsonl as starting knowledge
Initialize new session with transferred strategy as initial context
The meta-agent starts with accumulated wisdom instead of from scratch

Report Generation

python3 scripts/render_report.py

Generates .hyperagent/report.html with:

Lineage tree visualization
Performance over generations
Best-so-far trend
Disposition breakdown
Per-variant trial distributions
Meta-strategy evolution timeline

Hyperagent

Quick Start — Simple Examples

New to Hyperagent? Try these beginner-friendly tasks before the full setup.

1. Optimize a simple Python script to run faster

Say: "Use hyperagent to optimize this script for speed" and paste something like:

# slow_sort.py
def sort_numbers(nums):
    result = []
    while nums:
        smallest = min(nums)
        result.append(smallest)
        nums.remove(smallest)
    return result

Hyperagent will benchmark it, propose a faster implementation, and validate the improvement.

2. Improve a prompt to get better answers

Say: "Run hyperagent on this prompt and improve accuracy" with a prompt like:

Summarize this article in one sentence.

The meta-agent iterates on the prompt, measures quality, and keeps improvements that score higher.

3. Make a sorting function more efficient

Say: "Evolve this function with hyperagent" and paste any function. Hyperagent creates a benchmark, runs generations of improvements, and shows you the performance gain per generation.

4. Self-improve any script

Say: "Self-improve this agent/script" and point to any Python file. Hyperagent wraps it in an evaluation loop, proposes modifications, and tracks what works.

The simplest possible setup: create task.sh that prints METRIC score=0.5, then run python3 scripts/init_session.py. From there the loop is fully automated.

Self-referential self-improvement: a meta-agent that modifies a task-agent (and itself) to optimize any measurable objective.

How It Works

A hyperagent is a system with two components in a single editable codebase:

Task Agent — solves the target task (benchmark, code generation, data processing, etc.)
Meta Agent — analyzes task performance history and proposes modifications to the task agent's code (and optionally its own code)

Core Principles

Self-referential modification

The meta-agent can modify the task-agent's code AND its own strategy. Both live in the same editable workspace. This enables metacognitive self-improvement: improving how you improve.
Population-based exploration (archive)

Don't just keep the best variant — maintain an archive of all successful variants as stepping stones. Parent selection favors high performers with unexplored potential.
Empirical evaluation gates everything

No change is accepted without measurement. Every candidate is evaluated against the task benchmark with repeated trials.
Persistent memory and performance tracking

The system maintains a structured history of all experiments, hypotheses, and outcomes. Later generations build on earlier insights — no rediscovering dead ends.
Transfer across domains

Meta-level improvements (performance tracking, evaluation strategies, hypothesis generation patterns) are domain-agnostic and can be transferred to new tasks.

Available Scripts

scripts/common.py — shared utilities (archive management, metrics, reporting)
scripts/init_session.py — initialize a hyperagent session, scaffold the workspace
scripts/run_task.py — evaluate a task-agent variant and record metrics
scripts/log_variant.py — log evaluated record, decide disposition, update archive and reports
scripts/render_report.py — generate HTML report of the full evolutionary history
scripts/select_parent.py — select a parent from the archive for the next generation

Note: There is no generate_variant.py script — the meta-agent role (hypothesis generation and code modification) is performed by the LLM agent itself, not by a script.

All scripts are non-interactive, expose --help, emit structured JSON on stdout, and keep diagnostics on stderr.

Default Workflow

Initialize the session after defining the optimization target:

python3 scripts/init_session.py \
  --goal "Improve prompt accuracy on math benchmark" \
  --metric-name accuracy \
  --unit pct \
  --direction higher \
  --task-command ./task.sh \
  --checks-command ./checks.sh \
  --scope src/agent.py \
  --max-generations 50

Evaluate the baseline (generation 0):

python3 scripts/run_task.py \
  --id gen-000 \
  --hypothesis "Control: unmodified task agent" \
  --change-summary "No modifications" \
  --baseline \
  --output .hyperagent/gen-000.json

python3 scripts/log_variant.py --input .hyperagent/gen-000.json

Selection → Modification → Evaluation loop:

# Select a parent from the archive
python3 scripts/select_parent.py --output .hyperagent/parent.json

# Generate a variant (meta-agent proposes modifications)
# This is where YOU (the LLM agent) act as the meta-agent:
# - Read the parent's code and performance history
# - Hypothesize an improvement
# - Apply code modifications
# - Record what you changed and why

# Evaluate the variant
python3 scripts/run_task.py \
  --id gen-001 \
  --hypothesis "Add chain-of-thought prompting to improve reasoning" \
  --change-summary "Wrap task prompt in step-by-step reasoning template" \
  --parent gen-000 \
  --output .hyperagent/gen-001.json

python3 scripts/log_variant.py --input .hyperagent/gen-001.json

Render reports at any time:
```
python3 scripts/render_report.py
```

Up-Front Q&A

Before starting, gather or confirm:

Objective — what are we optimizing?
Primary metric — exact name, unit, direction (lower/higher)
Task command — the script that runs the task agent and emits METRIC name=value lines
Correctness gates — tests or checks that must pass for a variant to be kept
Scope — which files can the meta-agent modify?
Meta-scope — can the meta-agent modify its own strategy? (default: yes)
Generation budget — max generations before stopping
Minimum improvement threshold — default 1%

Workspace Setup

Prefer a dedicated worktree on a fresh branch:

git worktree add ../hyperagent-<goal>-<date> -b hyperagent/<goal>-<date>

Create:
- hyperagent.md — checked in, durable session brief with full evolutionary history
- task.sh — checked in, benchmark runner (emits METRIC name=value)
- checks.sh — checked in, correctness gates
- .hyperagent/ — local artifact directory, NOT checked in

Ensure artifacts stay untracked:

rg -qxF '.hyperagent/' .git/info/exclude || printf '\n.hyperagent/\n' >> .git/info/exclude

The Meta-Agent Role

You (the LLM) are the meta-agent. Your job each generation is:

Select parent — use scripts/select_parent.py or choose based on the archive
Analyze — read the parent's code, performance history, and past experiment outcomes
Hypothesize — propose a specific, testable modification with a causal theory for why it should help
Modify — apply code changes to the task agent (and optionally to your own strategy notes in hyperagent.md)
Evaluate — run scripts/run_task.py to measure the variant
Log — use scripts/log_variant.py to record the result and update the archive
Reflect — update hyperagent.md with what you learned

Meta-Level Self-Modification

The meta-agent can improve its own process by updating:

Strategy notes in hyperagent.md (hypothesis generation patterns, evaluation heuristics)
Memory entries in .hyperagent/memory.jsonl (qualitative insights, correction plans)
The task evaluation protocol (adding secondary metrics, changing trial counts)

These meta-improvements compound across generations and transfer to new tasks.

Required Files

`hyperagent.md`

The durable contract and evolutionary history. A fresh agent can resume from this.

# Hyperagent: <goal>

## Objective
<What is being optimized and why.>

## Configuration
- Primary metric:
- Unit:
- Direction:
- Minimum improvement: X%
- Task command:
- Correctness gates:
- Generation budget:

## Scope
- Task agent files:
- Meta-agent can self-modify: yes/no

## Archive
`.hyperagent/archive.jsonl`

## Lineage
<Tree showing parent→child relationships and which variants were kept>

## Meta-Strategy
<Current approach to hypothesis generation — updated as the meta-agent learns>

## What We've Learned
<Key wins, dead ends, transferable insights>

## Performance Tracking
<Best variant, improvement trajectory, current plateau status>

`task.sh`

Bash script that runs the task agent and emits METRIC name=value lines:

#!/bin/bash
set -euo pipefail
# Run the task agent
python3 src/agent.py --input data/test.json 2>/dev/null
# The agent script should emit: METRIC accuracy=0.85

Archive Structure

The archive (.hyperagent/archive.jsonl) stores every variant ever evaluated:

{
  "id": "gen-007",
  "generation": 7,
  "parent_id": "gen-003",
  "timestamp": "2026-03-27T20:00:00Z",
  "hypothesis": "Add few-shot examples to improve pattern recognition",
  "change_summary": "Inserted 3 domain-specific examples into the task prompt",
  "files_touched": ["src/agent.py"],
  "metric_name": "accuracy",
  "direction": "higher",
  "warmup_trials": [0.82, 0.83],
  "measured_trials": [0.85, 0.86, 0.84, 0.85, 0.87],
  "summary": {"median": 0.85, "mean": 0.854, "min": 0.84, "max": 0.87},
  "checks": "passed",
  "disposition": "keep",
  "children_count": 0,
  "meta_modifications": ["Updated strategy notes with few-shot pattern"],
  "reason": "Improved by 3.2% over parent gen-003 (0.824). Checks passed."
}

Parent Selection

Selection probability for a parent is proportional to:

Performance score (higher is better for archive diversity)
Inverse of children count (favor unexplored high-performers)

This balances exploitation (good variants) with exploration (understudied variants).

python3 scripts/select_parent.py
# Output: {"selected_parent": "gen-003", "score": 0.824, "children": 1, "reason": "High performer with few children"}

Decision Rules

keep — variant beats current best by ≥ threshold, checks pass
discard — variant is worse, equal, or improvement below threshold
checks_failed — metric improved but correctness gates failed
crash — variant could not be evaluated

Plateau Detection

Track improvement velocity. Stop or pivot when:

3+ consecutive generations with no improvement
Hypothesis diversity drops (recycling ideas)
Improvement velocity < 0.1% per generation over last 5

Loop Behavior

Run autonomously until:

Generation budget exhausted
Plateau detected (3 consecutive non-improvements)
All promising hypotheses explored
User interrupts

During the loop:

One hypothesis per generation
Record dead ends explicitly
Keep the worktree clean between variants (revert discarded changes)
Update hyperagent.md after every generation

Common Pitfalls

1. Meta-Agent Overfitting Its Own Strategy

Symptom: Meta-strategy becomes over-specialized to early successes Fix: Periodically review and broaden the strategy; try categorically different approaches

2. Archive Bloat

Symptom: Archive grows large, selection becomes slow Fix: Archive old generations after 50 variants; maintain a compact summary

3. Self-Modification Destabilizing the Loop

Symptom: Meta-agent modifies evaluation or logging in ways that break the loop Fix: Keep outer-loop scripts (init, run, log, select) immutable. Only modify task code and strategy notes.

4. Hypothesis Recycling

Symptom: Later generations retry earlier failed ideas Fix: Always read .hyperagent/memory.jsonl before proposing. Explicitly check against dead ends.

Transfer Protocol

To transfer meta-improvements to a new domain:

Extract meta-strategy from hyperagent.md "What We've Learned" section
Copy .hyperagent/memory.jsonl as starting knowledge
Initialize new session with transferred strategy as initial context
The meta-agent starts with accumulated wisdom instead of from scratch

Report Generation

python3 scripts/render_report.py

Generates .hyperagent/report.html with:

Lineage tree visualization
Performance over generations
Best-so-far trend
Disposition breakdown
Per-variant trial distributions
Meta-strategy evolution timeline

Adoption

ckorhonen/hyperagent

$ install --global

Security Scan Results

SKILL.md

Hyperagent

Quick Start — Simple Examples

How It Works

Core Principles

Available Scripts

Default Workflow

Up-Front Q&A

Workspace Setup

The Meta-Agent Role

Meta-Level Self-Modification

Required Files

hyperagent.md

task.sh

Archive Structure

Parent Selection

Decision Rules

Plateau Detection

Loop Behavior

Common Pitfalls

1. Meta-Agent Overfitting Its Own Strategy

2. Archive Bloat

3. Self-Modification Destabilizing the Loop

4. Hypothesis Recycling

Transfer Protocol

Report Generation

Related Skills

ckorhonen/idea-md

ckorhonen/writing-plans

ckorhonen/video-editor

ckorhonen/ui-design

ckorhonen/hyperagent

$ install --global

Security Scan Results

SKILL.md

Hyperagent

Quick Start — Simple Examples

How It Works

Core Principles

Available Scripts

Default Workflow

Up-Front Q&A

Workspace Setup

The Meta-Agent Role

Meta-Level Self-Modification

Required Files

hyperagent.md

task.sh

Archive Structure

Parent Selection

Decision Rules

Plateau Detection

Loop Behavior

Common Pitfalls

1. Meta-Agent Overfitting Its Own Strategy

2. Archive Bloat

3. Self-Modification Destabilizing the Loop

4. Hypothesis Recycling

Transfer Protocol

Report Generation

Related Skills

ckorhonen/idea-md

ckorhonen/writing-plans

ckorhonen/video-editor

ckorhonen/ui-design

`hyperagent.md`

`task.sh`

`hyperagent.md`

`task.sh`