Benchmark Memory System

Systematic benchmarking framework to measure retrieval quality, compare configurations, and identify optimal parameters for the Local Brain Search memory system.

Purpose

Measure retrieval quality objectively using LLM-as-judge scoring
Compare different configuration settings (spreading vs static, parameter sweeps)
Identify optimal parameters for different query types (factual, conceptual, synthesis)
Generate reproducible results against frozen test datasets

Design Principles

Contained: Skill + sub-agent + bundled scripts
Reproducible: Test against frozen Brain snapshot
Automated: LLM-as-judge for relevance scoring
Analyzable: CSV output for analysis

State Dependencies

| Source | Location | Read | Write |
|--------|----------|------|-------|
| Brain snapshot | .claude/skills/benchmark-memory/snapshots/ | Yes | Yes |
| Query sets | .claude/skills/benchmark-memory/query-sets/ | Yes | Yes |
| Benchmark results | .claude/skills/benchmark-memory/results/ | Yes | Yes |
| Analysis reports | .claude/skills/benchmark-memory/analysis/ | No | Yes |
| Memory system | resources/local-brain-search/ | Yes | No |

Prerequisites

Local Brain Search system indexed (resources/local-brain-search/data/brain.faiss)
Python venv at resources/local-brain-search/venv/ with search dependencies
Claude Code CLI installed and authenticated (for LLM-as-judge scoring via headless mode)

LLM-as-Judge Scoring

This skill scores relevance through Claude Code headless mode (claude -p) rather than through a separate API key. This means:

No ANTHROPIC_API_KEY environment variable needed
Uses your existing Claude Code authentication
Default model: sonnet (good quality); haiku (faster, cheaper) and opus can also be used
JSON output via prompt engineering for reliable scoring
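
A minimal sketch of what one judging call could look like from Python, assuming the judge is asked to reply with a JSON object of the form {"score": N}. The prompt wording, the JSON shape, and the helper names are illustrative assumptions; the actual template lives in configs/judge_prompt.txt and is not reproduced here. The sketch relies only on the claude -p print mode and the --model option of the Claude Code CLI.

```python
import json
import re
import subprocess


def build_judge_command(prompt, model="sonnet"):
    """Return the argv for one headless Claude Code call."""
    return ["claude", "-p", prompt, "--model", model]


def parse_score(reply):
    """Pull the first JSON object out of the reply and validate the score."""
    match = re.search(r"\{.*?\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    score = int(json.loads(match.group(0))["score"])
    if not 0 <= score <= 3:
        raise ValueError(f"score {score} is outside the 0-3 scale")
    return score


def judge(query, note_text, model="sonnet"):
    """Score one retrieved note against one query (hypothetical prompt)."""
    prompt = (
        "Rate how relevant the note is to the query on a 0-3 scale. "
        'Reply with JSON only, for example {"score": 2}.\n\n'
        f"Query: {query}\n\nNote:\n{note_text}"
    )
    result = subprocess.run(
        build_judge_command(prompt, model),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_score(result.stdout)
```

Searching for the JSON object rather than parsing the whole reply tolerates a judge that adds a sentence before or after it, which is the usual failure mode when JSON is requested through the prompt alone.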

To verify Claude Code is available:

claude --version

Installing Dependencies

Dependencies are installed in the local-brain-search venv:

cd resources/local-brain-search
source venv/bin/activate
pip install pandas tqdm  # anthropic not required - uses Claude Code headless

Sub-Commands

`/benchmark-memory setup`

Create a frozen Brain snapshot and build its index.

cd .claude/skills/benchmark-memory/scripts
./run_benchmark.sh --list-snapshots  # Check existing
python3 create_snapshot.py           # Create new snapshot

What it does:

Creates snapshot directory with date stamp
Copies Brain folder (excluding .obsidian, .trash)
Builds FAISS index for the snapshot
Creates SNAPSHOT-INFO.md with metadata
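
The copy step above can be sketched as follows. This is not the code in create_snapshot.py; it is a small illustration under stated assumptions: notes are Markdown files, the directory name follows the brain-snapshot-YYYY-MM-DD pattern used in the run examples, and the contents of SNAPSHOT-INFO.md are made up. Building the FAISS index is left out.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def create_snapshot(brain_dir: Path, snapshots_dir: Path,
                    today: Optional[date] = None) -> Path:
    """Copy the Brain folder into a date-stamped snapshot directory."""
    stamp = (today or date.today()).isoformat()
    target = snapshots_dir / f"brain-snapshot-{stamp}"

    # copytree refuses to overwrite an existing directory, so an existing
    # snapshot with the same date has to be removed first.
    shutil.copytree(
        brain_dir,
        target,
        ignore=shutil.ignore_patterns(".obsidian", ".trash"),
    )

    # Count notes before writing the info file so it is not counted itself.
    note_count = sum(1 for _ in target.rglob("*.md"))
    info = f"Snapshot date: {stamp}\nNotes: {note_count}\n"
    (target / "SNAPSHOT-INFO.md").write_text(info, encoding="utf-8")
    return target
```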

`/benchmark-memory create-queries [--count N]`

Generate or manage test query sets.

cd .claude/skills/benchmark-memory/scripts
python build_query_set.py --count 50 --output ../query-sets/core-50.json

Query Categories (50 total):

| Category | Count | Example |
|----------|-------|---------|
| Factual | 10 | "What is dopamine?" |
| Conceptual | 10 | "How does motivation work?" |
| Synthesis | 15 | "Connect Buddhism and neuroscience" |
| Temporal | 5 | "Recent notes about AI agents" |
| Needle | 5 | "Note about intermittent reinforcement" |
| Broad | 5 | "Identity" |
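
A quick way to check that a query set matches this distribution is sketched below. The layout of the query file is an assumption: the sketch treats it as a list of objects that each carry a "category" field, which may differ from what build_query_set.py actually writes.

```python
from collections import Counter

# Target counts taken from the category table.
EXPECTED_COUNTS = {
    "factual": 10,
    "conceptual": 10,
    "synthesis": 15,
    "temporal": 5,
    "needle": 5,
    "broad": 5,
}


def distribution_mismatches(queries):
    """Return {category: (actual, expected)} for every category that is off."""
    actual = Counter(q["category"] for q in queries)
    return {
        category: (actual.get(category, 0), expected)
        for category, expected in EXPECTED_COUNTS.items()
        if actual.get(category, 0) != expected
    }
```

An empty result means all six categories are present with the intended counts.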

`/benchmark-memory run [--config CONFIG] [--snapshot SNAPSHOT]`

Execute benchmark with specified configuration.

cd .claude/skills/benchmark-memory/scripts
./run_benchmark.sh --config focused --snapshot brain-snapshot-2026-02-18
./run_benchmark.sh --dry-run --config focused  # Preview without execution
./run_benchmark.sh --list-configs              # List available configs

Configurations:

focused: 15 key configurations (recommended for initial benchmarking)
single:CONFIG_NAME: Run single configuration
all: Full parameter sweep (expensive)

Estimated cost per run:

50 queries x 10 results x 15 configs = 7,500 LLM scores
Using Sonnet (default): ~$75
Using Haiku: ~$7.50
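
The arithmetic behind these figures, as a small sketch. The per-score prices are backed out of the totals above (75 / 7,500 = 0.01 and 7.50 / 7,500 = 0.001) and are not published pricing; treat them as rough planning numbers.

```python
# Per-score prices implied by the estimates above (assumed, not official).
COST_PER_SCORE = {"sonnet": 0.01, "haiku": 0.001}


def estimate_cost(queries=50, results_per_query=10, configs=15, model="sonnet"):
    """Return (number of judge calls, estimated cost in dollars)."""
    scores = queries * results_per_query * configs
    return scores, scores * COST_PER_SCORE[model]


for model in ("sonnet", "haiku"):
    scores, dollars = estimate_cost(model=model)
    print(f"{model}: {scores} scores, about ${dollars:.2f}")
```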

`/benchmark-memory analyze [--results FILE]`

Generate analysis summary from benchmark results.

cd .claude/skills/benchmark-memory/scripts
python analyze_results.py --results ../results/benchmark-*.csv

Outputs:

Summary by configuration
Summary by query category
Best config per query category
Recommendations
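
As an illustration of the per-category summary, the sketch below picks the configuration with the highest mean avg_score for each query category. It uses only the config_name, query_category, and avg_score columns from the results schema and the standard library; how analyze_results.py actually ranks configurations is not documented here, so the ranking rule is an assumption.

```python
import csv
from collections import defaultdict
from statistics import mean


def load_rows(path):
    """Read a results CSV into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def best_config_per_category(rows):
    """Return {category: (config_name, mean avg_score)} for the top config."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["query_category"], row["config_name"])
        grouped[key].append(float(row["avg_score"]))

    best = {}
    for (category, config), scores in grouped.items():
        average = mean(scores)
        if category not in best or average > best[category][1]:
            best[category] = (config, average)
    return best
```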

Workflow

Step 1: Setup (one-time)
/benchmark-memory setup
    |
    v
Step 2: Create Queries (one-time)
/benchmark-memory create-queries --count 50
    |
    v
Step 3: Run Benchmark (per experiment)
/benchmark-memory run --config focused
    |
    v
Step 4: Analyze Results
/benchmark-memory analyze

Metrics Collected

Performance Metrics

| Metric | Description |
|--------|-------------|
| latency_ms | Query execution time |
| iterations | Spreading iterations used |
| converged | Whether spreading converged |

Quality Metrics

| Metric | Range | Description |
|--------|-------|-------------|
| Precision@K | 0-1 | Fraction of results that are relevant |
| Recall@K | 0-1 | Fraction of relevant notes found |
| MRR | 0-1 | Mean Reciprocal Rank |
| NDCG@K | 0-1 | Ranking quality with position discount |
| Avg Score | 0-3 | Average LLM relevance score |
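
A sketch of how three of these metrics can be computed from the judge scores of one ranked result list. Two assumptions are baked in and may not match compute_metrics.py: a result counts as relevant when its score is 2 or higher, and the ideal ranking for NDCG is built from the retrieved results only. The NDCG form used is the common exponential-gain variant.

```python
import math


def precision_at_k(scores, k, threshold=2):
    """Fraction of the top-k results scored at or above the threshold."""
    if k <= 0:
        return 0.0
    return sum(score >= threshold for score in scores[:k]) / k


def reciprocal_rank(scores, threshold=2):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, score in enumerate(scores, start=1):
        if score >= threshold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(scores, k):
    """DCG of the top-k results divided by the DCG of the best ordering."""
    def dcg(values):
        return sum((2 ** v - 1) / math.log2(i + 2) for i, v in enumerate(values))

    ideal = dcg(sorted(scores, reverse=True)[:k])
    return dcg(scores[:k]) / ideal if ideal > 0 else 0.0
```

MRR is the mean of reciprocal_rank over all queries. Recall@K is left out because it needs the full set of relevant notes for each query, which judge scores on retrieved results alone do not provide.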

LLM-as-Judge Scoring Scale

| Score | Label | Definition |
|-------|-------|------------|
| 0 | Irrelevant | No connection to query |
| 1 | Tangential | Loosely related |
| 2 | Relevant | Addresses the query |
| 3 | Highly Relevant | Directly answers the query |

Configurations to Test

Baseline

static_baseline: Traditional vector search
spreading_default: Spreading activation with defaults

Parameter Sweeps

Iteration count: 2, 5, 7
Inhibition strength: 0.1, 0.3, 0.5
Temporal decay: 0.8, 0.9, 0.95
Q-weight: 0.0, 0.3, 0.5
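
To see why a full sweep is expensive, the sketch below expands the four parameter lists into every combination. Whether the all option enumerates the full cross product is an assumption, and the dictionary keys are illustrative names borrowed from the optimized combinations and the q_weight question elsewhere in this document.

```python
from itertools import product

# Sweep values as listed above.
SWEEP = {
    "max_iterations": [2, 5, 7],
    "inhibition": [0.1, 0.3, 0.5],
    "decay": [0.8, 0.9, 0.95],
    "q_weight": [0.0, 0.3, 0.5],
}


def full_grid(sweep):
    """Return one dict per combination of the swept parameter values."""
    keys = list(sweep)
    return [
        dict(zip(keys, values))
        for values in product(*(sweep[key] for key in keys))
    ]


grid = full_grid(SWEEP)
print(len(grid), "combinations")
```

With three values per parameter this gives 3 x 3 x 3 x 3 = 81 combinations. At 50 queries and 10 scored results each, that is 81 x 50 x 10 = 40,500 judge calls, compared with 7,500 for the 15 focused configurations.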

Optimized Combinations

synthesis_optimized: max_iterations=7, inhibition=0.1
factual_optimized: max_iterations=2, inhibition=0.5
balanced_optimized: max_iterations=5, inhibition=0.2, decay=0.85

Output Files

Results CSV

results/benchmark-YYYY-MM-DD-HHMMSS.csv

Schema:

timestamp,config_name,query_id,query_category,mode,max_iterations,
inhibition_strength,latency_ms,result_1_note,result_1_score,...,
precision_at_5,precision_at_10,mrr,ndcg_at_10,avg_score

Analysis Report

analysis/report-YYYY-MM-DD.md

Error Recovery

Partial benchmark run

Results are appended incrementally. Resume by running with --resume:

python run_benchmark.py --config focused --resume
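
One way such a resume could work, sketched under the assumption that each CSV row records one (config_name, query_id) pair, as the results schema suggests. The actual behaviour of --resume in run_benchmark.py is not shown in this document, and the function names are hypothetical.

```python
import csv
from pathlib import Path


def completed_pairs(results_csv: Path):
    """Return the set of (config_name, query_id) pairs already written."""
    if not results_csv.exists():
        return set()
    with results_csv.open(newline="", encoding="utf-8") as handle:
        return {
            (row["config_name"], row["query_id"])
            for row in csv.DictReader(handle)
        }


def pending_pairs(config_names, query_ids, done):
    """List the pairs that still need to run, in a stable order."""
    return [
        (config, query)
        for config in config_names
        for query in query_ids
        if (config, query) not in done
    ]
```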

API rate limits

Built-in retry with exponential backoff. Adjust --delay if needed:

python run_benchmark.py --config focused --delay 1.0
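
For readers unfamiliar with the pattern, a generic exponential backoff wrapper looks roughly like this. It is a sketch of the technique, not the retry code in the benchmark scripts, and how the --delay flag feeds into the real implementation is not documented here.

```python
import time


def with_retry(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(), doubling the wait after each failure, then give up."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # Re-raise once the last allowed attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The sleep parameter is injectable so the wrapper can be exercised without actually waiting.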

Invalid snapshot

Re-create snapshot:

python create_snapshot.py --force

Success Criteria

[ ] Snapshot creation produces valid index
[ ] Query set covers all 6 categories (50+ queries)
[ ] LLM judge produces consistent scores (>80% agreement on re-run)
[ ] All 15 focused configs can be benchmarked
[ ] Results CSV is valid and analyzable
[ ] Analysis identifies best config per query type
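
The judge consistency criterion can be checked with a few lines once the same query and result pairs have been scored twice. Treating agreement as an exact score match is an assumption; a looser definition, such as scores within one point, would give a higher rate.

```python
def agreement_rate(first_run, second_run):
    """Fraction of items that received the same score in both runs."""
    if not first_run or len(first_run) != len(second_run):
        raise ValueError("both runs must score the same non-empty item list")
    matches = sum(a == b for a, b in zip(first_run, second_run))
    return matches / len(first_run)
```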

Expected Insights

After running benchmarks, answer these questions:

Does spreading beat static? For which query types?
What's the optimal iteration count for synthesis queries?
Does high inhibition help factual queries?
Does q_weight > 0 improve results over time?
What settings work best for each query type?

Cost Management

| Strategy | Savings | Trade-off |
|----------|---------|-----------|
| Score top 5 only | 50% | Less data on long-tail |
| Use Haiku judge | 90% | Slightly less accurate |
| Cache scores | Variable | Only for unchanged retrieval |

Recommendation: Start with the Haiku judge, then validate a sample of its scores against Sonnet.
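
The score caching strategy from the table can be sketched as an in-memory cache keyed on the query, the note content, and the judge model, so a cached score is reused only when all three are unchanged. The class and its key scheme are illustrative assumptions; whether and how the benchmark scripts cache scores is not described in this document.

```python
import hashlib
import json


class ScoreCache:
    """In-memory cache of judge scores."""

    def __init__(self):
        self._scores = {}

    @staticmethod
    def make_key(query, note_text, model):
        """Hash the inputs that determine a score into a stable key."""
        payload = json.dumps([query, note_text, model], ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_score(self, query, note_text, model, score_fn):
        """Return the cached score, calling score_fn only on a miss."""
        key = self.make_key(query, note_text, model)
        if key not in self._scores:
            self._scores[key] = score_fn(query, note_text)
        return self._scores[key]
```

Because many configurations return overlapping results for the same query, the savings depend on how much the result lists overlap, which is why the table lists them as variable.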

Directory Structure

.claude/skills/benchmark-memory/
├── SKILL.md                    # This file
├── requirements.txt            # Python dependencies
├── scripts/
│   ├── run_benchmark.sh        # Wrapper script (uses venv Python)
│   ├── create_snapshot.py      # Create frozen Brain snapshot
│   ├── build_query_set.py      # Generate/manage query test set
│   ├── run_benchmark.py        # Execute benchmark with config
│   ├── score_results.py        # LLM-as-judge scoring
│   ├── compute_metrics.py      # Calculate evaluation metrics
│   └── analyze_results.py      # Generate analysis summary
├── configs/
│   ├── focused_configs.json    # Test configurations (15 configs)
│   └── judge_prompt.txt        # LLM judge prompt template
├── snapshots/                  # Frozen Brain copies
├── query-sets/                 # Test queries (core-50.json included)
├── results/                    # Benchmark CSVs
└── analysis/                   # Analysis reports

Tested Components

| Component | Status | Notes |
|-----------|--------|-------|
| Snapshot creation | ✅ Works | Creates snapshot with FAISS index + graph |
| Query set | ✅ Works | 50 queries across 6 categories |
| Static search | ✅ Works | Traditional vector similarity |
| Spreading search | ✅ Works | Multi-iteration activation |
| 15 configs | ✅ Works | Focused parameter sweep |
| LLM-as-judge | ✅ Works | Uses Claude Code headless mode (claude -p) |
| Results CSV | Ready | Incremental writes, resume support |
