.claude/skills/benchmark-memory/SKILL.md
Systematic benchmarking framework for the Local Brain Search memory system, with LLM-as-judge scoring
Systematic benchmarking framework to measure retrieval quality, compare configurations, and identify optimal parameters for the Local Brain Search memory system.
| Source | Location | Read | Write |
|--------|----------|------|-------|
| Brain snapshot | .claude/skills/benchmark-memory/snapshots/ | Yes | Yes |
| Query sets | .claude/skills/benchmark-memory/query-sets/ | Yes | Yes |
| Benchmark results | .claude/skills/benchmark-memory/results/ | Yes | Yes |
| Analysis reports | .claude/skills/benchmark-memory/analysis/ | No | Yes |
| Memory system | resources/local-brain-search/ | Yes | No |
- Local Brain Search index (resources/local-brain-search/data/brain.faiss)
- resources/local-brain-search/venv/ with search dependencies

This skill uses Claude Code headless mode (claude -p) for LLM relevance scoring, not a separate API key. This means:

- No ANTHROPIC_API_KEY environment variable needed
- Model: sonnet (good quality); haiku (faster/cheaper) or opus can also be used

To verify Claude Code is available:
claude --version
Dependencies are installed in the local-brain-search venv:
cd resources/local-brain-search
source venv/bin/activate
pip install pandas tqdm # anthropic not required - uses Claude Code headless
/benchmark-memory setup

Create a frozen Brain snapshot and build its index.
cd .claude/skills/benchmark-memory/scripts
./run_benchmark.sh --list-snapshots # Check existing
python3 create_snapshot.py # Create new snapshot
What it does:
/benchmark-memory create-queries [--count N]

Generate or manage test query sets.
cd .claude/skills/benchmark-memory/scripts
python build_query_set.py --count 50 --output ../query-sets/core-50.json
Query Categories (50 total):

| Category | Count | Example |
|----------|-------|---------|
| Factual | 10 | "What is dopamine?" |
| Conceptual | 10 | "How does motivation work?" |
| Synthesis | 15 | "Connect Buddhism and neuroscience" |
| Temporal | 5 | "Recent notes about AI agents" |
| Needle | 5 | "Note about intermittent reinforcement" |
| Broad | 5 | "Identity" |
/benchmark-memory run [--config CONFIG] [--snapshot SNAPSHOT]

Execute benchmark with specified configuration.
cd .claude/skills/benchmark-memory/scripts
./run_benchmark.sh --config focused --snapshot brain-snapshot-2026-02-18
./run_benchmark.sh --dry-run --config focused # Preview without execution
./run_benchmark.sh --list-configs # List available configs
Configurations:
- focused: 15 key configurations (recommended for initial benchmarking)
- single:CONFIG_NAME: Run single configuration
- all: Full parameter sweep (expensive)

Estimated cost per run:
/benchmark-memory analyze [--results FILE]

Generate analysis summary from benchmark results.
cd .claude/skills/benchmark-memory/scripts
python analyze_results.py --results ../results/benchmark-*.csv
Outputs:
Step 1: Setup (one-time)
/benchmark-memory setup
|
v
Step 2: Create Queries (one-time)
/benchmark-memory create-queries --count 50
|
v
Step 3: Run Benchmark (per experiment)
/benchmark-memory run --config focused
|
v
Step 4: Analyze Results
/benchmark-memory analyze
| Metric | Description |
|--------|-------------|
| latency_ms | Query execution time |
| iterations | Spreading iterations used |
| converged | Whether spreading converged |
| Metric | Range | Description |
|--------|-------|-------------|
| Precision@K | 0-1 | Fraction of results that are relevant |
| Recall@K | 0-1 | Fraction of relevant notes found |
| MRR | 0-1 | Mean Reciprocal Rank |
| NDCG@K | 0-1 | Ranking quality with position discount |
| Avg Score | 0-3 | Average LLM relevance score |
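To make the ranking metrics concrete, here is a minimal sketch of how they could be computed from one query's ordered list of judge scores. It is not the skill's compute_metrics.py: the function names, the relevance threshold (a score of 2 or higher counts as relevant), and the use of the raw score as the NDCG gain are all assumptions. The ideal ordering is taken from the retrieved results only, so this NDCG measures ordering quality, not coverage.

```python
import math

RELEVANT_THRESHOLD = 2  # assumption: scores of 2 or 3 count as relevant


def precision_at_k(scores, k):
    """Fraction of the top-k positions filled by a relevant result."""
    top = scores[:k]
    if not top:
        return 0.0
    return sum(s >= RELEVANT_THRESHOLD for s in top) / k


def reciprocal_rank(scores):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, s in enumerate(scores, start=1):
        if s >= RELEVANT_THRESHOLD:
            return 1.0 / rank
    return 0.0


def dcg(scores):
    """Discounted cumulative gain: graded gain with a log2 position discount."""
    return sum(s / math.log2(i + 2) for i, s in enumerate(scores))


def ndcg_at_k(scores, k):
    """DCG of the top-k results divided by the DCG of their ideal ordering."""
    ideal = dcg(sorted(scores, reverse=True)[:k])
    return dcg(scores[:k]) / ideal if ideal > 0 else 0.0


# Judge scores for one query, in the order the results were returned.
scores = [3, 0, 2, 1, 0]
print(precision_at_k(scores, 5), reciprocal_rank(scores), ndcg_at_k(scores, 5))
```

MRR is then the mean of `reciprocal_rank` over all queries in the set. Recall@K is left out because it needs the full set of relevant notes per query, which this sketch does not assume is available.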
| Score | Label | Definition |
|-------|-------|------------|
| 0 | Irrelevant | No connection to query |
| 1 | Tangential | Loosely related |
| 2 | Relevant | Addresses the query |
| 3 | Highly Relevant | Directly answers the query |
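The rubric above could be applied through Claude Code headless mode roughly as follows. This is a sketch under stated assumptions, not the skill's score_results.py: the prompt wording is hypothetical (the real template lives in configs/judge_prompt.txt), the helper names are invented, and passing the model through `--model` is my assumption about how the judge model is selected.

```python
import re
import subprocess

# Rubric text taken from the scoring table.
RUBRIC = (
    "0 = Irrelevant: no connection to query\n"
    "1 = Tangential: loosely related\n"
    "2 = Relevant: addresses the query\n"
    "3 = Highly Relevant: directly answers the query"
)


def build_prompt(query, note_text):
    """Assemble a judge prompt (hypothetical wording, not judge_prompt.txt)."""
    return (
        f"Rate how relevant the note is to the query.\n{RUBRIC}\n\n"
        f"Query: {query}\n\nNote:\n{note_text}\n\n"
        "Reply with a single digit from 0 to 3."
    )


def parse_score(reply):
    """Return the first standalone digit 0-3 in the reply, or None."""
    match = re.search(r"\b[0-3]\b", reply)
    return int(match.group()) if match else None


def judge(query, note_text, model="sonnet"):
    """Call Claude Code in headless mode and parse the score from stdout."""
    result = subprocess.run(
        ["claude", "-p", build_prompt(query, note_text), "--model", model],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_score(result.stdout)
```

Returning None for an unparseable reply, instead of defaulting to 0, keeps judge failures distinguishable from genuinely irrelevant results when the scores are aggregated later.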
- static_baseline: Traditional vector search
- spreading_default: Spreading activation with defaults
- synthesis_optimized: max_iterations=7, inhibition=0.1
- factual_optimized: max_iterations=2, inhibition=0.5
- balanced_optimized: max_iterations=5, inhibition=0.2, decay=0.85

results/benchmark-YYYY-MM-DD-HHMMSS.csv
Schema:
timestamp,config_name,query_id,query_category,mode,max_iterations,
inhibition_strength,latency_ms,result_1_note,result_1_score,...,
precision_at_5,precision_at_10,mrr,ndcg_at_10,avg_score
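A results file with this schema can be summarised with the standard library alone. The sketch below averages one numeric column per config_name. The sample rows and the config names `config_a` and `config_b` are made up purely so the example runs on its own; they are not benchmark results, and `mean_by_config` is a hypothetical helper, not part of the skill's scripts.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up rows using a subset of the schema columns, for illustration only.
SAMPLE = """config_name,query_category,latency_ms,avg_score
config_a,factual,40,2.0
config_a,synthesis,42,1.0
config_b,factual,90,2.0
config_b,synthesis,110,2.5
"""


def mean_by_config(csv_file, column):
    """Average a numeric column per config_name."""
    values = defaultdict(list)
    for row in csv.DictReader(csv_file):
        values[row["config_name"]].append(float(row[column]))
    return {name: mean(vals) for name, vals in values.items()}


summary = mean_by_config(io.StringIO(SAMPLE), "avg_score")
# Print configs from highest to lowest average score.
for name, score in sorted(summary.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```

To run it against a real file, pass an open file handle (opened with `newline=""`) in place of the `io.StringIO` sample; `csv.DictReader` picks up whatever columns the header row declares.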
analysis/report-YYYY-MM-DD.md
Results are appended incrementally. Resume by running with --resume:
python run_benchmark.py --config focused --resume
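Because rows are appended as they complete, a resume can work by skipping anything already in the CSV. The sketch below shows one way this could be done, assuming a (config_name, query_id) pair identifies a finished unit of work. That keying and both helper names are assumptions; how run_benchmark.py actually implements --resume is not stated here.

```python
import csv
import os


def completed_pairs(results_path):
    """Return the (config_name, query_id) pairs already present in the CSV."""
    if not os.path.exists(results_path):
        return set()
    with open(results_path, newline="") as f:
        return {(row["config_name"], row["query_id"]) for row in csv.DictReader(f)}


def pending_work(configs, query_ids, results_path):
    """List the (config, query) pairs that still need to run, in order."""
    done = completed_pairs(results_path)
    return [(c, q) for c in configs for q in query_ids if (c, q) not in done]
```

A missing results file simply yields an empty set, so the same code path serves both a fresh run and a resumed one.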
Built-in retry with exponential backoff. Adjust --delay if needed:
python run_benchmark.py --config focused --delay 1.0
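For readers unfamiliar with the pattern, retry with exponential backoff can be sketched as below. This is a generic illustration, not the skill's implementation: `with_retry` and its parameters are hypothetical, the 10% jitter is my own choice, and whether --delay maps to something like `base_delay` is an assumption.

```python
import random
import time


def with_retry(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            # Up to 10% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable so the wait schedule can be checked without actually pausing.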
Re-create snapshot:
python create_snapshot.py --force
After running benchmarks, answer these questions:
| Strategy | Savings | Trade-off |
|----------|---------|-----------|
| Score top 5 only | 50% | Less data on long-tail |
| Use Haiku judge | 90% | Slightly less accurate |
| Cache scores | Variable | Only for unchanged retrieval |
Recommendation: Start with Haiku judge, validate sample against Sonnet.
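One way to validate a Haiku sample against Sonnet is to score the same (query, note) pairs with both judges and compare the two score lists. The sketch below is a hypothetical helper, not part of the skill; which agreement level counts as acceptable is left to the reader.

```python
def agreement(haiku_scores, sonnet_scores):
    """Compare two judges' 0-3 scores on the same (query, note) pairs."""
    if len(haiku_scores) != len(sonnet_scores) or not haiku_scores:
        raise ValueError("need two equal-length, non-empty score lists")
    pairs = list(zip(haiku_scores, sonnet_scores))
    n = len(pairs)
    return {
        # Share of pairs where both judges gave the same score.
        "exact": sum(h == s for h, s in pairs) / n,
        # Share of pairs where the judges differ by at most one point.
        "within_one": sum(abs(h - s) <= 1 for h, s in pairs) / n,
        # Average size of the disagreement, in rubric points.
        "mean_abs_diff": sum(abs(h - s) for h, s in pairs) / n,
    }
```

Since the metrics that depend on a relevance cut-off are sensitive to scores near that cut-off, it is also worth checking how often the two judges land on opposite sides of it, not only the average difference.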
.claude/skills/benchmark-memory/
├── SKILL.md # This file
├── requirements.txt # Python dependencies
├── scripts/
│ ├── run_benchmark.sh # Wrapper script (uses venv Python)
│ ├── create_snapshot.py # Create frozen Brain snapshot
│ ├── build_query_set.py # Generate/manage query test set
│ ├── run_benchmark.py # Execute benchmark with config
│ ├── score_results.py # LLM-as-judge scoring
│ ├── compute_metrics.py # Calculate evaluation metrics
│ └── analyze_results.py # Generate analysis summary
├── configs/
│ ├── focused_configs.json # Test configurations (15 configs)
│ └── judge_prompt.txt # LLM judge prompt template
├── snapshots/ # Frozen Brain copies
├── query-sets/ # Test queries (core-50.json included)
├── results/ # Benchmark CSVs
└── analysis/ # Analysis reports
| Component | Status | Notes |
|-----------|--------|-------|
| Snapshot creation | ✅ Works | Creates snapshot with FAISS index + graph |
| Query set | ✅ Works | 50 queries across 6 categories |
| Static search | ✅ Works | Traditional vector similarity |
| Spreading search | ✅ Works | Multi-iteration activation |
| 15 configs | ✅ Works | Focused parameter sweep |
| LLM-as-judge | ✅ Works | Uses Claude Code headless mode (claude -p) |
| Results CSV | Ready | Incremental writes, resume support |