bm25/SKILL.md
Ranked content search over any text corpus using BM25 (via xhluca/bm25s). Corpus-agnostic: works on cloned repos, project knowledge stores, uploaded files/archives, and any local directory. Stateless — builds an in-memory index each invocation, no cache, no persistence. Use when you need ranked multi-word content search beyond grep, or when picking the "most relevant files for these terms" across a corpus. Triggers on "rank these documents", "search this corpus", "find content about X", "which files are most about Y", or multi-word concept queries against a known body of text.
npx skillsauth add oaustegard/claude-skills bm25Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Ranked content search over any text corpus. One CLI, in-memory BM25 index per process, with a session-local disk cache so repeat invocations against the same corpus load in tens of milliseconds instead of rebuilding.
uv pip install --system --break-system-packages bm25s
Install is sub-second on a warm uv cache. That's the entire dependency.
BM25=/mnt/skills/user/bm25/scripts/bm25.py
# Local directory
python3 $BM25 ./repo 'csrf middleware'
# Multiple queries against the same in-memory index (build once, query many)
python3 $BM25 ./repo 'csrf middleware' 'session backend' 'queryset filter'
# Cloned GitHub repo via tarball (one HTTP call)
python3 $BM25 'github.com/django/django' 'atomic transaction'
python3 $BM25 'github.com/django/django@stable/5.0.x' 'atomic transaction'
# Project knowledge or uploads
python3 $BM25 project 'RAG scaling laws'
python3 $BM25 uploads 'tax loss harvesting'
# Filters
python3 $BM25 ./repo 'auth flow' --exclude 'tests/*' --exclude '*/tests/*'
python3 $BM25 ./repo 'config' --include '*.py' --include '*.toml'
# Interactive (REPL — single corpus, many queries)
python3 $BM25 ./repo --interactive
# JSON output for piping
python3 $BM25 ./repo 'auth flow' --json
| Spec | Meaning |
|------|---------|
| ./path or /abs/path | Local directory |
| uploads | /mnt/user-data/uploads/ |
| project | /mnt/project/ |
| github.com/owner/repo[@ref] | Tarball fetch via GitHub API (GH_TOKEN used if set) |
| Option | Default | Description |
|--------|---------|-------------|
| --top-k N | 10 | Results per query |
| --include GLOB | (auto) | Repeatable. If set, only files matching one of these globs are indexed |
| --exclude GLOB | | Repeatable. Skip files matching these globs |
| --snippet-lines N | 3 | Lines of snippet context per hit (0 = none) |
| --max-file-bytes N | 2,000,000 | Skip files larger than this |
| --json | | Machine-readable output |
| --interactive / -i | | REPL mode for ad-hoc querying within one session |
| --stats | | Print discover + index timings as JSON |
| --no-cache | | Bypass the session-local index cache; build in-memory only |
With no --include, a default set of text/code extensions is indexed (Python,
JS/TS, Go, Rust, Markdown, JSON, YAML, etc.). Standard noise dirs are skipped
unconditionally: .git, node_modules, __pycache__, .venv, dist, etc.
| Question shape | Tool |
|----------------|------|
| "Find lines matching class.*Error" | grep / ripgrep |
| "Show me where parse_input is defined" | tree-sitting (find:/source:) |
| "Which files are about CSRF handling?" | bm25 |
| "Rank these docs by relevance to 'rate limiting strategies'" | bm25 |
| "What's the implementation of the atomic transaction context manager?" | bm25, then tree-sitting source: |
| "Find code by natural-language concept (in a code repo)" | searching-codebases (which has its own TF-IDF mode) |
The boundary with searching-codebases: that skill is code-specific (routes
between regex and TF-IDF, expands via tree-sitting AST). bm25 is the simpler
general-purpose tool — any corpus, no AST awareness, no routing. Prefer
searching-codebases for code; reach for bm25 when the corpus is mixed
(docs + code), non-code (notes, transcripts, PDFs converted to text), or when
you specifically want BM25's length-normalized scoring.
/home/claude/.bm25-cache/<key>/. The
key is a hash of (resolved_corpus_path, include_globs, exclude_globs, max_file_bytes) — any change invalidates naturally. First invocation
builds and saves; subsequent invocations against the same corpus and
filters load in tens of milliseconds. The cache lives in /home/claude,
which is ephemeral, so it expires at the session boundary — same
lifetime as the corpus state itself, no cross-session staleness.
~5–35MB per cached index, depending on corpus size.--no-cache bypasses both load and save — useful only if you've
mutated the corpus mid-session (rare) or want to confirm a rebuild matches.--interactive, amortizes any rebuild cost across queries.tree-sitting queries on the same paths.bm25s.tokenize with stopwords disabled — over a
small Django sample, AST-derived token streams (identifiers/strings/
comments only) gave near-identical rankings, so we don't bother.Default (human-readable):
QUERY: csrf middleware
----------------------------------------------------------------------
1. 5.51 django/core/checks/security/csrf.py
def _csrf_middleware():
return "django.middleware.csrf.CsrfViewMiddleware" in settings.MIDDLEWARE
2. 5.34 docs/howto/csrf.txt
...
--json produces {"query": ..., "results": [{"path", "score", "snippet"}, ...]}.
bm25.py CLI
├── resolve_corpus(spec) → local Path (downloads tarball if github.com/...)
├── cache_key(...) → 16-hex sha256 of inputs that determine the index
├── CorpusIndex.load(cache_dir) → returns cached index if present, else None
├── CorpusIndex.build(...) → walks files, tokenizes, indexes with bm25s
├── CorpusIndex.save(cache_dir) → persists to /home/claude/.bm25-cache/<key>/
├── query(q, k) → ranked (doc_idx, score) pairs
└── best_snippet(doc, q, lines) → pick line w/ most query-term hits + context
Cache contents per directory:
bm25/ — bm25s.BM25.save() output (NumPy arrays + vocab)corpus.pkl — pickled {paths, docs} so we can render snippets without
re-reading the source filesmanifest.json — corpus root, files count, built_at timestampNo network beyond optional tarball fetch on github.com/... corpora. No
state outside /home/claude/, which is ephemeral.
testing
Disciplined, validation-gated revision of an EXISTING skill so each edit is a measured improvement rather than a guess. Use when editing, revising, or tuning a skill that already exists and there is evidence it underperforms (observed failures, drift, complaints) — invoke by name, or have versioning-skills / creating-skill defer to it before applying edits. Not for authoring a brand-new skill from scratch (use creating-skill) or one-off prose.
development
Skill-aware orchestration with context routing. Decomposes complex tasks into skill-typed subtasks, extracts targeted context subsets, executes subagents in parallel, and synthesizes results. Self-answers trivial lookups inline. No SDK dependency — uses raw HTTP via httpx. Use when tasks require multiple analytical perspectives, when context is large and subtasks only need portions, or when orchestrating-agents spawns too many redundant subagents.
tools
Orchestrates parallel API instances, delegated sub-tasks, and multi-agent workflows with streaming and tool-enabled delegation patterns. Use for parallel analysis, multi-perspective reviews, or complex task decomposition.
development
Invokes Google Gemini models for structured outputs, image generation, multi-modal tasks, and Google-specific features. Use when users request Gemini, image generation, structured JSON output, Google API integration, or cost-effective parallel processing.