oaustegard/bm25

Name: bm25
Author: oaustegard

bm25/SKILL.md

npx skillsauth add oaustegard/claude-skills bm25

bm25

Ranked content search over any text corpus. One CLI, in-memory BM25 index per process, with a session-local disk cache so repeat invocations against the same corpus load in tens of milliseconds instead of rebuilding.

Setup

uv pip install --system --break-system-packages bm25s

Install is sub-second on a warm uv cache. That's the entire dependency.

Usage

BM25=/mnt/skills/user/bm25/scripts/bm25.py

# Local directory
python3 $BM25 ./repo 'csrf middleware'

# Multiple queries against the same in-memory index (build once, query many)
python3 $BM25 ./repo 'csrf middleware' 'session backend' 'queryset filter'

# Cloned GitHub repo via tarball (one HTTP call)
python3 $BM25 'github.com/django/django' 'atomic transaction'
python3 $BM25 'github.com/django/django@stable/5.0.x' 'atomic transaction'

# Project knowledge or uploads
python3 $BM25 project 'RAG scaling laws'
python3 $BM25 uploads 'tax loss harvesting'

# Filters
python3 $BM25 ./repo 'auth flow' --exclude 'tests/*' --exclude '*/tests/*'
python3 $BM25 ./repo 'config' --include '*.py' --include '*.toml'

# Interactive (REPL — single corpus, many queries)
python3 $BM25 ./repo --interactive

# JSON output for piping
python3 $BM25 ./repo 'auth flow' --json

Corpus types

| Spec | Meaning |
|------|---------|
| ./path or /abs/path | Local directory |
| uploads | /mnt/user-data/uploads/ |
| project | /mnt/project/ |
| github.com/owner/repo[@ref] | Tarball fetch via GitHub API (GH_TOKEN used if set) |

Options

| Option | Default | Description |
|--------|---------|-------------|
| --top-k N | 10 | Results per query |
| --include GLOB | (auto) | Repeatable. If set, only files matching one of these globs are indexed |
| --exclude GLOB | | Repeatable. Skip files matching these globs |
| --snippet-lines N | 3 | Lines of snippet context per hit (0 = none) |
| --max-file-bytes N | 2,000,000 | Skip files larger than this |
| --json | | Machine-readable output |
| --interactive / -i | | REPL mode for ad-hoc querying within one session |
| --stats | | Print discover + index timings as JSON |
| --no-cache | | Bypass the session-local index cache; build in-memory only |

With no --include, a default set of text/code extensions is indexed (Python, JS/TS, Go, Rust, Markdown, JSON, YAML, etc.). Standard noise dirs are skipped unconditionally: .git, node_modules, __pycache__, .venv, dist, etc.
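The filter rules above can be sketched in a few lines. This is a minimal illustration, not the skill's actual code: the function name `should_index`, the use of `fnmatch`-style matching, the exclude-before-include order, and the shortened noise-directory set are all assumptions. The default extension set that applies when no --include is given is left out.

```python
from fnmatch import fnmatchcase

# Hypothetical subset of the noise directories; the text lists more ("etc.").
NOISE_DIRS = {".git", "node_modules", "__pycache__", ".venv", "dist"}


def should_index(rel_path, include=(), exclude=()):
    """Decide whether a corpus-relative POSIX path would be indexed."""
    parts = rel_path.split("/")
    # Noise directories are skipped regardless of any include/exclude globs.
    if any(part in NOISE_DIRS for part in parts[:-1]):
        return False
    # Any matching exclude glob drops the file (assumed to win over include).
    if any(fnmatchcase(rel_path, pattern) for pattern in exclude):
        return False
    # With include globs set, the file must match at least one of them.
    if include:
        return any(fnmatchcase(rel_path, pattern) for pattern in include)
    # Assumption: the default text/code extension check is omitted here.
    return True
```

Under `fnmatch` semantics a pattern is matched against the whole path, so `tests/*` only catches a top-level tests directory. That would explain why the usage example passes both `tests/*` and `*/tests/*`.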

When to use bm25

| Question shape | Tool |
|----------------|------|
| "Find lines matching class.*Error" | grep / ripgrep |
| "Show me where parse_input is defined" | tree-sitting (find:/source:) |
| "Which files are about CSRF handling?" | bm25 |
| "Rank these docs by relevance to 'rate limiting strategies'" | bm25 |
| "What's the implementation of the atomic transaction context manager?" | bm25, then tree-sitting source: |
| "Find code by natural-language concept (in a code repo)" | searching-codebases (which has its own TF-IDF mode) |

The boundary with searching-codebases: that skill is code-specific (routes between regex and TF-IDF, expands via tree-sitting AST). bm25 is the simpler general-purpose tool — any corpus, no AST awareness, no routing. Prefer searching-codebases for code; reach for bm25 when the corpus is mixed (docs + code), non-code (notes, transcripts, PDFs converted to text), or when you specifically want BM25's length-normalized scoring.
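To make "length-normalized scoring" concrete, here is a small sketch of the textbook Okapi BM25 formula in pure Python. It is illustrative only: bm25s may use a different BM25 variant and different defaults, and the function name `bm25_scores` and the `k1`/`b` values are assumptions, not taken from the skill.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with textbook Okapi BM25."""
    if not docs:
        return []
    n_docs = len(docs)
    # Fall back to 1.0 so a corpus of empty docs does not divide by zero.
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    # Document frequency: number of docs containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: docs longer than average are penalized via b.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores
```

With `b = 0` the length term disappears and the score depends only on term frequency and rarity. With `b` near 1, a short file that mentions a term once outranks a long file that mentions it once.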

Design notes

- Session-local disk cache at /home/claude/.bm25-cache/<key>/. The key is a hash of (resolved_corpus_path, include_globs, exclude_globs, max_file_bytes), so changing any of them produces a new key and a stale index is never loaded. The first invocation builds and saves; later invocations against the same corpus and filters load in tens of milliseconds. The cache lives in /home/claude, which is ephemeral, so it expires at the session boundary: the same lifetime as the corpus state itself, with no cross-session staleness. Each cached index takes roughly 5–35 MB, depending on corpus size.
- --no-cache bypasses both load and save. Because the key does not cover file contents, this is useful only if you've mutated the corpus mid-session (rare) or want to confirm that a rebuild matches.
- Reuse within a single invocation. The retriever stays in memory between queries in one process. Passing multiple queries positionally, or using --interactive, amortizes any rebuild cost across queries.
- No AST awareness. Chunking is per-file. For symbol-level results in code, combine with tree-sitting queries on the same paths.
- Tokenizer. Default bm25s.tokenize with stopwords disabled. Over a small Django sample, AST-derived token streams (identifiers/strings/comments only) gave near-identical rankings, so the skill does not bother with AST-derived tokenization.
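The cache-key idea from the notes above can be sketched as follows. The four inputs and the 16-hex sha256 truncation come from this document. The JSON serialization, the sorting of globs, and the exact function signature are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def cache_key(corpus_path, include_globs, exclude_globs, max_file_bytes):
    """Return a 16-hex-character key derived from the inputs that shape the index."""
    payload = json.dumps(
        {
            "corpus": str(Path(corpus_path).resolve()),
            # Assumption: glob order is treated as irrelevant, so sort first.
            "include": sorted(include_globs),
            "exclude": sorted(exclude_globs),
            "max_file_bytes": max_file_bytes,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

A key built this way changes whenever a filter or the size limit changes, but not when file contents change, which is the case --no-cache exists for.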

Output format

Default (human-readable):

QUERY: csrf middleware
----------------------------------------------------------------------
  1.   5.51  django/core/checks/security/csrf.py
    def _csrf_middleware():
        return "django.middleware.csrf.CsrfViewMiddleware" in settings.MIDDLEWARE
  2.   5.34  docs/howto/csrf.txt
    ...

--json produces {"query": ..., "results": [{"path", "score", "snippet"}, ...]}.
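A downstream script might consume that JSON shape roughly as below. The sample payload reuses the paths and scores from the human-readable example. The helper name `top_paths`, the assumption that `score` is numeric, and the assumption of one JSON object per query are illustrative, not confirmed by the skill.

```python
import json

# Sample payload in the documented shape; the values are illustrative only.
sample = json.dumps(
    {
        "query": "csrf middleware",
        "results": [
            {
                "path": "django/core/checks/security/csrf.py",
                "score": 5.51,
                "snippet": "def _csrf_middleware():",
            },
            {"path": "docs/howto/csrf.txt", "score": 5.34, "snippet": "..."},
        ],
    }
)


def top_paths(payload, min_score=0.0):
    """Return result paths whose score is at least min_score, best first."""
    data = json.loads(payload)
    hits = [r for r in data["results"] if r["score"] >= min_score]
    hits.sort(key=lambda r: r["score"], reverse=True)
    return [r["path"] for r in hits]
```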

Architecture

bm25.py CLI
  ├── resolve_corpus(spec)         → local Path (downloads tarball if github.com/...)
  ├── cache_key(...)               → 16-hex sha256 of inputs that determine the index
  ├── CorpusIndex.load(cache_dir)  → returns cached index if present, else None
  ├── CorpusIndex.build(...)       → walks files, tokenizes, indexes with bm25s
  ├── CorpusIndex.save(cache_dir)  → persists to /home/claude/.bm25-cache/<key>/
  ├── query(q, k)                  → ranked (doc_idx, score) pairs
  └── best_snippet(doc, q, lines)  → pick line w/ most query-term hits + context

Cache contents per directory:

- bm25/: bm25s.BM25.save() output (NumPy arrays + vocab)
- corpus.pkl: pickled {paths, docs}, so snippets can be rendered without re-reading the source files
- manifest.json: corpus root, file count, built_at timestamp

No network beyond optional tarball fetch on github.com/... corpora. No state outside /home/claude/, which is ephemeral.
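The best_snippet step in the outline above ("pick line w/ most query-term hits + context") could look roughly like this. It is a sketch under stated assumptions: case-insensitive substring counting, whitespace-split query terms, and ties going to the earliest line are guesses, and the real function may tokenize differently.

```python
def best_snippet(doc, query, lines=3):
    """Return up to `lines` lines around the line with the most query-term hits."""
    if lines <= 0:
        return ""
    terms = {t.lower() for t in query.split()}
    doc_lines = doc.splitlines()
    if not doc_lines:
        return ""

    def hits(line):
        # Assumption: count case-insensitive substring occurrences of each term.
        lowered = line.lower()
        return sum(lowered.count(term) for term in terms)

    # max() keeps the first index with the top count, so ties go to the earliest line.
    best = max(range(len(doc_lines)), key=lambda i: hits(doc_lines[i]))
    # Place the best line near the middle of the window, clamped at the file start.
    start = max(0, best - (lines - 1) // 2)
    return "\n".join(doc_lines[start:start + lines])
```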

Ranked content search over any text corpus using BM25 (via xhluca/bm25s). Corpus-agnostic: works on cloned repos, project knowledge stores, uploaded files/archives, and any local directory. Builds an in-memory index per process and keeps a session-local disk cache so repeat invocations skip the rebuild; nothing persists across sessions. Use when you need ranked multi-word content search beyond grep, or when picking the "most relevant files for these terms" across a corpus. Triggers on "rank these documents", "search this corpus", "find content about X", "which files are most about Y", or multi-word concept queries against a known body of text.

122 stars

development

Updated May 21, 2026

npx skillsauth add oaustegard/claude-skills bm25

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage shown |
|---------|-------------|--------|------------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: May 21, 2026, 6:43 AM (136.3s, 4 files scanned)

SKILL.md

name: bm25
description: >-
  Corpus-agnostic: works on cloned repos, project knowledge stores, uploaded
version: 0.1.2

Related Skills

oaustegard/writing-instructions

development · Verified, Trusted, Community

Write effective instructions for Claude: project instructions, standalone prompts, and skill content. Use when users need help writing prompts, setting up project instructions, choosing between instruction formats, or improving how they communicate with Claude. Covers writing principles, model-aware calibration, and format selection. For building and testing complete skills, use skill-creator instead.

134 · SKILL.md · Updated Jul 26, 2026

oaustegard/finding-skills

data-ai · Verified, Trusted, Community

Discover and load skills on demand from /mnt/skills/user/. Use when you need a capability but don't know which skill provides it, when the boot-emitted skill list is names-only and you need a full description, or when you want to list the catalog. Verbs are list (names only), search (rank by name/description match against a query), and show (emit the full SKILL.md for a named skill).

134 · SKILL.md · Updated Jul 26, 2026

oaustegard/transcribing-images

documentation · Verified, Trusted, Community

Reads the visual content of slides, pages, and images the way a human would, not just their embedded text. Use when a PPTX or PDF has image slides, screenshots, charts, scanned figures, or flattened-to-image layouts that the built-in pptx/pdf skills read as empty; when asked to transcribe, describe, OCR, or extract what is shown in an image, slide deck, or document page; or when embedded-text extraction returned little or nothing from a visually rich file. Triggers on 'read this deck', 'what's on these slides', 'transcribe', 'OCR', 'extract text from image', 'describe this chart/diagram', .pptx/.pdf/.png/.jpg with visual content.

134 · SKILL.md · Updated Jul 26, 2026

oaustegard/svg-portrait-mode

development · Verified, Trusted, Community

Portrait Mode for SVGs: foveated vectorization with 4-zone selective detail. Combines vision annotations, MediaPipe segmentation/landmarks, and optional saliency. Like phone portrait mode, but vectorized. Use when vectorizing a portrait or photo where subject detail should outrank background detail.

134 · SKILL.md · Updated Jul 26, 2026

Install manually

# Clone the repo
git clone https://github.com/oaustegard/claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r claude-skills/bm25 ~/.claude/skills/

Repository

oaustegard/claude-skills

122 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT