skills/nlp/whoosh-fulltext-search-index/SKILL.md
Builds a Whoosh full-text search index over documents and queries it with boolean operators, field scoping, and proximity matching.
npx skillsauth add wenmin-wu/ds-skills nlp-whoosh-fulltext-search-index
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Whoosh is a pure-Python full-text search library with no external dependencies. It supports boolean queries, field-scoped search, wildcards, and proximity matching, which makes it a good fit for Kaggle competitions and prototyping where Elasticsearch is unavailable. Build an index over document fields (title, abstract, classification codes), then query it with structured boolean expressions. Results are ranked with BM25F, Whoosh's default scoring model, without extra configuration.
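To give a sense of what BM25-style ranking computes, here is a minimal standard-library sketch of the textbook BM25 formula. It is not Whoosh's implementation: the function name `bm25_scores`, the toy documents, and the exact IDF variant are assumptions made for illustration, and Whoosh's own scorer may differ in detail.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalized through b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, already tokenized and lowercased.
docs = [
    "neural network for image classification".split(),
    "classification of patent documents".split(),
    "gradient boosting on tabular data".split(),
]
scores = bm25_scores(["neural", "classification"], docs)
```

The first document matches both query terms and should rank highest, while the third matches neither and scores zero.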
import os
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import MultifieldParser, OrGroup

# Define schema: doc_id and title are stored so they can be read back from hits
schema = Schema(
    doc_id=ID(stored=True, unique=True),
    title=TEXT(stored=True),
    abstract=TEXT,
    cpc=TEXT,
)

# Build index (`documents` is assumed to be an iterable of dicts with these keys)
ix_dir = "my_index"
os.makedirs(ix_dir, exist_ok=True)
ix = index.create_in(ix_dir, schema)
writer = ix.writer()
for doc in documents:
    writer.add_document(
        doc_id=doc['id'], title=doc['title'],
        abstract=doc['abstract'], cpc=doc['cpc_codes']
    )
writer.commit()

# Query with field scoping and boolean operators
ix = index.open_dir(ix_dir)
qp = MultifieldParser(["title", "abstract", "cpc"], schema, group=OrGroup)
query = qp.parse("title:neural OR abstract:classification OR cpc:G06F")
with ix.searcher() as searcher:  # closes the searcher when the block exits
    results = searcher.search(query, limit=50)
    hits = [r['doc_id'] for r in results]
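When the query terms come from data rather than being typed by hand, the field-scoped query string can be assembled programmatically. The helper below is a hypothetical sketch (`build_query` is not part of Whoosh). It assumes each term is a single token with no spaces or query-syntax characters, since it does no escaping.

```python
def build_query(field_terms, operator="OR"):
    """Join field-scoped terms into one boolean query string.

    field_terms maps a field name to a list of single-token terms,
    for example {"title": ["neural"]}.
    """
    if operator not in ("OR", "AND"):
        raise ValueError("operator must be 'OR' or 'AND'")
    # One "field:term" clause per term, in the order the fields were given.
    clauses = [
        f"{field}:{term}"
        for field, terms in field_terms.items()
        for term in terms
    ]
    return f" {operator} ".join(clauses)


query_string = build_query({
    "title": ["neural"],
    "abstract": ["classification"],
    "cpc": ["G06F"],
})
```

The resulting string can be passed to `qp.parse(...)` in place of the hand-written query.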
- Field scoping (field:term), boolean ops (OR, AND), wildcards (term*)
- title:word searches only the title field, which reduces false positives
- ADJ5 (within 5 positions) for phrase-like matching, but doesn't work with wildcards
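How an ADJ-style proximity expression reaches Whoosh is not shown above. One possible approach is to rewrite it into the phrase-slop form that Whoosh's default query language accepts, `"a b"~5`. The sketch below assumes that mapping is acceptable; the helper `adj_to_slop` is hypothetical, and the slop semantics may not match a given ADJ definition exactly (for example on word order). It handles only a single pair of plain words and leaves wildcard operands unchanged.

```python
import re

# Matches "<word> ADJ<n> <word>". Operands ending in a wildcard (term*)
# are deliberately not matched, so those expressions pass through as-is.
_ADJ = re.compile(r"\b(\w+)\s+ADJ(\d+)\s+(\w+)\b(?!\*)")


def adj_to_slop(query):
    """Rewrite 'a ADJ5 b' as the phrase-slop form '"a b"~5'."""
    return _ADJ.sub(r'"\1 \3"~\2', query)


rewritten = adj_to_slop("neural ADJ5 network")
unchanged = adj_to_slop("neural ADJ5 net*")
```

Phrase queries rely on term positions, which `TEXT` fields record by default, so the rewritten string can be parsed against the schema above.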