skills/llm/tfidf-chunked-retrieval/SKILL.md
Scalable TF-IDF retrieval over large document corpora using frozen vocabulary and chunked top-k merging.
npx skillsauth add wenmin-wu/ds-skills llm-tfidf-chunked-retrieval
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
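For large corpora (100K+ documents), compute TF-IDF similarity in chunks to avoid out-of-memory (OOM) errors: the full query-by-document score matrix never has to be held in memory at once. Fit the vocabulary on the query corpus, freeze it, then apply it to each document chunk. Because every chunk shares the same feature space, scores are comparable across chunks. Collect per-chunk top-k results and merge them globally. This can be faster than dense retrieval for keyword-heavy queries.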
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
# Fit vocab on queries, freeze for documents
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(queries)
vocab = vectorizer.get_feature_names_out()
vectorizer_docs = TfidfVectorizer(ngram_range=(1, 2), vocabulary=vocab)
query_vecs = vectorizer_docs.fit_transform(queries)
# Chunked retrieval
chunk_size, top_k = 50000, 5
all_scores, all_indices = [], []
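for start in range(0, len(documents), chunk_size):
    chunk_vecs = vectorizer_docs.transform(documents[start:start + chunk_size])
    # Dense (n_queries, n_chunk_docs) scores; rows are L2-normalized by default, so this is cosine similarity
    scores = (query_vecs @ chunk_vecs.T).toarray()
    k = min(top_k, scores.shape[1])  # the last chunk may hold fewer than top_k documents
    top_idx = np.argpartition(scores, -k, axis=1)[:, -k:]  # unordered top-k per query
    all_indices.append(top_idx + start)  # shift to global document indices
    all_scores.append(np.take_along_axis(scores, top_idx, axis=1))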
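
The loop collects per-chunk candidates in `all_scores` and `all_indices`, but the global merge mentioned in the description is not shown. A minimal sketch of that step follows, assuming each list entry is an array with one row per query. The helper name `merge_topk` and the toy arrays are hypothetical and only illustrate one way to do the merge.

```python
import numpy as np

def merge_topk(all_scores, all_indices, top_k):
    """Hypothetical helper: merge per-chunk candidates into a global top-k per query."""
    # Stack candidates from every chunk side by side: (n_queries, total_candidates)
    scores = np.concatenate(all_scores, axis=1)
    indices = np.concatenate(all_indices, axis=1)
    k = min(top_k, scores.shape[1])
    # Sort candidates by descending score and keep the best k per query
    order = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    return (np.take_along_axis(scores, order, axis=1),
            np.take_along_axis(indices, order, axis=1))

# Toy data (assumed): 2 queries, 2 chunks, 2 candidates per chunk
all_scores = [np.array([[0.1, 0.9], [0.4, 0.2]]),
              np.array([[0.5, 0.3], [0.8, 0.7]])]
all_indices = [np.array([[0, 1], [1, 0]]),
               np.array([[2, 3], [3, 2]])]

final_scores, final_indices = merge_topk(all_scores, all_indices, top_k=2)
print(final_indices)
```

Because only `top_k` candidates per chunk reach the merge, the final sort touches at most `top_k * n_chunks` entries per query, regardless of corpus size.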