skills/nlp/tfidf-to-boolean-query/SKILL.md
Converts TF-IDF top-k terms into field-scoped boolean OR queries for structured document retrieval from a full-text index.
npx skillsauth add wenmin-wu/ds-skills nlp-tfidf-to-boolean-query
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Full-text search engines accept boolean queries (term1 OR term2 AND field:term3), but choosing which terms to include is non-trivial. This technique uses TF-IDF to rank vocabulary terms by importance, selects the top-k, qualifies each with its source field (title, abstract, classification code), and joins them into a boolean OR query. The result is a structured, field-aware query derived from the document's own content, which is useful for prior art search, duplicate detection, and similar-document retrieval.
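To make the ranking step concrete, here is a minimal, dependency-free sketch of TF-IDF top-k selection for a single document. The function name `top_k_terms`, the whitespace tokenizer, and the toy corpus are assumptions for illustration only. The smoothed IDF form `ln((1 + n) / (1 + df)) + 1` follows scikit-learn's default; L2 normalization is left out because it does not change the ranking within one document.

```python
import math
from collections import Counter


def top_k_terms(doc, corpus, k=3):
    """Rank the terms of `doc` by TF-IDF against `corpus`; return the top k."""
    n_docs = len(corpus)
    # Document frequency: number of corpus documents containing each term.
    df = Counter()
    for text in corpus:
        df.update(set(text.lower().split()))
    # Term frequency within the query document.
    tf = Counter(doc.lower().split())
    scores = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
        for term, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Hypothetical toy corpus of titles.
corpus = [
    "neural network for image classification",
    "method for wireless network routing",
    "apparatus for image compression",
]
print(top_k_terms("neural network for image classification", corpus, k=2))
# → ['classification', 'neural']
```

Terms that appear in every document, such as "for", receive the lowest weight and fall out of the top-k, which is why the resulting query tends to keep only discriminative terms.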
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np


def select_top_k(tfidf_matrix, k=10):
    """Select the top-k terms by column sum, skipping zero-weight terms."""
    col_sums = np.asarray(tfidf_matrix.sum(axis=0)).flatten()
    top_indices = np.argsort(-col_sums)[:k]
    # Drop terms that do not occur in the input, so a short document
    # does not pad the query with arbitrary vocabulary entries.
    return top_indices[col_sums[top_indices] > 0]


def build_boolean_query(doc_text, cpc_codes, ti_tfidf, cpc_tfidf, k=10):
    """Build a field-scoped boolean OR query from TF-IDF top-k."""
    # Get top-k title terms
    ti_matrix = ti_tfidf.transform([doc_text])
    ti_indices = select_top_k(ti_matrix, k)
    ti_terms = ti_tfidf.get_feature_names_out()[ti_indices]
    # Get top-k CPC codes
    cpc_matrix = cpc_tfidf.transform([cpc_codes])
    cpc_indices = select_top_k(cpc_matrix, k)
    cpc_terms = cpc_tfidf.get_feature_names_out()[cpc_indices]
    # Build field-qualified query
    parts = [f"ti:{t}" for t in ti_terms] + [f"cpc:{c}" for c in cpc_terms]
    return " OR ".join(parts)


# Fit TF-IDF on corpus, then build queries.
# titles and cpc_strings are lists of strings, one entry per document;
# cpc_strings are assumed to hold space-separated CPC codes.
ti_tfidf = TfidfVectorizer(max_features=5000).fit(titles)
# Keep case and split on whitespace so codes such as "G06F" stay intact.
cpc_tfidf = TfidfVectorizer(lowercase=False, token_pattern=r"\S+").fit(cpc_strings)
query = build_boolean_query(doc_title, doc_cpc, ti_tfidf, cpc_tfidf)
# → e.g. "ti:neural OR ti:network OR cpc:G06F OR cpc:H04L"
Qualify each selected term with its field prefix (ti:, abs:, cpc:), then join the qualified terms with OR to form the boolean query.
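The qualify-and-join step can be sketched independently of how the terms were ranked. The helper below is a hypothetical illustration, not part of the skill: `build_field_query` and its `operator` parameter are assumed names, and the example terms are made up. It takes already-selected terms per field (including the `abs:` field) and joins them into one query string.

```python
def build_field_query(field_terms, operator="OR"):
    """Join field-qualified terms, e.g. {"ti": ["neural"]} -> "ti:neural"."""
    parts = []
    seen = set()
    for field, terms in field_terms.items():
        for term in terms:
            clause = f"{field}:{term}"
            if clause not in seen:  # skip duplicates while preserving order
                seen.add(clause)
                parts.append(clause)
    return f" {operator} ".join(parts)


# Illustrative terms only; in practice these come from the TF-IDF top-k step.
query = build_field_query({
    "ti": ["neural", "network"],
    "abs": ["convolution"],
    "cpc": ["G06F", "H04L"],
})
print(query)
# → ti:neural OR ti:network OR abs:convolution OR cpc:G06F OR cpc:H04L
```

Keeping the join separate from the ranking makes it straightforward to add or drop a field without touching the vectorizers. Terms containing characters that are special in a given engine's query syntax would still need escaping, which this sketch does not attempt.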