skills/llm/iterative-pairwise-keyword-extraction/SKILL.md
Iteratively prompts an LLM over document pairs to extract and deduplicate keywords, building a comprehensive term set from multiple perspectives.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add wenmin-wu/ds-skills llm-iterative-pairwise-keyword-extraction
A single LLM prompt over one document tends to produce a narrow set of keywords, biased by how the prompt is framed. Iterating over pairs, (anchor document, neighbor 1), (anchor, neighbor 2), and so on, changes the context each time, so each prompt highlights different aspects of the anchor. The keywords from all iterations are merged and deduplicated, which yields a richer term set than any single extraction. This is especially useful for retrieval tasks where the query needs to cover multiple facets of the source document.
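The merge-and-deduplicate step can be looked at in isolation, without loading a model. The sketch below is illustrative only: `merge_answers` is a hypothetical helper name, the three answers are made up to stand in for LLM output from three (anchor, neighbor) pairs, and first-seen ordering is a choice made here for readability rather than something the skill requires.

```python
import re


def merge_answers(answers, max_len=40):
    """Merge raw per-pair answers into one deduplicated keyword list.

    Keeps first-seen order; comparison is case-insensitive.
    """
    merged = []
    seen = set()
    for answer in answers:
        # Accept commas, semicolons, or newlines as separators
        for term in re.split(r"[,;\n]", answer):
            term = term.strip().lower()
            # Skip empty fragments, overly long strings, and repeats
            if term and len(term) < max_len and term not in seen:
                seen.add(term)
                merged.append(term)
    return merged


# Made-up answers, one per (anchor, neighbor) pair
answers = [
    "graph neural networks, message passing",
    "Message Passing; node classification",
    "graph neural networks\nover-smoothing",
]
keywords = merge_answers(answers)
print(keywords)
```

Each pair contributes at least one term the others missed, while repeated terms such as "message passing" appear only once in the merged list.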
import re

from transformers import AutoTokenizer, AutoModelForCausalLM

TEMPLATE = """Given these two documents, extract the key technical terms
that describe what the first document is about.
Document A: {title_a}
{abstract_a}
Document B: {title_b}
{abstract_b}
Keywords (comma separated):"""


def extract_keywords_pairwise(anchor, neighbors, model, tokenizer, max_tokens=50):
    """Prompt once per (anchor, neighbor) pair and merge the keywords."""
    all_keywords = set()
    for neighbor in neighbors:
        prompt = TEMPLATE.format(
            title_a=anchor['title'], abstract_a=anchor['abstract'],
            title_b=neighbor['title'], abstract_b=neighbor['abstract'],
        )
        # Keep the inputs on the same device as the model
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, max_new_tokens=max_tokens,
                                    do_sample=False)
        # Decode only the newly generated tokens, not the prompt
        answer = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:],
                                  skip_special_tokens=True)
        # Split on commas, semicolons, and newlines together, then drop
        # empty fragments and anything too long to be a keyword
        terms = (t.strip().lower() for t in re.split(r"[,;\n]", answer))
        all_keywords.update(t for t in terms if t and len(t) < 40)
    return list(all_keywords)

# Get 5 nearest neighbors via embedding similarity
# (get_nearest, anchor, corpus, model, and tokenizer are not defined here)
neighbors = get_nearest(anchor, corpus, k=5)
keywords = extract_keywords_pairwise(anchor, neighbors, model, tokenizer)
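
The usage snippet calls `get_nearest` without defining it. One possible shape for that helper is sketched below, under the assumption that every document dict carries a precomputed vector under an `embedding` key. That key, the `cosine` helper, and the toy vectors are illustrative choices made here, not part of the skill. For a large corpus a vector index would likely replace the linear scan.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def get_nearest(anchor, corpus, k=5):
    """Return the k documents most similar to the anchor, excluding the anchor."""
    # Assumption: each doc has an 'embedding' list of floats
    candidates = [doc for doc in corpus if doc is not anchor]
    candidates.sort(
        key=lambda doc: cosine(anchor["embedding"], doc["embedding"]),
        reverse=True,
    )
    return candidates[:k]


# Toy corpus with 2-d embeddings, for illustration only
anchor = {"title": "anchor", "abstract": "...", "embedding": [1.0, 0.0]}
corpus = [
    anchor,
    {"title": "a", "abstract": "...", "embedding": [0.9, 0.1]},
    {"title": "b", "abstract": "...", "embedding": [0.0, 1.0]},
    {"title": "c", "abstract": "...", "embedding": [0.5, 0.5]},
]
nearest = get_nearest(anchor, corpus, k=2)
print([doc["title"] for doc in nearest])
```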