extracting-keywords/SKILL.md
Extract keywords from documents using YAKE algorithm with support for 34 languages (Arabic to Chinese). Use when users request keyword extraction, key terms, topic identification, content summarization, or document analysis. Includes domain-specific stopwords for AI/ML and life sciences. Optional deeper extraction mode (n=2+n=3 combined) for comprehensive coverage.
Extract keywords from text using YAKE (Yet Another Keyword Extractor), an unsupervised statistical keyword extraction algorithm.
First time only: Install YAKE with optimized dependencies to avoid unnecessary downloads.
cd /home/claude
uv venv yake-venv --system-site-packages
uv pip install yake --python yake-venv/bin/python --no-deps
uv pip install jellyfish segtok regex --python yake-venv/bin/python
This reuses system packages (numpy, networkx) instead of downloading them (~0.08s vs ~5s).
Built-in YAKE stopwords (34 languages): Use the lan="<code>" parameter
English (lan="en") is the default
Custom domain stopwords (bundled in assets/):
AI/ML: stopwords_ai.txt
Life Sciences: stopwords_ls.txt
import yake
# Read text
with open('document.txt', 'r') as f:
    text = f.read()
# Extract with English stopwords (default)
kw_extractor = yake.KeywordExtractor(
    lan="en",      # Language code
    n=3,           # Max n-gram size (1-3 word phrases)
    dedupLim=0.9,  # Deduplication threshold (0-1)
    top=20         # Number of keywords to return
)
keywords = kw_extractor.extract_keywords(text)
# Display results (lower score = more important)
for kw, score in keywords:
    print(f"{score:.4f} {kw}")
Option 1: Install custom stopwords file
# Copy life sciences stopwords to YAKE package
cp assets/stopwords_ls.txt /home/claude/yake-venv/lib/python3.12/site-packages/yake/core/StopwordsList/stopwords_ls.txt
# Use with lan="ls"
kw_extractor = yake.KeywordExtractor(lan="ls", n=3, top=20)
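The hard-coded python3.12 segment in the cp command depends on the interpreter version inside the venv. As a minimal sketch, assuming the core/StopwordsList layout shown above, the package directory could instead be located at run time. The helper names find_package_dir and install_stopwords are hypothetical and not part of YAKE, and the script would need to run with the venv's Python.

```python
import importlib.util
import shutil
from pathlib import Path


def find_package_dir(name="yake"):
    """Return the directory of an installed package, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


def install_stopwords(src, package_dir, code):
    """Copy src to <package_dir>/core/StopwordsList/stopwords_<code>.txt.

    The core/StopwordsList layout is an assumption taken from the cp
    command above; it may differ between YAKE versions.
    """
    target_dir = Path(package_dir) / "core" / "StopwordsList"
    if not target_dir.is_dir():
        raise FileNotFoundError(f"No stopwords directory at {target_dir}")
    target = target_dir / f"stopwords_{code}.txt"
    shutil.copyfile(src, target)
    return target


# Example (run with /home/claude/yake-venv/bin/python):
# install_stopwords("assets/stopwords_ls.txt", find_package_dir("yake"), "ls")
```

Raising when the directory is missing, rather than creating it, avoids silently writing the file somewhere YAKE would never read it.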
Option 2: Load custom stopwords directly
# Load stopwords from file
with open('assets/stopwords_ls.txt', 'r') as f:
    custom_stops = set(line.strip().lower() for line in f)
# Pass to extractor
kw_extractor = yake.KeywordExtractor(
    stopwords=custom_stops,
    n=3,
    top=20
)
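One detail of the loading pattern above: a blank line in the stopwords file ends up in the set as an empty string. A small loader that skips blank lines might look like the sketch below. The name load_stopwords is hypothetical, and reading the file as UTF-8 is an assumption about how the bundled lists are encoded.

```python
from pathlib import Path


def load_stopwords(path):
    """Read one stopword per line, lowercased, ignoring blank lines."""
    # Assumption: the stopwords file is UTF-8 encoded.
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


# Example:
# custom_stops = load_stopwords("assets/stopwords_ls.txt")
# kw_extractor = yake.KeywordExtractor(stopwords=custom_stops, n=3, top=20)
```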
# Load AI/ML stopwords
with open('/mnt/skills/user/extracting-keywords/assets/stopwords_ai.txt', 'r') as f:
    ai_stops = set(line.strip().lower() for line in f)
# Extract with AI stopwords
kw_extractor = yake.KeywordExtractor(
    stopwords=ai_stops,
    n=3,
    top=20
)
keywords = kw_extractor.extract_keywords(text)
For more comprehensive extraction, run both n=2 and n=3 and consolidate results. This captures both focused phrases and broader context with ~100% time overhead (still <2s for large documents).
import yake
# Load domain stopwords
with open('/mnt/skills/user/extracting-keywords/assets/stopwords_ai.txt', 'r') as f:
    stops = set(line.strip().lower() for line in f)
# Extract with n=2 (captures focused phrases)
kw_n2 = yake.KeywordExtractor(stopwords=stops, n=2, dedupLim=0.9, top=50)
results_n2 = kw_n2.extract_keywords(text)
# Extract with n=3 (captures broader context)
kw_n3 = yake.KeywordExtractor(stopwords=stops, n=3, dedupLim=0.9, top=50)
results_n3 = kw_n3.extract_keywords(text)
# Consolidate: union with score averaging for overlaps
combined = {}
for kw, score in results_n2:
    combined[kw] = score
for kw, score in results_n3:
    if kw in combined:
        combined[kw] = (combined[kw] + score) / 2
    else:
        combined[kw] = score
# Sort by score (lower = more important)
consolidated = sorted(combined.items(), key=lambda x: x[1])
# Display top 30
for kw, score in consolidated[:30]:
    print(f"{score:.4f} {kw}")
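The consolidation step can be isolated into a function and checked without running YAKE. The sketch below uses the same union-and-average logic as above. The name consolidate is hypothetical, and the sample scores are made up for illustration rather than taken from a real extraction.

```python
def consolidate(results_a, results_b):
    """Union two (keyword, score) lists, averaging scores of shared keywords."""
    combined = dict(results_a)
    for kw, score in results_b:
        if kw in combined:
            combined[kw] = (combined[kw] + score) / 2
        else:
            combined[kw] = score
    # Lower score = more important, so sort ascending.
    return sorted(combined.items(), key=lambda item: item[1])


# Made-up scores for illustration only; real values come from extract_keywords().
sample_n2 = [("keyword extraction", 0.02), ("stopwords", 0.10)]
sample_n3 = [("keyword extraction", 0.04), ("unsupervised keyword extraction", 0.05)]
merged = consolidate(sample_n2, sample_n3)
for kw, score in merged:
    print(f"{score:.4f} {kw}")
```

A keyword found by only one pass keeps its original score, so a phrase that appears in both passes is neither rewarded nor penalized beyond the averaging.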
Benefits:
Performance:
lan (str): Language code for built-in stopwords
  "en" - English (default)
  "ai" - AI/ML (if stopwords_ai.txt installed in YAKE)
  "ls" - Life sciences (if stopwords_ls.txt installed in YAKE)
Built-in YAKE languages (34 total):
  "ar" - Arabic, "bg" - Bulgarian, "br" - Breton, "cz" - Czech, "da" - Danish, "de" - German,
  "el" - Greek, "es" - Spanish, "et" - Estonian, "fa" - Farsi/Persian, "fi" - Finnish, "fr" - French,
  "hi" - Hindi, "hr" - Croatian, "hu" - Hungarian, "hy" - Armenian, "id" - Indonesian, "it" - Italian,
  "ja" - Japanese, "lt" - Lithuanian, "lv" - Latvian, "nl" - Dutch, "no" - Norwegian, "pl" - Polish,
  "pt" - Portuguese, "ro" - Romanian, "ru" - Russian, "sk" - Slovak, "sl" - Slovenian, "sv" - Swedish,
  "tr" - Turkish, "uk" - Ukrainian, "zh" - Chinese
n (int): Maximum n-gram size (default: 3)
  1 - Single words only
  2 - Up to 2-word phrases
  3 - Up to 3-word phrases (recommended)
  4-5 - May produce suboptimal results with YAKE's algorithm
dedupLim (float): Deduplication threshold (default: 0.9)
top (int): Number of keywords to return (default: 20)
stopwords (set): Custom stopwords set (overrides lan parameter)
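To build intuition for what a deduplication threshold such as dedupLim controls, the sketch below filters a ranked list greedily: a candidate is dropped when it is too similar to a keyword that has already been kept. This is an illustration, not YAKE's implementation. It uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure, the name dedupe is hypothetical, and the scores are made up.

```python
from difflib import SequenceMatcher


def dedupe(ranked, dedup_lim):
    """Keep a keyword only if it is not too similar to any already-kept one.

    ranked: list of (keyword, score) sorted best-first.
    dedup_lim: similarity threshold in [0, 1]; lower drops more near-duplicates.
    """
    kept = []
    for kw, score in ranked:
        too_similar = any(
            SequenceMatcher(None, kw, other).ratio() > dedup_lim
            for other, _ in kept
        )
        if not too_similar:
            kept.append((kw, score))
    return kept


# Made-up scores for illustration only.
ranked = [
    ("machine learning", 0.01),
    ("machine learning model", 0.02),
    ("neural network", 0.03),
]
loose = dedupe(ranked, 0.9)   # keeps all three
strict = dedupe(ranked, 0.8)  # drops "machine learning model"
```

With this stand-in measure, "machine learning" and "machine learning model" have a similarity of about 0.84, so they coexist at a threshold of 0.9 but not at 0.8.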
import yake
# Read document
with open('/mnt/user-data/uploads/article.txt', 'r') as f:
    text = f.read()
# Extract keywords
kw_extractor = yake.KeywordExtractor(lan="en", n=3, top=30)
keywords = kw_extractor.extract_keywords(text)
# Format results
results = []
for kw, score in keywords:
    results.append(f"{score:.4f} {kw}")
print("\n".join(results))
import yake
# Load life sciences stopwords
with open('assets/stopwords_ls.txt', 'r') as f:
    ls_stops = set(line.strip().lower() for line in f)
# Extract with English stopwords
kw_en = yake.KeywordExtractor(lan="en", n=3, top=20)
keywords_en = kw_en.extract_keywords(text)
# Extract with life sciences stopwords
kw_ls = yake.KeywordExtractor(stopwords=ls_stops, n=3, top=20)
keywords_ls = kw_ls.extract_keywords(text)
# Compare results
print("English stopwords:")
for kw, score in keywords_en:
    print(f" {score:.4f} {kw}")
print("\nLife sciences stopwords:")
for kw, score in keywords_ls:
    print(f" {score:.4f} {kw}")
import yake
import os
# Initialize extractor
kw_extractor = yake.KeywordExtractor(lan="en", n=3, top=15)
# Process multiple files
results = {}
for filename in os.listdir('/mnt/user-data/uploads'):
    if filename.endswith('.txt'):
        with open(f'/mnt/user-data/uploads/{filename}', 'r') as f:
            text = f.read()
        keywords = kw_extractor.extract_keywords(text)
        results[filename] = keywords
# Output results
for filename, keywords in results.items():
    print(f"\n{filename}:")
    for kw, score in keywords[:10]:  # Top 10
        print(f" {score:.4f} {kw}")
import yake
# French document
with open('/mnt/user-data/uploads/article_fr.txt', 'r') as f:
    french_text = f.read()
# Extract with French stopwords
kw_fr = yake.KeywordExtractor(lan="fr", n=3, top=20)
keywords_fr = kw_fr.extract_keywords(french_text)
print("Mots-clés (French):")
for kw, score in keywords_fr:
    print(f" {score:.4f} {kw}")
# German document
with open('/mnt/user-data/uploads/artikel_de.txt', 'r') as f:
    german_text = f.read()
# Extract with German stopwords
kw_de = yake.KeywordExtractor(lan="de", n=3, top=20)
keywords_de = kw_de.extract_keywords(german_text)
print("\nSchlüsselwörter (German):")
for kw, score in keywords_de:
    print(f" {score:.4f} {kw}")
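When the language arrives as a locale-style tag from document metadata (for example "fr-CA"), it may help to reduce it to one of the built-in codes before constructing the extractor. The sketch below is one way to do that. The name resolve_lan is hypothetical, the code set is copied from the language list in this document, and falling back to "en" for unknown tags is a design assumption.

```python
# Language codes copied from the built-in list in this document.
SUPPORTED_LANS = {
    "ar", "bg", "br", "cz", "da", "de", "el", "en", "es", "et", "fa", "fi",
    "fr", "hi", "hr", "hu", "hy", "id", "it", "ja", "lt", "lv", "nl", "no",
    "pl", "pt", "ro", "ru", "sk", "sl", "sv", "tr", "uk", "zh",
}


def resolve_lan(tag, default="en"):
    """Map a locale-style tag such as 'fr-CA' or 'de_DE' to a lan code."""
    primary = tag.strip().lower().replace("_", "-").split("-")[0]
    return primary if primary in SUPPORTED_LANS else default


# Example:
# kw = yake.KeywordExtractor(lan=resolve_lan("fr-CA"), n=3, top=20)
```

Note that the list uses "cz" for Czech, so a tag using the ISO 639-1 code "cs" would fall back to the default here unless it is mapped explicitly.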
for kw, score in keywords:
    print(f"{kw}: {score:.4f}")
import csv
with open('/mnt/user-data/outputs/keywords.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Keyword', 'Score'])
    writer.writerows(keywords)
import json
output = [{"keyword": kw, "score": score} for kw, score in keywords]
with open('/mnt/user-data/outputs/keywords.json', 'w') as f:
    json.dump(output, f, indent=2)
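Because raw YAKE scores are hard to read at a glance (lower means more important), a ranked export can be easier to consume. The following sketch adds a 1-based rank column and serializes to CSV in memory. The names ranked_rows and rows_to_csv are hypothetical, and rounding scores to four digits simply mirrors the display format used elsewhere in this document.

```python
import csv
import io


def ranked_rows(keywords, digits=4):
    """Add a 1-based rank to (keyword, score) pairs, best (lowest score) first."""
    ordered = sorted(keywords, key=lambda pair: pair[1])
    return [
        {"rank": i, "keyword": kw, "score": round(score, digits)}
        for i, (kw, score) in enumerate(ordered, start=1)
    ]


def rows_to_csv(rows):
    """Serialize ranked rows to a CSV string with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["rank", "keyword", "score"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Example:
# csv_text = rows_to_csv(ranked_rows(keywords))
```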
/home/claude/yake-venv/bin/python
Import errors: Verify venv installation
/home/claude/yake-venv/bin/python -c "import yake; print(yake.__version__)"
Empty results: Check text length (YAKE needs sufficient content, typically 100+ words)
Poor quality keywords: Adjust parameters:
  Lower dedupLim for more aggressive deduplication
  Increase top to see more candidates
Generic terms appearing: Add custom stopwords for your domain:
with open('assets/stopwords_ls.txt', 'r') as f:
    stops = set(line.strip().lower() for line in f)
# Add domain-specific terms
stops.update(['term1', 'term2', 'term3'])
kw_extractor = yake.KeywordExtractor(stopwords=stops, n=3, top=20)
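For the empty-results case above, a quick length check before extraction can explain why nothing useful came back. This is a minimal sketch: the name has_enough_text is hypothetical, and the default of 100 words is the rule of thumb stated above rather than a limit enforced by YAKE.

```python
def has_enough_text(text, min_words=100):
    """Return True if text has at least min_words whitespace-separated tokens."""
    return len(text.split()) >= min_words


# Example:
# if not has_enough_text(text):
#     print("Text may be too short for reliable keyword extraction")
```

Whitespace splitting undercounts languages written without spaces between words, such as Chinese or Japanese, so a character-based threshold may suit those better.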