skills/nlp/lda-topic-modeling/SKILL.md
Latent Dirichlet Allocation on CountVectorizer bag-of-words to discover latent topics with per-document topic distributions for feature engineering or EDA
npx skillsauth add wenmin-wu/ds-skills nlp-lda-topic-modeling
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Latent Dirichlet Allocation discovers latent topics in a text corpus. Each document is modeled as a mixture of topics, each topic as a distribution over words. Fit LDA on a CountVectorizer bag-of-words matrix to extract topic distributions per document — usable as features for downstream models or for understanding corpus structure.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# `texts` is assumed to be an iterable of raw document strings defined by the caller.
# Build a bag-of-words count matrix (unigrams and bigrams, English stop words removed).
cvec = CountVectorizer(min_df=4, max_features=50000,
                       ngram_range=(1, 2), stop_words='english')
bow = cvec.fit_transform(texts)

# Fit LDA with online (mini-batch) variational Bayes.
lda = LatentDirichletAllocation(
    n_components=20, learning_method='online',
    max_iter=20, random_state=42)
topic_dist = lda.fit_transform(bow)  # shape: (n_docs, n_topics)

# Print the 10 highest-weighted terms for each topic.
vocab = cvec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [vocab[j] for j in topic.argsort()[:-11:-1]]
    print(f"Topic {i}: {', '.join(top_words)}")
- Use CountVectorizer, not TF-IDF: LDA expects raw counts
- learning_method='online' for large corpora
- fit_transform returns per-document topic distributions (features)
- Use lda.components_ to label topics by their top words
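One way to use the per-document topic distributions as features is to chain the vectorizer, LDA, and a classifier in a scikit-learn pipeline. The sketch below assumes a hypothetical binary labeling task; the texts, labels, the choice of `LogisticRegression`, and all hyperparameters are placeholders for illustration, not part of the skill itself.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled training data (0 = pets, 1 = finance).
train_texts = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "my dog chased the neighbour's cat",
    "stock markets fell as interest rates rose",
    "investors watch interest rates and bond markets",
    "the central bank raised rates to cool markets",
]
train_labels = [0, 0, 0, 1, 1, 1]

# Raw counts -> topic distributions -> classifier.
clf = make_pipeline(
    CountVectorizer(stop_words="english"),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LogisticRegression(),
)
clf.fit(train_texts, train_labels)

# Unseen documents go through transform (not fit_transform), so the
# fitted vocabulary and topics are reused.
new_texts = ["the cat and the dog are pets", "bond markets and interest rates"]
new_topic_features = clf[:-1].transform(new_texts)  # shape: (2, n_topics)
new_preds = clf.predict(new_texts)
```

Keeping the steps in one pipeline means the vectorizer and LDA are fitted only on training data, which avoids leaking test documents into the topic model during cross-validation.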