skills/nlp/tfidf-ngram-classifier/SKILL.md
High-order n-gram TF-IDF (3-5 grams) with sublinear TF, feeding a weighted soft-voting ensemble of traditional ML classifiers.
npx skillsauth add wenmin-wu/ds-skills nlp-tfidf-ngram-classifier
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
For text classification where transformer fine-tuning is too slow or labeled data is limited, use high-order word n-gram TF-IDF (3-5 grams) with sublinear term frequency. Feed the features into a weighted soft-voting ensemble of MultinomialNB, SGDClassifier, and LightGBM or CatBoost. The approach is fast, interpretable, and competitive: it reached 0.96+ AUC on AI text detection.
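To make the feature side concrete, here is a minimal standard-library sketch of what "3-5 word n-grams with sublinear term frequency" means. It is an illustration only, not scikit-learn's implementation: the helper names `word_ngrams` and `sublinear_tf` are hypothetical, tokenization is a plain whitespace split (scikit-learn's default token pattern differs), and the IDF weighting and L2 normalization that `TfidfVectorizer` applies afterwards are left out.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=3, n_max=5):
    """Return all word n-grams of length n_min..n_max as space-joined strings."""
    tokens = text.split()  # simplification: whitespace tokenization
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def sublinear_tf(count):
    """Dampened term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0


doc = "the cat sat on the mat and the cat sat on the rug"
counts = Counter(word_ngrams(doc))

# Repeated n-grams gain weight slowly: a count of 2 maps to 1 + ln(2), not 2.
weights = {gram: sublinear_tf(c) for gram, c in counts.items()}
```

With sublinear scaling, a phrase that repeats many times in one document cannot dominate the feature vector, which tends to matter more for long n-grams whose raw counts are sparse and spiky.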
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import VotingClassifier

# train_texts, test_texts, and y_train are assumed to be defined already.

# Vectorize with high-order word n-grams
vectorizer = TfidfVectorizer(
    ngram_range=(3, 5),
    sublinear_tf=True,        # use 1 + log(tf) instead of raw counts
    strip_accents='unicode',
    lowercase=False,          # preserve the original casing
    analyzer='word',
)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Ensemble of diverse classifiers
nb = MultinomialNB(alpha=0.02)
# modified_huber exposes predict_proba, which soft voting requires
sgd = SGDClassifier(max_iter=8000, tol=1e-4, loss='modified_huber')

# A LightGBM or CatBoost model can be added as a third estimator.
ensemble = VotingClassifier(
    estimators=[('nb', nb), ('sgd', sgd)],
    weights=[0.3, 0.7],
    voting='soft',
    n_jobs=-1,
)
ensemble.fit(X_train, y_train)

# Probability of the positive class for each test document
preds = ensemble.predict_proba(X_test)[:, 1]
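The weighted soft vote used above boils down to a weighted average of each model's predicted probabilities. The sketch below reproduces that arithmetic in plain Python for the positive class only. The function `soft_vote` and the probability values are hypothetical, chosen for illustration; the weights `[0.3, 0.7]` mirror the ones passed to `VotingClassifier`.

```python
def soft_vote(probas, weights):
    """Weighted average of per-model positive-class probabilities.

    probas:  list of per-model lists, one probability per sample.
    weights: one non-negative weight per model.
    """
    if len(probas) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_samples = len(probas[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probas)) / total
        for i in range(n_samples)
    ]


# Hypothetical positive-class probabilities from two models on three samples.
nb_proba = [0.90, 0.20, 0.60]
sgd_proba = [0.70, 0.40, 0.10]

blended = soft_vote([nb_proba, sgd_proba], weights=[0.3, 0.7])

# For a binary problem, picking the class with the larger averaged
# probability is the same as thresholding the positive class at 0.5.
labels = [int(p >= 0.5) for p in blended]
```

On the third sample the two models disagree (0.60 versus 0.10), and the heavier weight on the second model pulls the blended score to about 0.25. In practice the weights are usually tuned on a validation set rather than fixed in advance.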