skills/17-DAAF-Contribution-Community-daaf/dot-claude/skills/scikit-learn/SKILL.md
Machine learning: clustering, PCA/t-SNE/UMAP, classification, prediction-focused regression (Ridge/Lasso/ensemble), cross-validation, Pipelines. For unsupervised analysis, classification, or prediction. For econometric regression use pyfixest/statsmodels.
General-purpose machine learning with scikit-learn. Covers unsupervised methods (clustering, GMM, PCA, t-SNE, UMAP, manifold learning, evaluation metrics), supervised methods (classification, prediction-focused regression via Ridge/Lasso/ensemble methods, model evaluation, cross-validation), and shared infrastructure (preprocessing, Pipeline construction, feature selection). Use when performing cluster analysis, dimension reduction, classification, prediction-focused regression, or model evaluation in Python. For econometric regression (OLS, FE, IV, DiD), see pyfixest and statsmodels skills instead.
Comprehensive skill for machine learning in Python with scikit-learn. Covers unsupervised methods (clustering, decomposition, manifold learning), supervised methods (classification, regression), and shared infrastructure (preprocessing, pipelines, evaluation). Use the decision trees below to find the right guidance, then load the detailed references.
scikit-learn is the standard general-purpose machine learning library for Python:
- fit() / predict() / transform(): learn once, apply everywhere

This skill targets scikit-learn 1.8.0. Notable changes in recent versions:
- set_output(transform="pandas") for DataFrame output from transformers (1.2+)
- n_init="auto" default for KMeans (1.4+): uses 10 for init="random", 1 for init="k-means++"

Each topic in ./references/ contains focused documentation:
| File | Purpose | When to Read |
|------|---------|--------------|
| quickstart.md | Import patterns, fit/predict/transform API, Pipeline, train_test_split | First use of scikit-learn |
| clustering.md | KMeans, AgglomerativeClustering, DBSCAN, HDBSCAN, SpectralClustering, OPTICS | Cluster analysis tasks |
| mixture-models.md | GaussianMixture, BayesianGaussianMixture, BIC/AIC model selection | Model-based clustering, soft assignments |
| decomposition.md | PCA, KernelPCA, TruncatedSVD, NMF, IncrementalPCA | Dimension reduction tasks |
| manifold.md | t-SNE, UMAP (umap-learn), Isomap, LLE, MDS, SpectralEmbedding | Visualizing high-dimensional data |
| evaluation-unsupervised.md | silhouette_score, Davies-Bouldin, Calinski-Harabasz, ARI, NMI, gap statistic | Validating cluster solutions |
| preprocessing.md | StandardScaler, encoders, ColumnTransformer, Pipeline construction | Preparing data for ML |
| classification.md | LogisticRegression, RandomForest, GradientBoosting, SVC, KNeighbors | Classification tasks |
| regression-ml.md | Ridge, Lasso, ElasticNet, tree/ensemble regressors, SVR | ML regression (prediction-focused) |
| evaluation-supervised.md | Accuracy, F1, ROC-AUC, confusion matrix, cross_val_score, GridSearchCV | Evaluating supervised models |
| feature-selection.md | SelectKBest, RFE, permutation_importance, VarianceThreshold | Selecting informative features |
| gotchas.md | Data leakage, scaling errors, t-SNE misinterpretation, class imbalance | Avoiding common mistakes |
| interpretation.md | SHAP values (TreeExplainer, KernelExplainer), permutation importance visualization, partial dependence plots, ICE plots | After training a model, when interpretation or explanation is needed |
| fairness.md | fairlearn MetricFrame, ThresholdOptimizer, ExponentiatedGradient, demographic parity, equalized odds | Assessing or mitigating fairness of supervised models |
- quickstart.md, then the task-specific reference
- clustering.md, then evaluation-unsupervised.md
- classification.md, then evaluation-supervised.md
- preprocessing.md (covers Pipeline construction)
- gotchas.md first
- interpretation.md, then check supervised-ml.md in the data-scientist skill for methodology
- fairness.md, then check supervised-ml.md in the data-scientist skill for the conceptual framework

| Skill | Relationship |
|-------|-------------|
| data-scientist | Methodology guidance — load exploratory-unsupervised.md for "when and why" behind unsupervised methods |
| pyfixest | Econometric regression: OLS with fixed effects, IV, DiD, clustered SEs, hypothesis testing |
| statsmodels | Statistical modeling: OLS without FE, GLM, time series, diagnostic tests |
| polars | Data preparation before ML (convert to pandas/numpy before passing to scikit-learn) |
| geopandas | Spatial analysis — use geopandas for geographic data, not scikit-learn |
| plotnine | Custom visualization beyond scikit-learn's built-in plotting |
| data-scientist | Load supervised-ml.md for supervised ML methodology — the "when and why" behind prediction, interpretation, and fairness |
Routing guidance:
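- Econometric regression (OLS, FE, IV, DiD): pyfixest or statsmodels, not scikit-learn
- Methodology (the "when and why" behind unsupervised methods): exploratory-unsupervised.md in the data-scientist skill
- Spatial analysis of geographic data: geopandas
- Data preparation before ML: polars

What kind of data and clusters?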
├─ Continuous data, roughly spherical clusters
│ ├─ Know k → KMeans (./references/clustering.md)
│ └─ Don't know k → try multiple k + silhouette/gap
│ (./references/clustering.md + ./references/evaluation-unsupervised.md)
├─ Continuous data, arbitrary shapes
│ ├─ Dense clusters, possible noise → DBSCAN or HDBSCAN (./references/clustering.md)
│ └─ Need soft assignments → GaussianMixture (./references/mixture-models.md)
├─ Need hierarchy / dendrogram → AgglomerativeClustering (./references/clustering.md)
├─ Mixed data types → Gower distance workaround (./references/gotchas.md)
└─ Need probabilistic model comparison → GaussianMixture with BIC
(./references/mixture-models.md)
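The "Don't know k" branch above can be sketched as a loop over candidate values of k, scored by silhouette. This is a minimal illustration, not a prescribed procedure: the data is synthetic (`make_blobs`), and the candidate range 2 to 8 and `n_init=10` are arbitrary assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): four compact, roughly spherical clusters.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=42)

# KMeans is distance-based, so put features on a common scale first.
X_scaled = StandardScaler().fit_transform(X)

# Score each candidate k by mean silhouette (range -1 to 1, higher is better).
silhouette_by_k = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    silhouette_by_k[k] = silhouette_score(X_scaled, labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
for k, score in silhouette_by_k.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```

Silhouette favors compact, well-separated clusters, so it suits the spherical case; for arbitrary shapes the DBSCAN/HDBSCAN branch is likely the better starting point.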
What is the goal?
├─ Linear reduction for subsequent analysis → PCA (./references/decomposition.md)
├─ Large sparse data → TruncatedSVD (./references/decomposition.md)
├─ Non-negative components → NMF (./references/decomposition.md)
├─ Visualization of structure → t-SNE or UMAP (./references/manifold.md)
│ └─ CAUTION: visualization only, not for analysis
│ (see data-scientist exploratory-unsupervised.md for methodology)
├─ Nonlinear manifold learning → Isomap or LLE (./references/manifold.md)
└─ Correspondence analysis (CA, MCA) → use the prince library
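For the "Linear reduction for subsequent analysis" branch, a short sketch of PCA on standardized features may help. It uses the iris data bundled with scikit-learn purely as a stand-in, and the 95% variance target is an assumed threshold, not a recommendation from the references.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # 150 samples, 4 numeric features

# Scale first: PCA is variance-based, so features with large units would dominate.
# A float n_components keeps the fewest components that reach that variance share.
pca_pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pca_pipe.fit_transform(X)

pca = pca_pipe.named_steps["pca"]
print("components kept:", pca.n_components_)
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
print("reduced shape:", X_reduced.shape)
```

Unlike t-SNE or UMAP coordinates, the PCA scores in `X_reduced` are a linear projection and can reasonably feed later modeling steps.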
What constraints?
├─ Interpretable model needed → LogisticRegression or DecisionTreeClassifier
│ (./references/classification.md)
├─ Best predictive performance → GradientBoostingClassifier or RandomForestClassifier
│ (./references/classification.md)
├─ High-dimensional sparse data → LogisticRegression with penalty
│ (./references/classification.md)
├─ Small dataset, few features → KNeighborsClassifier or SVC
│ (./references/classification.md)
└─ Need probability estimates → any classifier with predict_proba()
(./references/classification.md)
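As an illustration of the "Interpretable model" and "Need probability estimates" branches, the sketch below fits a scaled LogisticRegression and reads class probabilities from predict_proba(). The dataset (scikit-learn's bundled breast cancer data) and the 80/20 stratified split are assumptions chosen only to make the example runnable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling inside the pipeline means the scaler is fit on training data only.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)  # shape (n_samples, n_classes); each row sums to 1
accuracy = clf.score(X_test, y_test)

print(classification_report(y_test, y_pred))
print("first three class-1 probabilities:", proba[:3, 1].round(3))
```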
What kind of regression?
├─ NOTE: For econometric regression (hypothesis testing, standard errors,
│ coefficient interpretation), use pyfixest or statsmodels instead
├─ Prediction-focused, nonlinear → GradientBoostingRegressor or RandomForestRegressor
│ (./references/regression-ml.md)
├─ High-dimensional with regularization → Lasso, Ridge, or ElasticNet
│ (./references/regression-ml.md)
├─ Nonlinear relationships → GradientBoostingRegressor or SVR
│ (./references/regression-ml.md)
└─ Simple baseline → Ridge (./references/regression-ml.md)
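One way to act on the tree above is to compare the simple Ridge baseline against a nonlinear ensemble under the same cross-validation. The sketch below does this on synthetic data (`make_friedman1`, which has a nonlinear target); the sample size, noise level, and `alpha=1.0` are illustrative assumptions.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem with a known nonlinear signal (assumption).
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=42)

models = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),  # linear baseline
    "gbr": GradientBoostingRegressor(random_state=42),           # nonlinear ensemble
}

# Same 5-fold split and metric for both models, so the scores are comparable.
cv_r2 = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

These scores measure predictive fit only; they say nothing about standard errors or coefficient inference, which is why the tree routes econometric questions to pyfixest or statsmodels.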
What kind of evaluation?
├─ Unsupervised (no ground truth)
│ ├─ Cluster quality → silhouette_score, Davies-Bouldin
│ │ (./references/evaluation-unsupervised.md)
│ ├─ Stability → Bootstrap + compare across resamples
│ │ (./references/evaluation-unsupervised.md)
│ └─ Against known labels → ARI, NMI
│ (./references/evaluation-unsupervised.md)
├─ Supervised classification
│ ├─ Balanced classes → accuracy + F1 (./references/evaluation-supervised.md)
│ ├─ Imbalanced classes → precision, recall, ROC-AUC
│ │ (./references/evaluation-supervised.md)
│ └─ Model selection → cross_val_score or GridSearchCV
│ (./references/evaluation-supervised.md)
└─ Supervised regression
├─ R-squared, RMSE, MAE (./references/evaluation-supervised.md)
└─ Model selection → cross_val_score or GridSearchCV
(./references/evaluation-supervised.md)
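The "Imbalanced classes" branch is easier to appreciate with numbers. The sketch below, on synthetic data with an assumed 95/5 class split, compares a majority-class baseline with a logistic regression: the baseline's accuracy looks strong while its recall on the minority class is zero.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem (assumption): roughly 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Baseline that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)
baseline_acc = accuracy_score(y_test, baseline_pred)
baseline_recall = recall_score(y_test, baseline_pred)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
}

print(f"baseline accuracy={baseline_acc:.3f}, baseline recall={baseline_recall:.3f}")
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```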
What kind of interpretation?
├─ Feature importance (global) → SHAP beeswarm/bar or permutation importance
│ (./references/interpretation.md)
├─ Single prediction explanation → SHAP waterfall or force plot
│ (./references/interpretation.md)
├─ Feature effect visualization → PDP or SHAP dependence plot
│ (./references/interpretation.md)
├─ Fairness across demographic groups → MetricFrame
│ (./references/fairness.md)
└─ CAUTION: feature importance ≠ causal importance
(see data-scientist supervised-ml.md for methodology)
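For the global feature importance branch, permutation importance needs only scikit-learn, so it makes a compact sketch (SHAP requires the separate shap package and is left to the reference file). The data here is synthetic, with `shuffle=False` so that the three informative features sit in columns 0 to 2; that layout is an assumption of this example, not a property of real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2; the rest are noise.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Importance = drop in held-out score when one feature's values are shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
ranking = result.importances_mean.argsort()[::-1]

for idx in ranking:
    mean, std = result.importances_mean[idx], result.importances_std[idx]
    print(f"feature {idx}: {mean:.3f} +/- {std:.3f}")
```

A high score means the model relies on that feature for prediction, which, as the caution above notes, is not evidence of a causal effect.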
Important: In data research pipelines (see CLAUDE.md), scikit-learn analyses are executed through script files, not interactively. This ensures auditability and reproducibility.
The pattern:
scripts/stage8_analysis/{step}_{task-name}.py

Closely read agent_reference/SCRIPT_EXECUTION_REFERENCE.md for the mandatory file-first execution protocol covering complete code file writing, output capture, and file versioning rules. All ML scripts must follow the Inline Audit Trail (IAT) standard -- see agent_reference/INLINE_AUDIT_TRAIL.md. For ML code, document model selection rationale (why this algorithm, why these hyperparameters, what assumptions) with # INTENT:, # REASONING:, and # ASSUMES: comments.
See:
- agent_reference/WORKFLOW_PHASE4_ANALYSIS.md -- Stage 8 (Analysis & Visualization)
- agent_reference/INLINE_AUDIT_TRAIL.md -- IAT documentation standard

The examples below show scikit-learn syntax. In research workflows, wrap them in scripts following the file-first pattern.
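As a rough idea of how the # INTENT:, # REASONING:, and # ASSUMES: comments might sit in an analysis script, here is a hypothetical skeleton. The file name, function name, placeholder data, and overall layout are illustrative assumptions; the binding rules are in the agent_reference documents, which are not reproduced here.

```python
# Hypothetical file: scripts/stage8_analysis/01_cluster-baseline.py
# (name and layout are illustrative assumptions, not the mandated protocol)
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

RANDOM_STATE = 42  # fixed seed so reruns give the same result


def run_analysis():
    # Placeholder data (assumption): a real script would load the pipeline's dataset.
    X, _ = make_blobs(n_samples=300, centers=3, random_state=RANDOM_STATE)

    # INTENT: Partition observations into k groups as an exploratory baseline.
    # REASONING: KMeans is a simple first pass; k=3 and n_init=10 are placeholder choices.
    # ASSUMES: Continuous features, roughly spherical clusters, features on a common scale.
    X_scaled = StandardScaler().fit_transform(X)
    labels = KMeans(n_clusters=3, n_init=10, random_state=RANDOM_STATE).fit_predict(X_scaled)

    return {
        "n_clusters": len(set(labels)),
        "silhouette": float(silhouette_score(X_scaled, labels)),
    }


if __name__ == "__main__":
    # Printing results lets the run's output be captured alongside the script.
    print(run_analysis())
```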
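import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Synthetic data so the patterns below run as written
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Supervised: fit + predict
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Unsupervised: fit + transform (or fit_transform)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics

# Clustering: fit + labels_
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
labels = kmeans.labels_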
| Operation | Code |
|-----------|------|
| Train-test split | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) |
| Scale features | X_scaled = StandardScaler().fit_transform(X) |
| Build pipeline | pipe = make_pipeline(StandardScaler(), LogisticRegression()) |
| Cross-validate | scores = cross_val_score(model, X, y, cv=5, scoring="accuracy") |
| Grid search | GridSearchCV(model, param_grid, cv=5, scoring="accuracy") |
| KMeans clustering | KMeans(n_clusters=k, n_init=10, random_state=42).fit(X) |
| PCA | PCA(n_components=5).fit_transform(X_scaled) |
| Logistic regression | LogisticRegression(max_iter=1000).fit(X_train, y_train) |
| Random forest | RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train) |
| Gradient boosting | HistGradientBoostingClassifier().fit(X_train, y_train) |
| Classification report | print(classification_report(y_test, y_pred)) |
| Confusion matrix | confusion_matrix(y_test, y_pred) |
| Silhouette score | silhouette_score(X, labels) |
| Feature importance | model.feature_importances_ |
| Permutation importance | permutation_importance(model, X_test, y_test, random_state=42) |
| Set output format | transformer.set_output(transform="pandas") |
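Several rows of the table combine in practice: a pipeline is built, tuned with GridSearchCV, then scored on held-out data. The sketch below shows one way to chain them, using scikit-learn's bundled wine data and an arbitrary grid over C; both are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name,
# so step parameters are addressed as "<step>__<param>".
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# The scaler is refit inside every CV fold, which avoids leaking test-fold statistics.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
print(f"held-out accuracy: {search.score(X_test, y_test):.3f}")
```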
| Topic | Reference File |
|-------|---------------|
| Installation and imports | ./references/quickstart.md |
| fit/predict/transform API | ./references/quickstart.md |
| Pipeline construction | ./references/quickstart.md |
| Train-test split | ./references/quickstart.md |
| Reproducibility (random_state) | ./references/quickstart.md |
| KMeans, MiniBatchKMeans | ./references/clustering.md |
| AgglomerativeClustering | ./references/clustering.md |
| DBSCAN, HDBSCAN, OPTICS | ./references/clustering.md |
| SpectralClustering | ./references/clustering.md |
| GaussianMixture | ./references/mixture-models.md |
| BayesianGaussianMixture | ./references/mixture-models.md |
| BIC/AIC model selection | ./references/mixture-models.md |
| Soft cluster assignments | ./references/mixture-models.md |
| PCA, KernelPCA | ./references/decomposition.md |
| TruncatedSVD (sparse data) | ./references/decomposition.md |
| NMF | ./references/decomposition.md |
| IncrementalPCA | ./references/decomposition.md |
| t-SNE | ./references/manifold.md |
| UMAP (umap-learn) | ./references/manifold.md |
| Isomap, LLE, MDS | ./references/manifold.md |
| silhouette_score | ./references/evaluation-unsupervised.md |
| Davies-Bouldin, Calinski-Harabasz | ./references/evaluation-unsupervised.md |
| Adjusted Rand Index, NMI | ./references/evaluation-unsupervised.md |
| Gap statistic | ./references/evaluation-unsupervised.md |
| StandardScaler, MinMaxScaler | ./references/preprocessing.md |
| OneHotEncoder, OrdinalEncoder | ./references/preprocessing.md |
| ColumnTransformer | ./references/preprocessing.md |
| Pipeline, make_pipeline | ./references/preprocessing.md |
| LogisticRegression | ./references/classification.md |
| DecisionTreeClassifier | ./references/classification.md |
| RandomForestClassifier | ./references/classification.md |
| GradientBoostingClassifier | ./references/classification.md |
| SVC, KNeighborsClassifier | ./references/classification.md |
| Ridge, Lasso, ElasticNet | ./references/regression-ml.md |
| RandomForestRegressor | ./references/regression-ml.md |
| GradientBoostingRegressor | ./references/regression-ml.md |
| SVR, KNeighborsRegressor | ./references/regression-ml.md |
| accuracy, precision, recall, F1 | ./references/evaluation-supervised.md |
| ROC-AUC, confusion matrix | ./references/evaluation-supervised.md |
| cross_val_score, GridSearchCV | ./references/evaluation-supervised.md |
| learning_curve | ./references/evaluation-supervised.md |
| SelectKBest, RFE | ./references/feature-selection.md |
| feature_importances_ | ./references/feature-selection.md |
| permutation_importance | ./references/feature-selection.md |
| Data leakage | ./references/gotchas.md |
| Scaling for distance-based methods | ./references/gotchas.md |
| t-SNE/UMAP distance interpretation | ./references/gotchas.md |
| Class imbalance | ./references/gotchas.md |
| random_state reproducibility | ./references/gotchas.md |
| SHAP values (TreeExplainer, KernelExplainer) | ./references/interpretation.md |
| Permutation importance visualization | ./references/interpretation.md |
| Partial dependence plots (PDP) | ./references/interpretation.md |
| ICE plots | ./references/interpretation.md |
| Model interpretation caveats | ./references/interpretation.md |
| fairlearn MetricFrame | ./references/fairness.md |
| ThresholdOptimizer | ./references/fairness.md |
| ExponentiatedGradient | ./references/fairness.md |
| Demographic parity | ./references/fairness.md |
| Equalized odds | ./references/fairness.md |
| LightGBM (LGBMClassifier, LGBMRegressor) | ./references/classification.md, ./references/regression-ml.md |
| XGBoost (XGBClassifier, XGBRegressor) | ./references/classification.md, ./references/regression-ml.md |
When this library is used as a primary analytical tool, include in the report's Software & Tools references:
Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research, 12, 2825-2830.
- Cite when: scikit-learn is used for machine learning models, clustering, dimensionality reduction, or cross-validation central to the analysis.
- Do not cite when: Only used for a single preprocessing step (e.g., StandardScaler in a pipeline where the primary model is from another library).
For method-specific citations (e.g., individual algorithms or techniques),
consult the reference files in this skill and agent_reference/CITATION_REFERENCE.md.