machine-learning/omics-classifiers/SKILL.md
Builds classification models for omics data using RandomForest, XGBoost, and logistic regression with sklearn-compatible APIs. Includes proper preprocessing and evaluation metrics for biomarker classifiers. Use when building diagnostic or prognostic classifiers from expression or variant data.
Reference examples tested with: matplotlib 3.8+, pandas 2.2+, scikit-learn 1.4+
Before using code patterns, verify installed versions match. If versions differ:
- Run `pip show <package>`, then `help(module.function)` to check signatures.

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
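The version and signature checks described above can also be done from inside Python. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `accepted_parameters` are hypothetical, and `json.dumps` merely stands in for whichever function you need to inspect.

```python
import inspect
import json
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def accepted_parameters(func):
    """Return the parameter names a callable accepts, taken from its signature."""
    return list(inspect.signature(func).parameters)


# Report what is installed for the packages this document relies on
for pkg in ("scikit-learn", "pandas", "matplotlib"):
    print(pkg, installed_version(pkg) or "not installed")

# json.dumps is a placeholder; pass the function whose signature you want to verify
print(accepted_parameters(json.dumps))
```

If a keyword used in an example is missing from the reported parameter list, that is a sign the installed release differs from the one the example was written against.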
"Build a classifier from my gene expression data" -> Train RandomForest, XGBoost, or logistic regression models on omics features with proper preprocessing and evaluation metrics.
sklearn.ensemble.RandomForestClassifier(), xgboost.XGBClassifier()

Goal: Train a classification model on omics data and evaluate its predictive performance.
Approach: Build a scaled pipeline with a Random Forest classifier, fit on training data, and assess with ROC-AUC on held-out test data.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# X: samples x features matrix, y: binary labels (both assumed to be loaded already)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
# Probability of the positive class (column 1); valid for binary labels only
y_prob = pipe.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print(f'ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}')
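Omics cohorts are often small, so a single 80/20 split can give a noisy ROC-AUC estimate. As a complement to the held-out evaluation above, here is a hedged sketch of stratified cross-validation over the same kind of pipeline. The synthetic matrix from `make_classification` and the names `X_demo`, `y_demo`, and `pipe_cv` are assumptions for illustration, not part of the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an omics matrix: 100 samples x 200 features (assumption)
X_demo, y_demo = make_classification(
    n_samples=100, n_features=200, n_informative=10, random_state=42
)

pipe_cv = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=42)),
])

# The scaler is refit inside each training fold, so no test-fold statistics leak in
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe_cv, X_demo, y_demo, cv=cv, scoring='roc_auc')
print(f'ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}')
```

Keeping the scaler inside the pipeline is what makes this safe: scaling the full matrix before splitting would let test samples influence the training features.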
Goal: Train a gradient-boosted tree classifier using the sklearn-compatible XGBoost API.
Approach: Configure XGBClassifier with proper parameter names (avoiding deprecated aliases) and wrap in a scaling pipeline.
from xgboost import XGBClassifier
# Use sklearn-compatible API with proper parameters (avoid deprecated seed, nthread)
xgb = XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,  # NOT seed
    n_jobs=-1,  # NOT nthread
    eval_metric='logloss'
)
pipe = Pipeline([('scaler', StandardScaler()), ('clf', xgb)])
pipe.fit(X_train, y_train)
Goal: Build an interpretable linear classifier that simultaneously selects sparse biomarker features.
Approach: Use L1-regularized logistic regression with built-in cross-validation for penalty selection, then extract nonzero coefficients as selected features.
from sklearn.linear_model import LogisticRegressionCV
# L1 for sparse biomarkers, L2 for correlated features, elasticnet for mixed
logit = LogisticRegressionCV(
    Cs=10,
    cv=5,
    penalty='l1',
    solver='saga',
    max_iter=1000,
    random_state=42
)
pipe = Pipeline([('scaler', StandardScaler()), ('clf', logit)])
pipe.fit(X_train, y_train)
# Get selected features (nonzero coefficients)
# Assumes X is a pandas DataFrame and y is binary, so coef_ has a single row
feature_mask = pipe.named_steps['clf'].coef_[0] != 0
selected = X.columns[feature_mask]
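To see the sparse selection end to end, the sketch below fits the same kind of L1 pipeline on synthetic data and ranks the surviving features by absolute coefficient. The data, the `gene_*` column names, and the smaller `Cs`/`cv` settings are assumptions chosen to keep the example fast. Parameter names follow the scikit-learn 1.4 API referenced in this document, and newer releases may emit deprecation warnings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic expression-like matrix with hypothetical gene names (assumption)
X_arr, y_demo = make_classification(
    n_samples=120, n_features=30, n_informative=5, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f'gene_{i}' for i in range(X_arr.shape[1])])

sparse_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegressionCV(Cs=5, cv=3, penalty='l1', solver='saga',
                                 max_iter=2000, random_state=42)),
])
sparse_pipe.fit(X_demo, y_demo)

# Pair each coefficient with its feature name, keep the nonzero ones,
# and order them by absolute effect size
coefs = pd.Series(sparse_pipe.named_steps['clf'].coef_[0], index=X_demo.columns)
selected_coefs = coefs[coefs != 0]
ranked = selected_coefs.reindex(selected_coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Because the features are standardized inside the pipeline, coefficient magnitudes are comparable across genes, which is what makes the ranking meaningful.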
Goal: Generate a publication-quality ROC curve showing classifier discrimination ability.
Approach: Compute false/true positive rates from predicted probabilities and plot with AUC annotation.
fpr, tpr, _ = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, label=f'ROC (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.savefig('roc_curve.png', dpi=150)
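For intuition about what the AUC in the legend measures: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The following pure-Python sketch computes that directly; `pairwise_auc` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError('AUC needs at least one positive and one negative sample')
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
# Three of the four positive/negative pairs are ordered correctly
print(pairwise_auc(labels, scores))  # 0.75
```

This quadratic-time version is only meant for small examples; `roc_auc_score` is the practical choice for real data.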
Goal: Handle classification tasks with more than two classes while addressing class imbalance.
Approach: Encode labels numerically and use balanced class weights to upweight underrepresented classes during training.
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_encoded = le.fit_transform(y)
# Use class_weight for imbalanced data
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
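The snippet above stops before fitting. One possible way to complete and evaluate it is sketched below, under stated assumptions: the data is synthetic and imbalanced, the subtype names are hypothetical, and macro-averaged one-vs-rest ROC-AUC is used as the multi-class summary metric, which is a choice made here rather than one stated in the original.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic, imbalanced 3-class data (assumption); class names are hypothetical
X_demo, y_int = make_classification(
    n_samples=300, n_features=40, n_informative=8, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0
)
subtype_names = np.array(['basal', 'luminal', 'normal'])
y_demo = subtype_names[y_int]

# Encode string labels as integers; le.classes_ keeps the mapping back to names
le = LabelEncoder()
y_encoded = le.fit_transform(y_demo)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_encoded, test_size=0.3, stratify=y_encoded, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf.fit(X_tr, y_tr)

y_pred = rf.predict(X_te)
print(classification_report(y_te, y_pred, target_names=le.classes_))

# Multi-class AUC needs the full probability matrix, not a single column
proba = rf.predict_proba(X_te)
macro_auc = roc_auc_score(y_te, proba, multi_class='ovr', average='macro')
print(f'Macro OvR ROC-AUC: {macro_auc:.3f}')

# Map integer predictions back to the original class names
predicted_names = le.inverse_transform(y_pred)
```

Per-class recall in the report is usually the first place to look with imbalanced data, since overall accuracy can be dominated by the majority class.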
Goal: Rank features by their contribution to tree-based classifier predictions.
Approach: Extract Gini importances from a fitted Random Forest and sort to identify top contributing features.
import pandas as pd
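# pipe must be the fitted Random Forest pipeline; a logistic regression step has no feature_importances_
importances = pipe.named_steps['clf'].feature_importances_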
feature_imp = pd.DataFrame({'feature': X.columns, 'importance': importances})
feature_imp = feature_imp.sort_values('importance', ascending=False).head(20)
| Data Type | Scaler | Notes |
|-----------|--------|-------|
| Log-counts (RNA-seq) | StandardScaler | Assumes ~normal after log |
| TPM/FPKM | StandardScaler | Gene-wise centering |
| Raw counts | None | Tree models handle counts |
| Mixed features | ColumnTransformer | Different scalers per type |
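For the mixed-features row, a `ColumnTransformer` might be set up roughly as follows. This is a sketch under assumptions: the column names, the toy values, and the choice of `MinMaxScaler` for age and `OneHotEncoder` for sex are illustrative, not prescribed by the table.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Toy table mixing expression values with clinical covariates (hypothetical columns)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'gene_a': rng.normal(5, 2, size=8),
    'gene_b': rng.normal(8, 1, size=8),
    'age': rng.integers(30, 80, size=8),
    'sex': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M'],
})

preprocess = ColumnTransformer(
    [
        ('expr', StandardScaler(), ['gene_a', 'gene_b']),   # z-score expression
        ('clin_num', MinMaxScaler(), ['age']),              # rescale age to [0, 1]
        ('clin_cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

Xt = preprocess.fit_transform(df)
print(Xt.shape)  # 2 expression + 1 age + 2 one-hot columns
```

The transformer can then take the place of the `'scaler'` step in any of the pipelines above, so that all preprocessing is still fit on training data only.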