machine-learning/biomarker-discovery/SKILL.md
Selects informative features for biomarker discovery using Boruta all-relevant selection, mRMR minimum redundancy, and LASSO regularization. Use when identifying biomarkers from high-dimensional omics data.
Reference examples tested with: numpy 1.26+, pandas 2.2+, scikit-learn 1.4+
Before using code patterns, verify installed versions match. If versions differ:
Run `pip show <package>`, then `help(module.function)` to check signatures.
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Find the best biomarkers in my omics data" -> Select informative features using all-relevant selection (Boruta), minimum redundancy (mRMR), or regularization (LASSO) to identify candidate biomarkers.
Key calls: `BorutaPy(rf, n_estimators='auto')`, `sklearn.linear_model.LassoCV()`
Boruta identifies all features whose importance is significantly higher than that of randomized copies of the features (shadow features).
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
# max_iter=100: Typically sufficient; increase to 200 if many features remain tentative
# perc=100: Compare against the max shadow importance (default, most stringent);
# lower values (e.g. 90) are more lenient and admit more false positives
boruta = BorutaPy(rf, n_estimators='auto', max_iter=100, random_state=42, verbose=0)
boruta.fit(X.values, y)
selected = X.columns[boruta.support_]
tentative = X.columns[boruta.support_weak_]
print(f'Selected: {len(selected)}, Tentative: {len(tentative)}')
feature_ranks = pd.DataFrame({
    'feature': X.columns,
    'rank': boruta.ranking_,
    'selected': boruta.support_
}).sort_values('rank')
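The shadow-feature idea behind Boruta can be illustrated with a single comparison round using only scikit-learn. This is a minimal sketch on synthetic data, not BorutaPy's actual implementation: the real algorithm repeats this round many times and applies a statistical test to the accumulated hits. The names `X_demo`, `y_demo`, and `hits` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for an omics matrix: 200 samples, 30 features,
# of which the first 5 are informative (shuffle=False keeps them first).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=30, n_informative=5, n_redundant=0,
    shuffle=False, random_state=0,
)

rng = np.random.default_rng(0)
# Shadow features: each column permuted independently, breaking any link to y.
X_shadow = rng.permuted(X_demo, axis=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(np.hstack([X_demo, X_shadow]), y_demo)

n_real = X_demo.shape[1]
real_importance = rf.feature_importances_[:n_real]
shadow_max = rf.feature_importances_[n_real:].max()

# A "hit" is a real feature that beats the best shadow feature in this round.
hits = np.flatnonzero(real_importance > shadow_max)
print(f'{len(hits)} of {n_real} features beat the best shadow feature')
```

A single round is noisy; repeating it and testing the hit counts is what separates confirmed, tentative, and rejected features.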
Selects features that are individually relevant but minimally redundant with each other.
from mrmr import mrmr_classif
# K: Number of features to select; start with 50-100 for omics
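# Give y the same index as X so each label is paired with the right sample
selected_features = mrmr_classif(X=X, y=pd.Series(np.asarray(y), index=X.index), K=50)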
X_selected = X[selected_features]
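To make the relevance-versus-redundancy trade-off concrete, here is a hedged sketch of a greedy mRMR loop. It scores each candidate as its F-statistic divided by its mean absolute Pearson correlation with the features already chosen. The helper `greedy_mrmr` is hypothetical and written for illustration; it is not the `mrmr` package's code, and the package may differ in details.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif


def greedy_mrmr(X, y, k):
    """Greedy mRMR sketch: F-statistic relevance / mean |r| redundancy."""
    relevance = pd.Series(f_classif(X, y)[0], index=X.columns)
    abs_corr = X.corr().abs()
    # Start with the single most relevant feature.
    selected = [relevance.idxmax()]
    while len(selected) < k:
        candidates = relevance.index.difference(selected)
        # Mean absolute correlation with the already-selected features;
        # clipped so a near-zero redundancy does not dominate the score.
        redundancy = abs_corr.loc[candidates, selected].mean(axis=1).clip(lower=1e-3)
        score = relevance[candidates] / redundancy
        selected.append(score.idxmax())
    return selected


# Synthetic data with hypothetical gene names.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=4, n_redundant=0,
    shuffle=False, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f'gene_{i}' for i in range(20)])
picked = greedy_mrmr(X_demo, y_demo, k=5)
print(picked)
```

Because each step divides by redundancy, a feature that is highly correlated with an earlier pick is penalized even if its own F-statistic is high.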
L1 regularization drives irrelevant coefficients to zero.
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# cv=5: Standard for selection; eps and n_alphas control alpha grid
# LassoCV is a regression model, so y must be numeric (e.g. 0/1 class labels)
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X_scaled, y)
selected_mask = lasso.coef_ != 0
selected = X.columns[selected_mask]
print(f'LASSO selected {len(selected)} features at alpha={lasso.alpha_:.4f}')
coefs = pd.Series(lasso.coef_, index=X.columns)
nonzero = coefs[coefs != 0].sort_values(key=abs, ascending=False)
Reduce dimensionality before more expensive methods.
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
# f_classif: Fast, assumes normality; good for log-counts
# mutual_info_classif: Nonlinear relationships but slower
# k=1000: Reasonable pre-filter; increase for larger omics datasets (>10k features)
selector = SelectKBest(f_classif, k=1000)
X_filtered = selector.fit_transform(X, y)
selected_idx = selector.get_support(indices=True)
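`fit_transform` returns a bare array, so feature names are lost unless they are mapped back through the selector's mask. The sketch below, on synthetic data with hypothetical gene names, shows one way to recover the names and tabulate the per-feature F-statistics and p-values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data standing in for an omics matrix.
X_arr, y_demo = make_classification(
    n_samples=100, n_features=200, n_informative=8, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f'gene_{i}' for i in range(200)])

selector_demo = SelectKBest(f_classif, k=20).fit(X_demo, y_demo)

# Boolean mask -> column names, keeping the filtered data as a DataFrame.
kept = X_demo.columns[selector_demo.get_support()]
X_demo_filtered = X_demo.loc[:, kept]

# Per-feature scores, sorted so the strongest candidates come first.
score_table = pd.DataFrame({
    'feature': X_demo.columns,
    'F': selector_demo.scores_,
    'p_value': selector_demo.pvalues_,
}).sort_values('F', ascending=False)
print(score_table.head())
```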
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
# Pre-filter then Boruta for efficiency
pipe = Pipeline([
    ('prefilter', SelectKBest(f_classif, k=5000)),
    ('boruta', BorutaPy(RandomForestClassifier(n_jobs=-1), max_iter=100, random_state=42))
])
# Note: BorutaPy doesn't follow the sklearn API perfectly; if pipe.fit(X, y) fails,
# fit the pre-filter and BorutaPy manually on NumPy arrays
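One reason to keep the pre-filter inside a `Pipeline` is that the selector is then refit on each training fold during cross-validation, so test-fold labels never influence which features are kept. The following sketch assumes a plain logistic regression as the downstream model and synthetic data; both are stand-ins, not part of the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Many more features than samples, as is typical for omics data.
X_demo, y_demo = make_classification(
    n_samples=120, n_features=500, n_informative=10, random_state=0,
)

pipe_cv = Pipeline([
    ('scale', StandardScaler()),
    ('prefilter', SelectKBest(f_classif, k=50)),  # refit on every training fold
    ('clf', LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe_cv, X_demo, y_demo, cv=cv, scoring='roc_auc')
print(f'AUC: {scores.mean():.3f} +/- {scores.std():.3f}')
```

Selecting features on the full dataset first and cross-validating afterwards would typically give an optimistically biased estimate.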
| Method | Strengths | Weaknesses | Use When |
|--------|-----------|------------|----------|
| Boruta | Finds all relevant features | Slow on large data | Want complete biomarker panel |
| mRMR | Reduces redundancy | Fixed K | Want compact signature |
| LASSO | Sparse, interpretable | Picks one of correlated | Want minimal predictive set |
| Univariate | Fast | Ignores interactions | Pre-filtering |
Goal: Identify biomarkers that are robustly selected across different data subsets, filtering out features that are only informative in specific subsamples.
Approach: Run LASSO feature selection on many bootstrap resamples, count how often each feature is selected across all iterations, and retain only features selected in more than 60% of bootstrap samples.
from sklearn.linear_model import LogisticRegression
import numpy as np
n_bootstrap = 100
stability_threshold = 0.6  # Keep features selected in >60% of bootstrap samples
rng = np.random.default_rng(42)  # Seeded so the resampling is reproducible
y_arr = np.asarray(y)  # Positional indexing, whatever y's type or index
selection_counts = np.zeros(X.shape[1])
for i in range(n_bootstrap):
    idx = rng.choice(len(X), size=len(X), replace=True)
    X_boot, y_boot = X.iloc[idx], y_arr[idx]
    # Standardize X beforehand: the L1 penalty and the saga solver are scale-sensitive
    lasso = LogisticRegression(penalty='l1', solver='saga', C=0.1, max_iter=1000)
    lasso.fit(X_boot, y_boot)
    selection_counts += (lasso.coef_[0] != 0)
stable_features = X.columns[selection_counts / n_bootstrap > stability_threshold]