Version Compatibility

Reference examples tested with: numpy 1.26+, pandas 2.2+, scikit-learn 1.4+, boruta 0.4+, mrmr-selection 0.2+.

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show <package> then help(module.function) to check signatures

BorutaPy expects numpy arrays and breaks on newer numpy where the np.float/np.int aliases were removed -- pin a compatible numpy or use a maintained fork. On scikit-learn 1.8+ the LogisticRegression(penalty=) argument is deprecated (removed in 1.10) in favor of l1_ratio+C; the examples show the 1.4-1.7 form. If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Feature Selection for Biomarker Discovery

"Find the biomarkers in my omics data" -> First decide which question is being answered (all-relevant vs minimal-optimal), then select features INSIDE a resampling loop, then quantify stability -- because a selected list means little without it.

All-relevant (which genes carry signal?): BorutaPy(rf)
Minimal-optimal (smallest predictive set?): ElasticNetCV, LogisticRegressionCV(penalty='elasticnet')
Stability (does the list reproduce?): bootstrap selection frequencies + a stability index

The Single Most Important Modern Insight -- Most Gene Signatures Do Not Replicate, and Significance Is the Wrong Bar

A signature being "significantly associated with outcome" is near-worthless evidence: random gene sets -- and signatures of biologically irrelevant phenomena -- are significantly associated with breast-cancer survival, often matching published prognostic signatures, because the transcriptome is dominated by a few axes (proliferation) that almost any large gene set captures (Venet 2011). The correct null is not "no association" but random gene sets of equal size plus a proliferation meta-gene. Two further hard facts complete the picture: many disjoint gene lists predict equally well (Ein-Dor 2005), so non-overlap with a prior list is the expected result, not a contradiction; and obtaining a stable list (as opposed to an accurate predictor) needs on the order of thousands of samples (Ein-Dor 2006), far more than typical omics n.

The operational consequences run through every section below: report a stability index next to accuracy; benchmark against a random-signature and proliferation-meta-gene null; never interpret the specific genes a minimal-optimal selector kept as "the biomarkers"; and keep selection inside the cross-validation loop or the reported performance is fiction.

All-Relevant vs Minimal-Optimal (the distinction usually conflated)

This axis matters more than filter/wrapper/embedded. Choosing the wrong one is the most common conceptual error in applied biomarker papers.

Minimal-optimal = the smallest subset giving optimal prediction (LASSO, RFE, forward selection). If two genes are correlated and both informative, it keeps one and drops the other; the dropped gene is still biologically relevant. Minimal-optimal sets are non-unique, unstable, and systematically exclude redundant-but-real features. Absence from a minimal-optimal set is not evidence of irrelevance.
All-relevant = every feature carrying information, redundant or not (Boruta: keep anything beating the best "shadow" permuted feature). This is the right framing for biological interpretation -- the whole co-expression module is wanted, not one representative.

Decision rule: parsimonious assay with few measurements -> minimal-optimal; understand biology / enumerate implicated genes / pathway analysis -> all-relevant; stable deployable signature -> elastic net or stability selection.

Methods Taxonomy

| Family | Method | Optimizes | Redundancy handling | Output | Key trap | |--------|--------|-----------|---------------------|--------|----------| | Filter (univariate) | t-test / SelectKBest(f_classif) | Marginal association, one gene at a time | None (keeps correlated blocks) | Ranked list | Ignores multivariate structure; huge multiplicity | | Filter (multivariate) | mRMR (Peng 2005) | Relevance minus redundancy | Explicit penalty | Ranked K | Greedy/first-order; K must still be chosen | | Wrapper | RFE / RFECV; SVM-RFE | A specific model's accuracy | Indirect | Ranked subset | Expensive; must be inside CV; SVM-RFE needs a linear kernel | | Embedded | LASSO (Tibshirani 1996) | Prediction + L1 sparsity | None -- arbitrarily keeps one of a correlated group | Sparse coefs | Unstable under collinearity; caps at n features when p>n | | Embedded | Elastic net (Zou-Hastie 2005) | Prediction + L1+L2 grouping | Keeps correlated groups together | Sparse coefs | Two hyperparameters; still not "causal" | | All-relevant | Boruta (Kursa 2010) | Every feature beating shadow features | Keeps all relevant (redundant included) | Confirmed/Tentative/Rejected | Slow; returns redundant sets by design | | Meta / stability | Stability selection (Meinshausen 2010; Shah-Samworth 2013) | Selection probability under subsampling | Inherits base learner | Selection frequencies + threshold | Error bounds assume exchangeability omics violates |

Decision Tree by Scenario

| Scenario | Recommended approach | Why | |----------|---------------------|-----| | Want every implicated gene for pathway/biology interpretation | Boruta (all-relevant), or stability-based consensus | Keeps whole correlated modules, not one representative | | Want a small deployable assay/signature | Elastic-net (not bare LASSO); report stability | L2 grouping keeps correlated genes together and resamples more stably | | p is huge (>20k); selection is slow | Univariate pre-filter to a few thousand, then Boruta/elastic-net, all inside the CV fold | Cheap dimensionality cut; never pre-filter on the full dataset | | Need to report model performance | Wrap selection in a Pipeline, estimate by nested CV | Selection outside CV inflates AUC to ~perfect on pure noise | | Single-cell biomarker across conditions | Pseudobulk per donor, then select at the donor level | The unit is the donor, not the cell (Squair 2021); cells are pseudoreplicates | | Want to know which genes "drive" a trained model | -> machine-learning/prediction-explanation | SHAP ranking is not validated selection | | Want unbiased accuracy/calibration of the selected model | -> machine-learning/model-validation | Selection is one step; validation is its own discipline |

Leakage-Safe Selection (the single most damaging error to avoid)

Goal: Estimate the performance of a selection-plus-model pipeline without optimistic bias.

Approach: Put selection in a Pipeline so it is re-fit on each training fold only; the held-out fold never informs which features are kept. Selecting the top-k features on the whole dataset before cross-validating the classifier produces near-zero apparent error even on pure noise (Ambroise-McLachlan 2002). Selection is where almost all overfitting capacity lives when p>>n.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=20)),               # re-fit per fold -> no leakage
    ('clf', LogisticRegression(penalty='l2', max_iter=5000)),
])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')   # honest estimate
print(f'Nested-safe AUC: {auc.mean():.3f} +/- {auc.std():.3f}')

The standalone Boruta/LASSO blocks below select features on a full matrix to discover candidates; that is fine for discovery, but any performance number must come from the Pipeline pattern above, with selection inside the fold.

All-Relevant: Boruta

Goal: Enumerate every feature carrying signal, including redundant co-expressed genes.

Approach: Compare each real feature's importance to the maximum importance of permuted "shadow" features over many iterations; confirm features that consistently beat the best shadow.

from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, class_weight='balanced', max_depth=5, random_state=42)
# perc=100 uses the max shadow importance (strict); two_step (default True) controls the multiple-testing correction.
boruta = BorutaPy(rf, n_estimators='auto', perc=100, two_step=True, max_iter=100, random_state=42)
boruta.fit(X.values, y.values)                              # numpy arrays, not pandas

confirmed = X.columns[boruta.support_]                      # all-relevant set (redundant by design)
tentative = X.columns[boruta.support_weak_]

Minimal-Optimal: Elastic Net (prefer over bare LASSO)

Goal: A small, stable predictive signature from correlated omics features.

Approach: Use elastic net, whose L2 term induces a grouping effect so correlated genes enter or leave together; standardize first because the penalty is scale-sensitive. Bare LASSO keeps one arbitrary member of a correlated group and flips on tiny data perturbations.

from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)               # for real scoring, do this inside the Pipeline
# saga is the only solver supporting elasticnet; C = 1/lambda (opposite of alpha in Lasso/ElasticNet).
enet = LogisticRegressionCV(penalty='elasticnet', solver='saga',
                            l1_ratios=[0.1, 0.5, 0.9], Cs=20, cv=10, max_iter=10000)
enet.fit(X_scaled, y)
selected = X.columns[enet.coef_[0] != 0]

Stability: Are the Selected Features Reproducible?

Goal: Distinguish a robust signature from a resampling accident, and report stability alongside accuracy.

Approach: Run the selector on many subsamples, count per-feature selection frequency, keep features above a threshold (0.6 is the common default), and compute a chance-corrected stability index. Use the Nogueira 2018 measure (handles variable-size selections, gives a confidence interval); the older Kuncheva index needs equal-size subsets and breaks for LASSO.

import numpy as np
from sklearn.linear_model import LogisticRegression

n_subsample, p = 100, X.shape[1]
counts = np.zeros(p)
subsets = []
for _ in range(n_subsample):
    idx = np.random.choice(len(X), size=len(X) // 2, replace=False)   # n/2 subsampling
    fit = LogisticRegression(penalty='l1', solver='liblinear', C=0.1, max_iter=2000).fit(X.iloc[idx], y.iloc[idx])
    mask = fit.coef_[0] != 0
    counts += mask
    subsets.append(mask.astype(int))

stable = X.columns[counts / n_subsample > 0.6]             # pi_thr=0.6: Meinshausen-Buhlmann default
# Nogueira stability index (chance-corrected; 1 = identical selections, ~0 = random):
Z = np.array(subsets); pbar = Z.mean(axis=0); k = Z.sum(axis=1)
stability = 1 - (Z.var(axis=0, ddof=1).mean()) / ((k.mean() / p) * (1 - k.mean() / p))
print(f'{len(stable)} stable features; Nogueira stability = {stability:.2f}')

Per-Method Failure Modes

Interpreting minimal-optimal membership as biology

Trigger: Reporting "LASSO selected gene X but not its co-expressed partner Y" as a biological finding.
Mechanism: L1 geometry keeps one vertex of a correlated group arbitrarily; the choice flips across resamples.
Symptom: Selected genes change completely on a different train/test split though accuracy is stable.
Fix: Use elastic net (grouping effect) or report selection frequencies; never read membership as importance ordering.

Selection-before-CV leakage

Trigger: Pick top-k features on all samples, then cross-validate the classifier on those features.
Mechanism: The held-out folds informed which genes were kept; selection is the dominant overfitting capacity in p>>n.
Symptom: Near-perfect CV accuracy, even reproducible on label-permuted (null) data; collapse on an external cohort.
Fix: Selection lives inside the CV fold (Pipeline); estimate by nested CV (machine-learning/model-validation).

Significance against the wrong null

Trigger: Concluding a signature is real because it significantly predicts outcome.
Mechanism: Random gene sets clear that bar; the transcriptome's proliferation axis is captured by almost any large set (Venet 2011).
Symptom: The signature does not beat a size-matched random signature or a proliferation meta-gene in independent data.
Fix: Benchmark against random-signature and proliferation-meta-gene nulls; require added value over clinical covariates in an independent cohort.

Winner's curse / inflated effect sizes

Trigger: Estimating effect sizes or AUC on the same data used to select features.
Mechanism: Selected features are disproportionately those whose noise inflated their apparent effect (Goring 2001); the inflation can be near-total for small true effects.
Symptom: Discovery AUC much higher than replication; replication is under-powered because it was sized to the inflated effect.
Fix: Estimate effects on an independent split (cross-fitting / data-splitting); size replication for the shrunken effect.

Pseudoreplication in single-cell selection

Trigger: Treating thousands of cells from a few donors as independent samples.
Mechanism: Cells within a donor are correlated; the effective n is the number of donors.
Symptom: Grossly inflated significance and false discoveries.
Fix: Pseudobulk per donor, select at the donor level (Squair 2021); confront the small true n.

Reconciliation: When Methods Disagree

| Pattern | Likely cause | Action | |---------|--------------|--------| | Boruta keeps 200 genes, LASSO keeps 12 | All-relevant vs minimal-optimal answering different questions | Both can be right; pick by goal, do not "average" them | | A list barely overlaps a published signature | Many disjoint equally-predictive lists exist (Ein-Dor 2005) | Expected, not a contradiction; compare performance and stability, not membership | | High accuracy, low stability index | Resampling accident exploiting a dominant axis | Distrust the specific genes; prefer the lower-accuracy higher-stability candidate | | FDR-clean list still fails to replicate | FDR controls testing, not selection stability | They are orthogonal; add stability + independent validation |

Quantitative Thresholds

| Threshold | Source | Rationale | |-----------|--------|-----------| | Samples for a stable gene list ~ thousands | Ein-Dor 2006 | Small effects need large n for reproducible membership (accuracy needs far fewer) | | Selection inside every CV fold; nested CV for tuning | Ambroise 2002; Simon 2003 | Selection outside CV gives ~0% error on noise | | Stability threshold pi_thr ~ 0.6-0.9 | Meinshausen-Buhlmann 2010 | Selection-frequency cutoff; tune to false-positive cost | | Random-signature null | Venet 2011 | Benchmark against size-matched random sets + proliferation meta-gene | | Single-cell unit = donor (pseudobulk) | Squair 2021 | Cells are pseudoreplicates | | Biomarker clinical translation rate <1% | Kern 2012 | Sets expectations; failures follow a foreseeable taxonomy |

Common Errors

| Error / symptom | Cause | Solution | |-----------------|-------|----------| | BorutaPy raises on np.float/pandas input | Newer numpy removed aliases; needs arrays | Pass X.values, y.values; pin numpy or use a fork | | Regularization strength backwards | C=1/lambda (logistic) vs alpha (Lasso/ElasticNet) are opposite conventions | Verify which API; small C = strong shrinkage | | elasticnet penalty errors | Only solver='saga' supports it | Set solver='saga', pass l1_ratio(s) | | mrmr_classif returns wrong type | Pandas backend needs a DataFrame X and Series y | Pass X DataFrame, y=pd.Series(y); K must still be chosen | | glmnet signature unstable across runs (R) | Used lambda.min | Use lambda.1se for a sparser, more reproducible set |

References

Tibshirani R. 1996. Regression shrinkage and selection via the lasso. J R Stat Soc B 58:267-288.
Goring HHH, Terwilliger JD, Blangero J. 2001. Large upward bias in estimation of locus-specific effects from genomewide scans. Am J Hum Genet 69:1357-1369.
Ambroise C, McLachlan GJ. 2002. Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS 99:6562-6566.
Simon R, Radmacher MD, Dobbin K, McShane LM. 2003. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 95:14-18.
Ein-Dor L, Kela I, Getz G, Givol D, Domany E. 2005. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 21:171-178.
Peng H, Long F, Ding C. 2005. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27:1226-1238.
Zou H, Hastie T. 2005. Regularization and variable selection via the elastic net. J R Stat Soc B 67:301-320.
Ein-Dor L, Zuk O, Domany E. 2006. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. PNAS 103:5923-5928.
Kursa MB, Rudnicki WR. 2010. Feature selection with the Boruta package. J Stat Softw 36:1-13.
Meinshausen N, Buhlmann P. 2010. Stability selection. J R Stat Soc B 72:417-473.
Venet D, Dumont JE, Detours V. 2011. Most random gene expression signatures are significantly associated with breast cancer outcome. PLoS Comput Biol 7:e1002240.
Kern SE. 2012. Why your new cancer biomarker may never work: recurrent patterns and remarkable diversity in biomarker failures. Cancer Res 72:6097-6101.
Shah RD, Samworth RJ. 2013. Variable selection with error control: another look at stability selection. J R Stat Soc B 75:55-80.
Nogueira S, Sechidis K, Brown G. 2018. On the stability of feature selection algorithms. J Mach Learn Res 18:1-54.
Squair JW, Gautier M, Kathe C, et al. 2021. Confronting false discoveries in single-cell differential expression. Nat Commun 12:5692.

Related Skills

machine-learning/model-validation - Nested CV and leakage-safe estimation of the selected model
machine-learning/prediction-explanation - Why SHAP rankings are not a validated selection method
machine-learning/omics-classifiers - Build a classifier from the selected features
differential-expression/de-results - Pre-filter candidates with differential expression
experimental-design/multiple-testing - FDR control and why it is orthogonal to selection stability
experimental-design/power-analysis - Sample size for a stable signature vs an accurate predictor
pathway-analysis/go-enrichment - Functional enrichment of an all-relevant gene set

Version Compatibility

Reference examples tested with: numpy 1.26+, pandas 2.2+, scikit-learn 1.4+, boruta 0.4+, mrmr-selection 0.2+.

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show <package> then help(module.function) to check signatures

Feature Selection for Biomarker Discovery

All-relevant (which genes carry signal?): BorutaPy(rf)
Minimal-optimal (smallest predictive set?): ElasticNetCV, LogisticRegressionCV(penalty='elasticnet')
Stability (does the list reproduce?): bootstrap selection frequencies + a stability index

The Single Most Important Modern Insight -- Most Gene Signatures Do Not Replicate, and Significance Is the Wrong Bar

All-Relevant vs Minimal-Optimal (the distinction usually conflated)

This axis matters more than filter/wrapper/embedded. Choosing the wrong one is the most common conceptual error in applied biomarker papers.

Minimal-optimal = the smallest subset giving optimal prediction (LASSO, RFE, forward selection). If two genes are correlated and both informative, it keeps one and drops the other; the dropped gene is still biologically relevant. Minimal-optimal sets are non-unique, unstable, and systematically exclude redundant-but-real features. Absence from a minimal-optimal set is not evidence of irrelevance.
All-relevant = every feature carrying information, redundant or not (Boruta: keep anything beating the best "shadow" permuted feature). This is the right framing for biological interpretation -- the whole co-expression module is wanted, not one representative.

Methods Taxonomy

Decision Tree by Scenario

Leakage-Safe Selection (the single most damaging error to avoid)

Goal: Estimate the performance of a selection-plus-model pipeline without optimistic bias.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=20)),               # re-fit per fold -> no leakage
    ('clf', LogisticRegression(penalty='l2', max_iter=5000)),
])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')   # honest estimate
print(f'Nested-safe AUC: {auc.mean():.3f} +/- {auc.std():.3f}')

All-Relevant: Boruta

Goal: Enumerate every feature carrying signal, including redundant co-expressed genes.

Approach: Compare each real feature's importance to the maximum importance of permuted "shadow" features over many iterations; confirm features that consistently beat the best shadow.

from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, class_weight='balanced', max_depth=5, random_state=42)
# perc=100 uses the max shadow importance (strict); two_step (default True) controls the multiple-testing correction.
boruta = BorutaPy(rf, n_estimators='auto', perc=100, two_step=True, max_iter=100, random_state=42)
boruta.fit(X.values, y.values)                              # numpy arrays, not pandas

confirmed = X.columns[boruta.support_]                      # all-relevant set (redundant by design)
tentative = X.columns[boruta.support_weak_]

Minimal-Optimal: Elastic Net (prefer over bare LASSO)

Goal: A small, stable predictive signature from correlated omics features.

from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)               # for real scoring, do this inside the Pipeline
# saga is the only solver supporting elasticnet; C = 1/lambda (opposite of alpha in Lasso/ElasticNet).
enet = LogisticRegressionCV(penalty='elasticnet', solver='saga',
                            l1_ratios=[0.1, 0.5, 0.9], Cs=20, cv=10, max_iter=10000)
enet.fit(X_scaled, y)
selected = X.columns[enet.coef_[0] != 0]

Stability: Are the Selected Features Reproducible?

Goal: Distinguish a robust signature from a resampling accident, and report stability alongside accuracy.

import numpy as np
from sklearn.linear_model import LogisticRegression

n_subsample, p = 100, X.shape[1]
counts = np.zeros(p)
subsets = []
for _ in range(n_subsample):
    idx = np.random.choice(len(X), size=len(X) // 2, replace=False)   # n/2 subsampling
    fit = LogisticRegression(penalty='l1', solver='liblinear', C=0.1, max_iter=2000).fit(X.iloc[idx], y.iloc[idx])
    mask = fit.coef_[0] != 0
    counts += mask
    subsets.append(mask.astype(int))

stable = X.columns[counts / n_subsample > 0.6]             # pi_thr=0.6: Meinshausen-Buhlmann default
# Nogueira stability index (chance-corrected; 1 = identical selections, ~0 = random):
Z = np.array(subsets); pbar = Z.mean(axis=0); k = Z.sum(axis=1)
stability = 1 - (Z.var(axis=0, ddof=1).mean()) / ((k.mean() / p) * (1 - k.mean() / p))
print(f'{len(stable)} stable features; Nogueira stability = {stability:.2f}')

Per-Method Failure Modes

Interpreting minimal-optimal membership as biology

Trigger: Reporting "LASSO selected gene X but not its co-expressed partner Y" as a biological finding.
Mechanism: L1 geometry keeps one vertex of a correlated group arbitrarily; the choice flips across resamples.
Symptom: Selected genes change completely on a different train/test split though accuracy is stable.
Fix: Use elastic net (grouping effect) or report selection frequencies; never read membership as importance ordering.

Selection-before-CV leakage

Trigger: Pick top-k features on all samples, then cross-validate the classifier on those features.
Mechanism: The held-out folds informed which genes were kept; selection is the dominant overfitting capacity in p>>n.
Symptom: Near-perfect CV accuracy, even reproducible on label-permuted (null) data; collapse on an external cohort.
Fix: Selection lives inside the CV fold (Pipeline); estimate by nested CV (machine-learning/model-validation).

Significance against the wrong null

Trigger: Concluding a signature is real because it significantly predicts outcome.
Mechanism: Random gene sets clear that bar; the transcriptome's proliferation axis is captured by almost any large set (Venet 2011).
Symptom: The signature does not beat a size-matched random signature or a proliferation meta-gene in independent data.
Fix: Benchmark against random-signature and proliferation-meta-gene nulls; require added value over clinical covariates in an independent cohort.

Winner's curse / inflated effect sizes

Trigger: Estimating effect sizes or AUC on the same data used to select features.
Mechanism: Selected features are disproportionately those whose noise inflated their apparent effect (Goring 2001); the inflation can be near-total for small true effects.
Symptom: Discovery AUC much higher than replication; replication is under-powered because it was sized to the inflated effect.
Fix: Estimate effects on an independent split (cross-fitting / data-splitting); size replication for the shrunken effect.

Pseudoreplication in single-cell selection

Trigger: Treating thousands of cells from a few donors as independent samples.
Mechanism: Cells within a donor are correlated; the effective n is the number of donors.
Symptom: Grossly inflated significance and false discoveries.
Fix: Pseudobulk per donor, select at the donor level (Squair 2021); confront the small true n.

Reconciliation: When Methods Disagree

Quantitative Thresholds

Common Errors

References

Tibshirani R. 1996. Regression shrinkage and selection via the lasso. J R Stat Soc B 58:267-288.
Goring HHH, Terwilliger JD, Blangero J. 2001. Large upward bias in estimation of locus-specific effects from genomewide scans. Am J Hum Genet 69:1357-1369.
Ambroise C, McLachlan GJ. 2002. Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS 99:6562-6566.
Simon R, Radmacher MD, Dobbin K, McShane LM. 2003. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 95:14-18.
Ein-Dor L, Kela I, Getz G, Givol D, Domany E. 2005. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 21:171-178.
Peng H, Long F, Ding C. 2005. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27:1226-1238.
Zou H, Hastie T. 2005. Regularization and variable selection via the elastic net. J R Stat Soc B 67:301-320.
Ein-Dor L, Zuk O, Domany E. 2006. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. PNAS 103:5923-5928.
Kursa MB, Rudnicki WR. 2010. Feature selection with the Boruta package. J Stat Softw 36:1-13.
Meinshausen N, Buhlmann P. 2010. Stability selection. J R Stat Soc B 72:417-473.
Venet D, Dumont JE, Detours V. 2011. Most random gene expression signatures are significantly associated with breast cancer outcome. PLoS Comput Biol 7:e1002240.
Kern SE. 2012. Why your new cancer biomarker may never work: recurrent patterns and remarkable diversity in biomarker failures. Cancer Res 72:6097-6101.
Shah RD, Samworth RJ. 2013. Variable selection with error control: another look at stability selection. J R Stat Soc B 75:55-80.
Nogueira S, Sechidis K, Brown G. 2018. On the stability of feature selection algorithms. J Mach Learn Res 18:1-54.
Squair JW, Gautier M, Kathe C, et al. 2021. Confronting false discoveries in single-cell differential expression. Nat Commun 12:5692.

Related Skills

machine-learning/model-validation - Nested CV and leakage-safe estimation of the selected model
machine-learning/prediction-explanation - Why SHAP rankings are not a validated selection method
machine-learning/omics-classifiers - Build a classifier from the selected features
differential-expression/de-results - Pre-filter candidates with differential expression
experimental-design/multiple-testing - FDR control and why it is orthogonal to selection stability
experimental-design/power-analysis - Sample size for a stable signature vs an accurate predictor
pathway-analysis/go-enrichment - Functional enrichment of an all-relevant gene set

Adoption

GPTomics/bio-machine-learning-biomarker-discovery

$ install --global

Security Scan Results

SKILL.md

Version Compatibility

Feature Selection for Biomarker Discovery

The Single Most Important Modern Insight -- Most Gene Signatures Do Not Replicate, and Significance Is the Wrong Bar

All-Relevant vs Minimal-Optimal (the distinction usually conflated)

Methods Taxonomy

Decision Tree by Scenario

Leakage-Safe Selection (the single most damaging error to avoid)

All-Relevant: Boruta

Minimal-Optimal: Elastic Net (prefer over bare LASSO)

Stability: Are the Selected Features Reproducible?

Per-Method Failure Modes

Interpreting minimal-optimal membership as biology

Selection-before-CV leakage

Significance against the wrong null

Winner's curse / inflated effect sizes

Pseudoreplication in single-cell selection

Reconciliation: When Methods Disagree

Quantitative Thresholds

Common Errors

References

Related Skills

Related Skills

GPTomics/bio-workflows-clip-pipeline

GPTomics/bio-comparative-genomics-whole-genome-duplication

GPTomics/bio-comparative-genomics-whole-genome-alignment

GPTomics/bio-comparative-genomics-synteny-analysis

GPTomics/bio-machine-learning-biomarker-discovery

$ install --global

Security Scan Results

SKILL.md

Version Compatibility

Feature Selection for Biomarker Discovery

The Single Most Important Modern Insight -- Most Gene Signatures Do Not Replicate, and Significance Is the Wrong Bar

All-Relevant vs Minimal-Optimal (the distinction usually conflated)

Methods Taxonomy

Decision Tree by Scenario

Leakage-Safe Selection (the single most damaging error to avoid)

All-Relevant: Boruta

Minimal-Optimal: Elastic Net (prefer over bare LASSO)

Stability: Are the Selected Features Reproducible?

Per-Method Failure Modes

Interpreting minimal-optimal membership as biology

Selection-before-CV leakage

Significance against the wrong null

Winner's curse / inflated effect sizes

Pseudoreplication in single-cell selection

Reconciliation: When Methods Disagree

Quantitative Thresholds

Common Errors

References

Related Skills

Related Skills

GPTomics/bio-workflows-clip-pipeline

GPTomics/bio-comparative-genomics-whole-genome-duplication

GPTomics/bio-comparative-genomics-whole-genome-alignment

GPTomics/bio-comparative-genomics-synteny-analysis