Melikechi Omar, Dunson David B, Miller Jeffrey W
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, United States.
Department of Statistical Science, Duke University, Durham, NC, 27708, United States.
Bioinformatics. 2025 May 6;41(5). doi: 10.1093/bioinformatics/btaf299.
Feature selection is a critical task in machine learning and statistics. However, existing feature selection methods either (i) rely on parametric methods such as linear or generalized linear models, (ii) lack theoretical false discovery control, or (iii) identify few true positives.
We introduce a general feature selection method with finite-sample false discovery control based on applying integrated path stability selection (IPSS) to arbitrary feature importance scores. The method is nonparametric whenever the importance scores are nonparametric, and it estimates q-values, which are better suited to high-dimensional data than P-values. We focus on two special cases using importance scores from gradient boosting (IPSSGB) and random forests (IPSSRF). Extensive nonlinear simulations with RNA sequencing data show that both methods accurately control the false discovery rate and detect more true positives than existing methods. Both methods are also efficient, running in under 20 s when there are 500 samples and 5000 features. We apply IPSSGB and IPSSRF to detect microRNAs and genes related to cancer, finding that they yield better predictions with fewer features than existing approaches.
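As a rough illustration of what applying stability selection to an arbitrary feature importance score can look like, the sketch below computes how often each feature ranks among the top few across random half-subsamples of the data. It is not IPSS: it uses a single fixed cutoff rather than integrating over a path, it gives no false discovery guarantee and no q-values, and absolute Pearson correlation stands in for the gradient boosting or random forest importances. The names `abs_correlation`, `selection_frequencies`, `top_k`, the 0.9 frequency threshold, and the toy data are all assumptions made for this example.

```python
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; a stand-in for any importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def selection_frequencies(X, y, score, top_k, n_subsamples=50, seed=0):
    """Fraction of random half-subsamples in which each feature is in the top_k."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        rows = rng.sample(range(n), n // 2)  # half of the samples, without replacement
        y_sub = [y[i] for i in rows]
        scores = [score([X[i][j] for i in rows], y_sub) for j in range(p)]
        top = sorted(range(p), key=lambda j: scores[j], reverse=True)[:top_k]
        for j in top:
            counts[j] += 1
    return [c / n_subsamples for c in counts]


# Toy data (hypothetical): 200 samples, 20 features, only features 0 and 1 affect y.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(20)] for _ in range(200)]
y = [3 * row[0] + 3 * row[1] + data_rng.gauss(0, 0.5) for row in X]

freqs = selection_frequencies(X, y, abs_correlation, top_k=2)
selected = [j for j, f in enumerate(freqs) if f >= 0.9]  # keep stably selected features
print(selected)
```

Because `score` is passed in as a function, any importance measure with the same signature could be substituted without changing the subsampling loop, which is the sense in which such a procedure is nonparametric whenever the importance score is.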
All code and data used in this work are available on GitHub (https://github.com/omelikechi/ipss_bioinformatics) and permanently archived on Zenodo (https://doi.org/10.5281/zenodo.15335289). A Python package for implementing IPSS is available on GitHub (https://github.com/omelikechi/ipss) and PyPI (https://pypi.org/project/ipss/). An R implementation of IPSS is also available on GitHub (https://github.com/omelikechi/ipssR).