Biostatistics Department, Princess Margaret Cancer Research Centre, Toronto, Ontario, Canada.
Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada.
PLoS One. 2021 Feb 16;16(2):e0246159. doi: 10.1371/journal.pone.0246159. eCollection 2021.
Feature selection on high-dimensional data that also accounts for interaction effects is a critical challenge for classical statistical learning techniques. Existing feature selection algorithms such as random LASSO leverage the ability of LASSO to handle high-dimensional data. However, random LASSO has two main limitations: it cannot consider interaction terms, and it lacks a statistical test for determining the significance of the selected features. This study proposes the High Dimensional Selection with Interactions (HDSI) algorithm, a new feature selection method that can handle high-dimensional data, incorporate interaction terms, provide statistical inference for the selected features, and leverage the capability of existing classical statistical techniques. The method allows any statistical technique, such as LASSO or subset selection, to be applied to multiple bootstrapped samples, each of which contains a randomly selected subset of features. Each bootstrapped sample also incorporates the interaction terms among its randomly sampled features. The features selected by each model are pooled, and their statistical significance is determined. The statistically significant features form the final output of the approach, and their final coefficients are estimated using an appropriate statistical technique. The performance of HDSI is evaluated using both simulated data and real studies. In general, HDSI outperforms commonly used algorithms such as LASSO, subset selection, adaptive LASSO, random LASSO and group LASSO.
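The workflow described in the abstract (bootstrap the rows, sample a random subset of features, add their interaction terms, fit a model, pool the results) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: the function names (`hdsi_sketch`, `lasso_cd`), the parameters (`n_boot`, `q`, `lam`, `alpha`), the restriction to pairwise interactions, the small coordinate-descent LASSO used as the base model, and above all the significance rule, which here keeps a term when the percentile interval of its pooled coefficients excludes zero. The abstract does not specify the actual test, and the final re-estimation of coefficients is omitted.

```python
import random
from itertools import combinations


def soft_threshold(z, t):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def centered(values):
    """Subtract the mean so that no intercept term is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def lasso_cd(columns, y, lam, n_iter=30):
    """Coordinate-descent LASSO for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    `columns` is a list of centered feature columns and `y` is centered.
    """
    n = len(y)
    beta = [0.0] * len(columns)
    resid = list(y)  # current residual y - Xb
    scale = [sum(v * v for v in col) / n for col in columns]
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            if scale[j] == 0.0:
                continue
            rho = sum(v * r for v, r in zip(col, resid)) / n + scale[j] * beta[j]
            new = soft_threshold(rho, lam) / scale[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - v * delta for v, r in zip(col, resid)]
                beta[j] = new
    return beta


def term_column(X, rows, term):
    """Centered column for a main effect (j,) or an interaction (j, k)."""
    col = []
    for i in rows:
        value = 1.0
        for j in term:
            value *= X[i][j]
        col.append(value)
    return centered(col)


def hdsi_sketch(X, y, n_boot=100, q=4, lam=0.1, alpha=0.05, seed=0):
    """Pool LASSO coefficients over bootstrapped samples of random feature subsets."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    pooled = {}  # term -> coefficients from every model that contained it
    for _ in range(n_boot):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap the rows
        feats = sorted(rng.sample(range(p), q))       # random feature subset
        terms = [(j,) for j in feats] + list(combinations(feats, 2))
        columns = [term_column(X, rows, term) for term in terms]
        y_boot = centered([y[i] for i in rows])
        for term, b in zip(terms, lasso_cd(columns, y_boot, lam)):
            pooled.setdefault(term, []).append(b)
    selected = []
    for term, coefs in pooled.items():
        coefs = sorted(coefs)
        k = len(coefs)
        lower = coefs[int(alpha / 2 * k)]
        upper = coefs[min(k - 1, int((1 - alpha / 2) * k))]
        if lower > 0 or upper < 0:  # assumed rule: interval excludes zero
            selected.append(term)
    return sorted(selected)


# Simulated data: y depends on feature 0 and on the interaction of features 1 and 2.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(120)]
y = [2.0 * r[0] + 1.5 * r[1] * r[2] + data_rng.gauss(0, 0.5) for r in X]
selected = hdsi_sketch(X, y)
print(selected)
```

Each selected term is a tuple of column indices, so `(0,)` denotes a main effect and `(1, 2)` an interaction. Because every model sees only `q` features, the number of candidate interaction terms per fit stays small even when the total number of features is large, which is the practical motivation for combining feature subsampling with bootstrapping.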