Department of Medical Biometry, Informatics and Epidemiology, University Hospital Bonn, Venusberg-Campus 1, 53127, Bonn, Germany.
BMC Bioinformatics. 2021 Sep 16;22(1):441. doi: 10.1186/s12859-021-04340-z.
Statistical boosting is a computational approach to select and estimate interpretable prediction models for high-dimensional biomedical data, leading to implicit regularization and variable selection when combined with early stopping. Traditionally, the set of base-learners is fixed for all iterations and consists of simple regression learners including only one predictor variable at a time. Furthermore, the number of iterations is typically tuned by optimizing the predictive performance, leading to models which often include unnecessarily large numbers of noise variables.
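The classical scheme described above can be illustrated with a minimal sketch: component-wise L2-boosting, where in each iteration every single-predictor least-squares base-learner is fitted to the current residuals and only the best-fitting one receives a small update. This is an illustrative simplification under our own assumptions (squared-error loss, linear base-learners, a fixed step length `nu`), not the authors' implementation; `n_iter` plays the role of the early-stopping parameter that would normally be tuned, e.g. by cross-validation.

```python
import numpy as np

def componentwise_l2_boost(X, y, n_iter=100, nu=0.1):
    """Component-wise L2-boosting with simple linear base-learners,
    each involving only one predictor variable at a time (sketch)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)        # centre predictors; offset is the mean of y
    coef = np.zeros(p)
    f = np.full(n, y.mean())       # current model fit
    for _ in range(n_iter):
        r = y - f                  # negative gradient for squared-error loss
        # fit each single-variable least-squares base-learner to the residuals
        betas = (Xc.T @ r) / (Xc ** 2).sum(axis=0)
        rss = ((r[:, None] - Xc * betas) ** 2).sum(axis=0)
        j = int(np.argmin(rss))    # component with the best fit
        coef[j] += nu * betas[j]   # small step towards the selected learner
        f += nu * betas[j] * Xc[:, j]
    return y.mean(), coef
```

With early stopping (small `n_iter`), many entries of `coef` remain exactly zero, which is the implicit variable selection mentioned above; running too long lets noise variables enter the model.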
We propose three consecutive extensions of classical component-wise gradient boosting. In the first extension, called Subspace Boosting (SubBoost), base-learners can consist of several variables, allowing for multivariable updates in a single iteration. To compensate for the larger flexibility, the ultimate selection of base-learners is based on information criteria, leading to an automatic stopping of the algorithm. As the second extension, Random Subspace Boosting (RSubBoost) additionally includes a random preselection of base-learners in each iteration, enabling scalability to high-dimensional data. In a third extension, called Adaptive Subspace Boosting (AdaSubBoost), an adaptive random preselection of base-learners is considered, focusing on base-learners which have proven to be predictive in previous iterations. Simulation results show that the multivariable updates in the three subspace algorithms are particularly beneficial in cases of high correlations among signal covariates. In several biomedical applications the proposed algorithms tend to yield sparser models than classical statistical boosting, while showing very competitive predictive performance compared to penalized regression approaches such as the (relaxed) lasso and the elastic net.
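Two of the ingredients above, random preselection of candidate base-learners and selection via an information criterion that induces automatic stopping, can be sketched schematically as follows. This is our own strong simplification of the RSubBoost idea (greedy BIC-guided inclusion from randomly drawn candidate subsets, stopping after a `patience` run of iterations without improvement), not the authors' algorithm; the helper `bic` and all parameter names are hypothetical.

```python
import numpy as np

def bic(X, y, subset):
    """BIC of a least-squares fit on the given predictor subset (sketch)."""
    n = len(y)
    if subset:
        D = np.column_stack([np.ones(n), X[:, subset]])
        beta, *_ = np.linalg.lstsq(D, y, rcond=None)
        resid = y - D @ beta
    else:
        resid = y - y.mean()       # null model: intercept only
    rss = resid @ resid
    return n * np.log(rss / n) + np.log(n) * (len(subset) + 1)

def random_subspace_boost(X, y, subset_size=5, n_iter=200, patience=20, seed=1):
    """Schematic random-subspace search: in each iteration a random
    subset of predictors is screened, and a variable enters the model
    only if it improves the BIC. Stops automatically once no
    improvement occurs for `patience` consecutive iterations."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    active = []                    # currently selected variables
    best = bic(X, y, active)
    stall = 0
    for _ in range(n_iter):
        cand = rng.choice(p, size=min(subset_size, p), replace=False)
        improved = False
        for j in cand:             # greedily try each preselected variable
            if j in active:
                continue
            trial = active + [int(j)]
            score = bic(X, y, trial)
            if score < best:       # accept only if the BIC improves
                best, active, improved = score, trial, True
        stall = 0 if improved else stall + 1
        if stall >= patience:      # information criterion stalls: stop
            break
    return sorted(active), best
```

Because the information criterion penalizes model size, noise variables rarely enter, and the algorithm terminates on its own rather than requiring the number of iterations to be tuned for predictive performance.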
The proposed randomized boosting approaches with multivariable base-learners are promising extensions of statistical boosting, particularly suited for highly correlated and sparse high-dimensional settings. The incorporated selection of base-learners via information criteria induces automatic stopping of the algorithms, promoting sparser and more interpretable prediction models.