Medical Faculty, Institute of Medical Biometrics, Informatics and Epidemiology (IMBIE), University of Bonn, Bonn, Germany.
Int J Biostat. 2022 Aug 10;19(1):111-129. doi: 10.1515/ijb-2021-0127. eCollection 2023 May 1.
We combine robust loss functions with statistical boosting algorithms in an adaptive way to perform variable selection and predictive modelling for potentially high-dimensional biomedical data. To achieve robustness against outliers in the outcome variable (vertical outliers), we consider different composite robust loss functions together with base-learners for linear regression. For composite loss functions, such as the Huber loss and the Bisquare loss, a threshold parameter that controls the robustness has to be specified. In the context of boosting algorithms, we propose an approach that adapts the threshold parameter of composite robust losses in each iteration to the current sizes of the residuals, based on a fixed quantile level. We compared the performance of our approach with that of classical M-regression, boosting with standard loss functions, and the lasso regarding prediction accuracy and variable selection in different simulated settings: the adaptive Huber and Bisquare losses led to better performance when the outcome contained outliers or was affected by specific types of corruption. For non-corrupted data, our approach yielded a performance similar to that of boosting with the efficient loss or the lasso. In the analysis of skewed KRT19 protein expression data based on gene expression measurements from human cancer cell lines (NCI-60 cell line panel), boosting with the new adaptive loss functions also performed favourably compared to standard loss functions or competing robust approaches regarding prediction accuracy, and resulted in very sparse models.
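The adaptive threshold idea described above can be illustrated with a minimal Python sketch of component-wise gradient boosting for a linear model with the Huber loss. This is not the authors' implementation: the function names, the step length `nu`, the quantile level `tau`, the fixed number of iterations `n_iter`, and the median offset are all assumptions made for illustration, and the Bisquare loss is not shown. In each iteration, the threshold `delta` is reset to the `tau` quantile of the current absolute residuals, the negative gradient of the Huber loss is computed, and only the single covariate that fits this gradient best is updated.

```python
import numpy as np


def huber_negative_gradient(residuals, delta):
    """Negative gradient of the Huber loss with respect to the fit.

    Residuals within the threshold are passed through unchanged;
    larger residuals are clipped to +/- delta.
    """
    return np.where(np.abs(residuals) <= delta,
                    residuals,
                    delta * np.sign(residuals))


def adaptive_huber_boost(X, y, n_iter=300, nu=0.1, tau=0.8):
    """Component-wise boosting with a residual-adaptive Huber threshold.

    Hypothetical helper for illustration. Assumes X has no constant columns.
    Returns (intercept, coef) on the original scale of X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    x_mean = X.mean(axis=0)
    Xc = X - x_mean                    # centred covariates
    col_ss = (Xc ** 2).sum(axis=0)     # sum of squares per column

    offset = np.median(y)              # robust starting value (assumption)
    coef = np.zeros(p)
    fit = np.full(n, offset)

    for _ in range(n_iter):
        residuals = y - fit
        # Adapt the threshold to the current residual sizes.
        delta = np.quantile(np.abs(residuals), tau)
        u = huber_negative_gradient(residuals, delta)

        # Least-squares fit of each single covariate to the gradient.
        betas = Xc.T @ u / col_ss
        rss = ((u[:, None] - Xc * betas) ** 2).sum(axis=0)
        j = int(np.argmin(rss))        # best-fitting base-learner

        # Weak update of the selected component only.
        coef[j] += nu * betas[j]
        fit += nu * betas[j] * Xc[:, j]

    intercept = offset - x_mean @ coef
    return intercept, coef
```

Because only one coefficient changes per iteration, stopping early leaves many coefficients at exactly zero, which is how boosting of this kind performs variable selection. In practice the number of iterations would be tuned, for example by cross-validation, rather than fixed as in this sketch.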