

Repeated Sieving for Prediction Model Building with High-Dimensional Data

Authors

Liu Lu, Jung Sin-Ho

Affiliation

Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27708, USA.

Publication

J Pers Med. 2024 Jul 19;14(7):769. doi: 10.3390/jpm14070769.

DOI: 10.3390/jpm14070769
PMID: 39064023
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11277592/
Abstract

Background: The prediction of patients' outcomes is a key component of personalized medicine. Oftentimes, a prediction model is developed using a large number of candidate predictors, called high-dimensional data, including genomic data, lab tests, electronic health records, etc. Variable selection, also called dimension reduction, is a critical step in developing a prediction model from high-dimensional data.

Methods: In this paper, we compare the variable selection and prediction performance of popular machine learning (ML) methods with those of our proposed method. LASSO is a popular ML method that selects variables by imposing an L1-norm penalty on the likelihood. By this approach, LASSO selects features based on the size of their regression estimates, rather than their statistical significance. As a result, LASSO can miss significant features while being known to over-select features. Elastic net (EN), another popular ML method, tends to select even more features than LASSO, since it uses a combination of L1- and L2-norm penalties that is less strict than an L1-norm penalty alone. Insignificant features included in a fitted prediction model act like white noise, so the fitted model loses prediction accuracy. Furthermore, any future use of a fitted prediction model requires collecting data on all the features included in the model, which is costly and may compromise data quality if the number of features is too large. Therefore, we propose an ML method, called repeated sieving, that extends standard regression methods with stepwise variable selection. By selecting features based on their statistical significance, it resolves the over-selection issue with high-dimensional data.

Results: Through extensive numerical studies and real data examples, our results show that the repeated sieving method selects far fewer features than LASSO and EN, yet has higher prediction accuracy than the existing ML methods.

Conclusions: We conclude that our repeated sieving method performs well in both variable selection and prediction, and it saves the cost of future investigation of the selected factors.
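The over-selection tendency of penalized methods described in the abstract is easy to reproduce. The sketch below is an illustration of the general phenomenon, not the paper's experiments: the dataset, dimensions, and hyperparameters are arbitrary choices. It fits cross-validated LASSO and elastic net to synthetic data in which only 5 of 1000 candidate predictors are truly informative, then counts how many features each model keeps.

```python
# Illustration (not the paper's experiment): LASSO and elastic net often
# retain more features than are truly informative in high dimensions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, ElasticNetCV

# 200 samples, 1000 candidate predictors, only 5 truly informative
X, y = make_regression(n_samples=200, n_features=1000,
                       n_informative=5, noise=5.0, random_state=0)

lasso = LassoCV(cv=5, random_state=0).fit(X, y)
enet = ElasticNetCV(l1_ratio=0.5, cv=5, random_state=0).fit(X, y)

n_lasso = int(np.sum(lasso.coef_ != 0))   # features kept by LASSO
n_enet = int(np.sum(enet.coef_ != 0))     # features kept by elastic net
print(f"LASSO kept {n_lasso} features; elastic net kept {n_enet}")
```

Because the penalty weight is tuned for prediction error rather than for correct support recovery, the cross-validated models typically keep noise features alongside the informative ones, which is the behavior the paper targets.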

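The abstract does not spell out the repeated sieving algorithm itself, but its building block — stepwise variable selection by statistical significance — can be sketched generically. The function below is a plain forward stepwise selection using a partial F-test, not the authors' exact algorithm; `alpha` and `max_steps` are illustrative parameters introduced here.

```python
# Generic forward stepwise selection by partial F-test p-value — a minimal
# sketch of significance-based selection, NOT the paper's exact algorithm.
import numpy as np
from scipy import stats

def forward_stepwise(X, y, alpha=0.01, max_steps=20):
    """Add one predictor at a time while the best candidate's
    partial F-test p-value stays below `alpha`."""
    n, p = X.shape
    selected = []
    Xc = np.ones((n, 1))  # intercept-only model
    beta0 = np.linalg.lstsq(Xc, y, rcond=None)[0]
    rss_cur = float(np.sum((y - Xc @ beta0) ** 2))
    for _ in range(max_steps):
        best = None
        for j in range(p):
            if j in selected:
                continue
            Xj = np.column_stack([Xc, X[:, j]])
            beta = np.linalg.lstsq(Xj, y, rcond=None)[0]
            rss_j = float(np.sum((y - Xj @ beta) ** 2))
            df2 = n - Xj.shape[1]
            # partial F statistic for adding one predictor (1 numerator df)
            f = (rss_cur - rss_j) / (rss_j / df2)
            pval = stats.f.sf(f, 1, df2)
            if best is None or pval < best[1]:
                best = (j, pval, rss_j)
        if best is None or best[1] >= alpha:
            break  # no remaining candidate is significant: stop
        j, _, rss_cur = best
        selected.append(j)
        Xc = np.column_stack([Xc, X[:, j]])
    return selected

# Tiny demo on synthetic data with two truly informative predictors
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 100))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.standard_normal(200)
print("selected:", forward_stepwise(X, y))
```

Unlike penalty-based selection, the stopping rule here is a significance threshold rather than a tuned penalty weight, which is why significance-driven schemes tend to keep fewer, individually justified features.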

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7ff9/11277592/2dcffe96c53c/jpm-14-00769-g001.jpg

Similar Articles

1. Repeated Sieving for Prediction Model Building with High-Dimensional Data.
J Pers Med. 2024 Jul 19;14(7):769. doi: 10.3390/jpm14070769.
2. Stabilizing l1-norm prediction models by supervised feature grouping.
J Biomed Inform. 2016 Feb;59:149-68. doi: 10.1016/j.jbi.2015.11.012. Epub 2015 Dec 9.
3. High-dimensional Cox models: the choice of penalty as part of the model building process.
Biom J. 2010 Feb;52(1):50-69. doi: 10.1002/bimj.200900064.
4. Stable feature selection for clinical prediction: exploiting ICD tree structure using Tree-Lasso.
J Biomed Inform. 2015 Feb;53:277-90. doi: 10.1016/j.jbi.2014.11.013. Epub 2014 Dec 9.
5. Accounting for grouped predictor variables or pathways in high-dimensional penalized Cox regression models.
BMC Bioinformatics. 2020 Jul 2;21(1):277. doi: 10.1186/s12859-020-03618-y.
6. Improving the Robustness of Variable Selection and Predictive Performance of Regularized Generalized Linear Models and Cox Proportional Hazard Models.
Mathematics (Basel). 2023 Feb;11(3). doi: 10.3390/math11030557. Epub 2023 Jan 20.
7. IPF-LASSO: Integrative L1-Penalized Regression with Penalty Factors for Prediction Based on Multi-Omics Data.
Comput Math Methods Med. 2017;2017:7691937. doi: 10.1155/2017/7691937. Epub 2017 May 4.
8. Identification of clinically relevant features in hypertensive patients using penalized regression: a case study of cardiovascular events.
Med Biol Eng Comput. 2019 Sep;57(9):2011-2026. doi: 10.1007/s11517-019-02007-9. Epub 2019 Jul 25.
9. Applications of Bayesian shrinkage prior models in clinical research with categorical responses.
BMC Med Res Methodol. 2022 Apr 28;22(1):126. doi: 10.1186/s12874-022-01560-6.
10. Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification.
BMC Bioinformatics. 2013 Jun 19;14:198. doi: 10.1186/1471-2105-14-198.

References Cited in This Article

1. VSOLassoBag: a variable-selection oriented LASSO bagging algorithm for biomarker discovery in omic-based translational research.
J Genet Genomics. 2023 Mar;50(3):151-162. doi: 10.1016/j.jgg.2022.12.005. Epub 2023 Jan 3.
2. Comparing Machine Learning to Regression Methods for Mortality Prediction Using Veterans Affairs Electronic Health Record Clinical Data.
Med Care. 2022 Jun 1;60(6):470-479. doi: 10.1097/MLR.0000000000001720. Epub 2022 Mar 30.
3. Comparison of machine learning and logistic regression models in predicting acute kidney injury: A systematic review and meta-analysis.
Int J Med Inform. 2021 Jul;151:104484. doi: 10.1016/j.ijmedinf.2021.104484. Epub 2021 May 8.
4. Use of Machine Learning Models to Predict Death After Acute Myocardial Infarction.
JAMA Cardiol. 2021 Jun 1;6(6):633-641. doi: 10.1001/jamacardio.2021.0122.
5. Incremental Benefits of Machine Learning-When Do We Need a Better Mousetrap?
JAMA Cardiol. 2021 Jun 1;6(6):621-623. doi: 10.1001/jamacardio.2021.0139.
6. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models.
J Clin Epidemiol. 2019 Jun;110:12-22. doi: 10.1016/j.jclinepi.2019.02.004. Epub 2019 Feb 11.
7. Comparison of logistic regression with machine learning methods for the prediction of fetal growth abnormalities: a retrospective cohort study.
BMC Pregnancy Childbirth. 2018 Aug 15;18(1):333. doi: 10.1186/s12884-018-1971-2.
8. Logistic Regression: Relating Patient Characteristics to Outcomes.
JAMA. 2016 Aug 2;316(5):533-4. doi: 10.1001/jama.2016.7653.
9. Nanostring-based multigene assay to predict recurrence for gastric cancer patients after surgery.
PLoS One. 2014 Mar 5;9(3):e90133. doi: 10.1371/journal.pone.0090133. eCollection 2014.
10. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer.
Lancet. 2005;365(9460):671-9. doi: 10.1016/S0140-6736(05)17947-1.