

Repeated Sieving for Prediction Model Building with High-Dimensional Data

Authors

Liu Lu, Jung Sin-Ho

Affiliation

Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27708, USA.

Publication

J Pers Med. 2024 Jul 19;14(7):769. doi: 10.3390/jpm14070769.

DOI: 10.3390/jpm14070769
PMID: 39064023
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11277592/
Abstract

Background: The prediction of patients' outcomes is a key component of personalized medicine. Oftentimes, a prediction model is developed using a large number of candidate predictors, called high-dimensional data, including genomic data, lab tests, electronic health records, etc. Variable selection, also called dimension reduction, is a critical step in developing a prediction model from high-dimensional data.

Methods: In this paper, we compare the variable selection and prediction performance of popular machine learning (ML) methods with those of our proposed method. LASSO is a popular ML method that selects variables by imposing an L1-norm penalty on the likelihood. By this approach, LASSO selects features based on the size of their regression estimates, rather than their statistical significance. As a result, LASSO can miss significant features while being known to over-select features. Elastic net (EN), another popular ML method, tends to select even more features than LASSO, since it uses a combination of L1- and L2-norm penalties that is less strict than an L1-norm penalty alone. Insignificant features included in a fitted prediction model act like white noise, so the fitted model loses prediction accuracy. Furthermore, any future use of a fitted prediction model requires collecting data on all the features included in the model, which is costly and may compromise data quality if the number of features is too large. Therefore, we propose an ML method, called repeated sieving, that extends standard regression methods with stepwise variable selection. By selecting features based on their statistical significance, it resolves the over-selection issue with high-dimensional data.

Results: Through extensive numerical studies and real data examples, our results show that the repeated sieving method selects far fewer features than LASSO and EN, yet has higher prediction accuracy than the existing ML methods.

Conclusions: We conclude that our repeated sieving method performs well in both variable selection and prediction, and it saves the cost of future investigation of the selected factors.
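The over-selection tendency of penalized methods described in the abstract is easy to reproduce. The sketch below is an illustration of the general phenomenon, not the paper's experiments: the dataset, dimensions, and hyperparameters are arbitrary choices. It fits cross-validated LASSO and elastic net to synthetic data in which only 5 of 1000 candidate predictors are truly informative, then counts how many features each model keeps.

```python
# Illustration (not the paper's experiment): LASSO and elastic net often
# retain more features than are truly informative in high dimensions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, ElasticNetCV

# 200 samples, 1000 candidate predictors, only 5 truly informative
X, y = make_regression(n_samples=200, n_features=1000,
                       n_informative=5, noise=5.0, random_state=0)

lasso = LassoCV(cv=5, random_state=0).fit(X, y)
enet = ElasticNetCV(l1_ratio=0.5, cv=5, random_state=0).fit(X, y)

n_lasso = int(np.sum(lasso.coef_ != 0))   # features kept by LASSO
n_enet = int(np.sum(enet.coef_ != 0))     # features kept by elastic net
print(f"LASSO kept {n_lasso} features; elastic net kept {n_enet}")
```

Because the penalty weight is tuned for prediction error rather than for correct support recovery, the cross-validated models typically keep noise features alongside the informative ones, which is the behavior the paper targets.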

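The abstract does not spell out the repeated sieving algorithm itself, but its building block — stepwise variable selection by statistical significance — can be sketched generically. The function below is a plain forward stepwise selection using a partial F-test, not the authors' exact algorithm; `alpha` and `max_steps` are illustrative parameters introduced here.

```python
# Generic forward stepwise selection by partial F-test p-value — a minimal
# sketch of significance-based selection, NOT the paper's exact algorithm.
import numpy as np
from scipy import stats

def forward_stepwise(X, y, alpha=0.01, max_steps=20):
    """Add one predictor at a time while the best candidate's
    partial F-test p-value stays below `alpha`."""
    n, p = X.shape
    selected = []
    Xc = np.ones((n, 1))  # intercept-only model
    beta0 = np.linalg.lstsq(Xc, y, rcond=None)[0]
    rss_cur = float(np.sum((y - Xc @ beta0) ** 2))
    for _ in range(max_steps):
        best = None
        for j in range(p):
            if j in selected:
                continue
            Xj = np.column_stack([Xc, X[:, j]])
            beta = np.linalg.lstsq(Xj, y, rcond=None)[0]
            rss_j = float(np.sum((y - Xj @ beta) ** 2))
            df2 = n - Xj.shape[1]
            # partial F statistic for adding one predictor (1 numerator df)
            f = (rss_cur - rss_j) / (rss_j / df2)
            pval = stats.f.sf(f, 1, df2)
            if best is None or pval < best[1]:
                best = (j, pval, rss_j)
        if best is None or best[1] >= alpha:
            break  # no remaining candidate is significant: stop
        j, _, rss_cur = best
        selected.append(j)
        Xc = np.column_stack([Xc, X[:, j]])
    return selected

# Tiny demo on synthetic data with two truly informative predictors
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 100))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.standard_normal(200)
print("selected:", forward_stepwise(X, y))
```

Unlike penalty-based selection, the stopping rule here is a significance threshold rather than a tuned penalty weight, which is why significance-driven schemes tend to keep fewer, individually justified features.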

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7ff9/11277592/2dcffe96c53c/jpm-14-00769-g001.jpg

Similar Articles

1. Repeated Sieving for Prediction Model Building with High-Dimensional Data.
J Pers Med. 2024 Jul 19;14(7):769. doi: 10.3390/jpm14070769.
2. Stabilizing l1-norm prediction models by supervised feature grouping.
J Biomed Inform. 2016 Feb;59:149-68. doi: 10.1016/j.jbi.2015.11.012. Epub 2015 Dec 9.
3. High-dimensional Cox models: the choice of penalty as part of the model building process.
Biom J. 2010 Feb;52(1):50-69. doi: 10.1002/bimj.200900064.
4. Stable feature selection for clinical prediction: exploiting ICD tree structure using Tree-Lasso.
J Biomed Inform. 2015 Feb;53:277-90. doi: 10.1016/j.jbi.2014.11.013. Epub 2014 Dec 9.
5. Accounting for grouped predictor variables or pathways in high-dimensional penalized Cox regression models.
BMC Bioinformatics. 2020 Jul 2;21(1):277. doi: 10.1186/s12859-020-03618-y.
6. Improving the Robustness of Variable Selection and Predictive Performance of Regularized Generalized Linear Models and Cox Proportional Hazard Models.
Mathematics (Basel). 2023 Feb;11(3). doi: 10.3390/math11030557. Epub 2023 Jan 20.
7. IPF-LASSO: Integrative L1-Penalized Regression with Penalty Factors for Prediction Based on Multi-Omics Data.
Comput Math Methods Med. 2017;2017:7691937. doi: 10.1155/2017/7691937. Epub 2017 May 4.
8. Identification of clinically relevant features in hypertensive patients using penalized regression: a case study of cardiovascular events.
Med Biol Eng Comput. 2019 Sep;57(9):2011-2026. doi: 10.1007/s11517-019-02007-9. Epub 2019 Jul 25.
9. Applications of Bayesian shrinkage prior models in clinical research with categorical responses.
BMC Med Res Methodol. 2022 Apr 28;22(1):126. doi: 10.1186/s12874-022-01560-6.
10. Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification.
BMC Bioinformatics. 2013 Jun 19;14:198. doi: 10.1186/1471-2105-14-198.

References Cited in This Article

1. VSOLassoBag: a variable-selection oriented LASSO bagging algorithm for biomarker discovery in omic-based translational research.
J Genet Genomics. 2023 Mar;50(3):151-162. doi: 10.1016/j.jgg.2022.12.005. Epub 2023 Jan 3.
2. Comparing Machine Learning to Regression Methods for Mortality Prediction Using Veterans Affairs Electronic Health Record Clinical Data.
Med Care. 2022 Jun 1;60(6):470-479. doi: 10.1097/MLR.0000000000001720. Epub 2022 Mar 30.
3. Comparison of machine learning and logistic regression models in predicting acute kidney injury: A systematic review and meta-analysis.
Int J Med Inform. 2021 Jul;151:104484. doi: 10.1016/j.ijmedinf.2021.104484. Epub 2021 May 8.
4. Use of Machine Learning Models to Predict Death After Acute Myocardial Infarction.
JAMA Cardiol. 2021 Jun 1;6(6):633-641. doi: 10.1001/jamacardio.2021.0122.
5. Incremental Benefits of Machine Learning-When Do We Need a Better Mousetrap?
JAMA Cardiol. 2021 Jun 1;6(6):621-623. doi: 10.1001/jamacardio.2021.0139.
6. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models.
J Clin Epidemiol. 2019 Jun;110:12-22. doi: 10.1016/j.jclinepi.2019.02.004. Epub 2019 Feb 11.
7. Comparison of logistic regression with machine learning methods for the prediction of fetal growth abnormalities: a retrospective cohort study.
BMC Pregnancy Childbirth. 2018 Aug 15;18(1):333. doi: 10.1186/s12884-018-1971-2.
8. Logistic Regression: Relating Patient Characteristics to Outcomes.
JAMA. 2016 Aug 2;316(5):533-4. doi: 10.1001/jama.2016.7653.
9. Nanostring-based multigene assay to predict recurrence for gastric cancer patients after surgery.
PLoS One. 2014 Mar 5;9(3):e90133. doi: 10.1371/journal.pone.0090133. eCollection 2014.
10. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer.
Lancet. 2005;365(9460):671-9. doi: 10.1016/S0140-6736(05)17947-1.