A tutorial on variable selection for clinical prediction models: feature selection methods in data mining could improve the results.

Affiliations

Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Velenjak, 1985717413 Tehran, Iran.

Endocrine Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Velenjak, 1985717413 Tehran, Iran.

Publication information

J Clin Epidemiol. 2016 Mar;71:76-85. doi: 10.1016/j.jclinepi.2015.10.002. Epub 2015 Oct 22.

DOI:10.1016/j.jclinepi.2015.10.002

PMID:26475568

Abstract

OBJECTIVES

Identifying an appropriate set of predictors for the outcome of interest is a major challenge in clinical prediction research. The aim of this study was to show the application of some variable selection methods, usually used in data mining, for an epidemiological study. We introduce here a systematic approach.

STUDY DESIGN AND SETTING

The P-value-based method, usually used in epidemiological studies, and several filter and wrapper methods were implemented to select the predictors of diabetes among 55 variables in 803 prediabetic females, aged ≥ 20 years, followed for 10-12 years. To develop a logistic model, variables were selected on a training data set and evaluated on a test data set. The Akaike information criterion (AIC) and the area under the curve (AUC) were used as performance criteria. We also implemented a full model with all 55 variables.
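For illustration only, the two performance criteria named above can be computed from a model's predicted probabilities. The following is a minimal Python sketch, not the authors' R program: the function names (`log_likelihood`, `aic`, `auc`) and the toy data are hypothetical, AIC is taken in its standard form 2k - 2 ln L (k being the number of estimated parameters), and AUC is computed as the probability that a randomly chosen case with the outcome receives a higher score than a randomly chosen case without it.

```python
import math


def log_likelihood(y_true, p_pred, eps=1e-12):
    """Bernoulli log-likelihood of binary outcomes under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def aic(y_true, p_pred, n_params):
    """Akaike information criterion, 2k - 2 ln L; lower values are better."""
    return 2 * n_params - 2 * log_likelihood(y_true, p_pred)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count as 1/2."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): four subjects, outcome 1 = developed diabetes.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(auc(y, p))        # 0.75
print(aic(y, p, 2))     # AIC for a model with two estimated parameters
```

The pairwise AUC is quadratic in the number of subjects, which is acceptable for a cohort of a few hundred; a rank-based formula would be the usual choice for larger data sets.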

RESULTS

We found that the worst and the best models were the full model and models based on the wrappers, respectively. Among filter methods, symmetrical uncertainty gave both the best AUC and AIC.
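As background on the filter named above: symmetrical uncertainty is commonly defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), where H is Shannon entropy and I is mutual information, so it ranges from 0 (no shared information) to 1 (each variable fully determines the other). The sketch below is a minimal Python illustration under that definition, not the implementation used in the study; the function names and toy vectors are hypothetical, and it assumes both variables are categorical (continuous predictors would need to be discretized first).

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value between 0 and 1."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant, so there is nothing to share
    h_xy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy    # I(X; Y)
    return 2.0 * mutual_info / (h_x + h_y)


# Toy data (hypothetical): a predictor identical to the outcome, and one unrelated to it.
outcome = [0, 0, 1, 1]
print(symmetrical_uncertainty([0, 0, 1, 1], outcome))  # 1.0
print(symmetrical_uncertainty([0, 1, 0, 1], outcome))  # 0.0
```

A filter method would score every candidate predictor against the outcome this way and keep the highest-ranked ones before fitting the logistic model.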

CONCLUSION

Our experiment showed that the variable selection methods used in data mining could improve the performance of clinical prediction models. An R program was developed to make these methods more feasible and visualize the results.

Similar articles

A tutorial on variable selection for clinical prediction models: feature selection methods in data mining could improve the results.

J Clin Epidemiol. 2016 Mar;71:76-85. doi: 10.1016/j.jclinepi.2015.10.002. Epub 2015 Oct 22.

Predicting the graft survival for heart-lung transplantation patients: an integrated data mining methodology.

Int J Med Inform. 2009 Dec;78(12):e84-96. doi: 10.1016/j.ijmedinf.2009.04.007. Epub 2009 Jun 3.

Prediction of thoracic injury severity in frontal impacts by selected anatomical morphomic variables through model-averaged logistic regression approach.

Accid Anal Prev. 2013 Nov;60:172-80. doi: 10.1016/j.aap.2013.08.020. Epub 2013 Sep 5.

Seminal quality prediction using data mining methods.

Technol Health Care. 2014;22(4):531-45. doi: 10.3233/THC-140816.

An empirical approach to model selection through validation for censored survival data.

J Biomed Inform. 2011 Aug;44(4):595-606. doi: 10.1016/j.jbi.2011.02.005. Epub 2011 Feb 16.

Selecting the embryo with the highest implantation potential using a data mining based prediction model.

Reprod Biol Endocrinol. 2016 Mar 3;14:10. doi: 10.1186/s12958-016-0145-1.

The cross-validated AUC for MCP-logistic regression with high-dimensional data.

Stat Methods Med Res. 2013 Oct;22(5):505-18. doi: 10.1177/0962280211428385. Epub 2011 Nov 28.

A novel feature selection approach for biomedical data classification.

J Biomed Inform. 2010 Feb;43(1):15-23. doi: 10.1016/j.jbi.2009.07.008. Epub 2009 Jul 30.

A filter feature selection method based on the Maximal Information Coefficient and Gram-Schmidt Orthogonalization for biomedical data mining.

Comput Biol Med. 2017 Oct 1;89:264-274. doi: 10.1016/j.compbiomed.2017.08.021. Epub 2017 Aug 24.

Comparison of three data mining models for predicting diabetes or prediabetes by risk factors.

Kaohsiung J Med Sci. 2013 Feb;29(2):93-9. doi: 10.1016/j.kjms.2012.08.016. Epub 2012 Oct 16.

Cited by

Variable selection methods for descriptive modeling.

PLoS One. 2025 Jun 2;20(6):e0321601. doi: 10.1371/journal.pone.0321601. eCollection 2025.

How digital therapeutic alliances influence the perceived helpfulness of online mental health Q&A: An explainable machine learning approach.

Digit Health. 2025 May 5;11:20552076251333480. doi: 10.1177/20552076251333480. eCollection 2025 Jan-Dec.

The application of artificial intelligence in upper gastrointestinal cancers.

J Natl Cancer Cent. 2024 Dec 27;5(2):113-131. doi: 10.1016/j.jncc.2024.12.006. eCollection 2025 Apr.

Machine learning and artificial intelligence in type 2 diabetes prediction: a comprehensive 33-year bibliometric and literature analysis.

Front Digit Health. 2025 Mar 27;7:1557467. doi: 10.3389/fdgth.2025.1557467. eCollection 2025.

Developing a Machine Learning Model for Predicting 30-Day Major Adverse Cardiac and Cerebrovascular Events in Patients Undergoing Noncardiac Surgery: Retrospective Study.

J Med Internet Res. 2025 Apr 9;27:e66366. doi: 10.2196/66366.

Radiomics in breast cancer: Current advances and future directions.

Cell Rep Med. 2024 Sep 17;5(9):101719. doi: 10.1016/j.xcrm.2024.101719.

Development of a diagnostic predictive model for determining child stunting in Malawi: a comparative analysis of variable selection approaches.

BMC Med Res Methodol. 2024 Aug 8;24(1):175. doi: 10.1186/s12874-024-02283-6.

A Method to Explore the Best Mixed-Effects Model in a Data-Driven Manner with Multiprocessing: Applications in Public Health Research.

Eur J Investig Health Psychol Educ. 2024 May 10;14(5):1338-1350. doi: 10.3390/ejihpe14050088.

The Applications of Artificial Intelligence in Digestive System Neoplasms: A Review.

Health Data Sci. 2023 Feb 6;3:0005. doi: 10.34133/hds.0005. eCollection 2023.

Time-dependent systolic blood pressure within 72 h after endovascular treatment in large vessel occlusion stroke.

Brain Behav. 2024 Mar;14(3):e3442. doi: 10.1002/brb3.3442.
