Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Velenjak, 1985717413 Tehran, Iran.
Endocrine Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Velenjak, 1985717413 Tehran, Iran.
J Clin Epidemiol. 2016 Mar;71:76-85. doi: 10.1016/j.jclinepi.2015.10.002. Epub 2015 Oct 22.
Identifying an appropriate set of predictors for the outcome of interest is a major challenge in clinical prediction research. This study demonstrates how several variable selection methods commonly used in data mining can be applied in an epidemiological study, and presents a systematic approach for doing so.
The P-value-based method commonly used in epidemiological studies, together with several filter and wrapper methods, was implemented to select predictors of diabetes from 55 candidate variables in 803 prediabetic women aged ≥ 20 years who were followed for 10-12 years. To develop a logistic model, variables were selected on a training data set and evaluated on a test data set. The Akaike information criterion (AIC) and the area under the curve (AUC) were used as performance criteria. A full model containing all 55 variables was also fitted for comparison.
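The two performance criteria named above can be illustrated with a minimal sketch in plain Python. This is not the authors' R program: the function names `bernoulli_log_likelihood`, `aic`, and `auc` are hypothetical, and the sketch assumes a binary outcome coded 0/1 and predicted probabilities that come from an already fitted logistic model.

```python
import math

def bernoulli_log_likelihood(y_true, p_pred):
    """Log-likelihood of 0/1 outcomes under predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(y_true, p_pred)
    )

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen case scores higher
    than a randomly chosen non-case, counting ties as one half.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy data (invented for illustration, not from the study)
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
model_auc = auc(y, p)
model_aic = aic(bernoulli_log_likelihood(y, p), n_params=2)
```

AIC penalizes each additional parameter, so a model with many weak predictors can fit the training data more closely and still score worse than a smaller model, while AUC measures discrimination only.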
We found that the worst and the best models were the full model and models based on the wrappers, respectively. Among filter methods, symmetrical uncertainty gave both the best AUC and AIC.
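Symmetrical uncertainty, the best-performing filter criterion here, is commonly defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), where H is Shannon entropy and I is mutual information. The following is a minimal sketch of that standard definition in plain Python, not the authors' R implementation. It assumes both variables are already discrete (continuous predictors would need binning first), and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), ranging from 0 to 1.

    Assumes x and y are equal-length sequences of discrete values.
    """
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        # Both variables are constant, so there is nothing to share.
        return 0.0
    h_xy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy    # I(X; Y)
    return 2.0 * mutual_info / (h_x + h_y)

# Toy example (invented data): rank predictors by SU with the outcome
outcome = [0, 0, 1, 1]
predictors = {
    "identical": [0, 0, 1, 1],    # carries all information about outcome
    "independent": [0, 1, 0, 1],  # carries none
}
ranking = sorted(
    predictors,
    key=lambda name: symmetrical_uncertainty(predictors[name], outcome),
    reverse=True,
)
```

A filter method of this kind scores each predictor against the outcome independently of any model, which makes it cheaper than a wrapper method that refits the model for each candidate subset.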
Our experiment showed that the variable selection methods used in data mining could improve the performance of clinical prediction models. An R program was developed to make these methods easier to apply and to visualize the results.