Alghamdi Manal, Al-Mallah Mouaz, Keteyian Steven, Brawner Clinton, Ehrman Jonathan, Sakr Sherif
King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia.
King Abdullah International Medical Research Center, Riyadh, Saudi Arabia.
PLoS One. 2017 Jul 24;12(7):e0179805. doi: 10.1371/journal.pone.0179805. eCollection 2017.
Machine learning is becoming a popular and important approach in medical research. In this study, we investigate the relative performance of several machine learning methods (Decision Tree, Naïve Bayes, Logistic Regression, Logistic Model Tree, and Random Forests) for predicting incident diabetes using medical records of cardiorespiratory fitness. In addition, we apply different techniques to uncover potential predictors of diabetes. This FIT project study used data from 32,555 patients who were free of any known coronary artery disease or heart failure, who underwent clinician-referred exercise treadmill stress testing at Henry Ford Health Systems between 1991 and 2009, and who had a complete 5-year follow-up. By the end of the fifth year, 5,099 of those patients had developed diabetes. The dataset contained 62 attributes classified into four categories: demographic characteristics, disease history, medication use history, and stress test vital signs. We developed an ensemble-based predictive model using 13 attributes selected on the basis of clinical importance, Multiple Linear Regression, and Information Gain Ranking. The negative effect of class imbalance on the constructed model was handled with the Synthetic Minority Oversampling Technique (SMOTE). The overall performance of the predictive classifier was improved by an ensemble approach using the Vote method with three decision-tree-based classifiers (Naïve Bayes Tree, Random Forest, and Logistic Model Tree), which achieved high predictive performance (AUC = 0.92). The study shows the potential of ensembling and SMOTE for predicting incident diabetes using cardiorespiratory fitness data.
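The two techniques named in the abstract, SMOTE and voting, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `smote` and `soft_vote`, the parameters `k` and `seed`, the toy feature vectors, and the assumption that the vote averages the classifiers' predicted probabilities are all illustrative choices not stated in the source.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between a minority-class
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def soft_vote(probabilities):
    """Average the positive-class probabilities from several classifiers
    (assumed combination rule, for illustration only)."""
    return sum(probabilities) / len(probabilities)


# Toy minority-class samples (hypothetical two-feature records)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority, n_synthetic=6, k=2)

# Hypothetical probabilities from three classifiers for one patient
combined = soft_vote([0.9, 0.6, 0.3])
```

Because each synthetic point lies on a line segment between two real minority samples, oversampling fills in the minority region of feature space rather than duplicating existing records.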