Department of Information and Communications Engineering, Myongji University, 116 Myongji-ro, Yongin, Gyeonggi 17058, Korea.
Int J Environ Res Public Health. 2021 Mar 23;18(6):3317. doi: 10.3390/ijerph18063317.
Predicting the occurrence of type 2 diabetes (T2D) allows a person at risk to take actions that can prevent the onset or delay the progression of the disease. In this study, we developed a machine learning (ML) model to predict T2D occurrence in the following year (Y + 1) using variables from the current year (Y). The dataset for this study was collected at a private medical institute as electronic health records from 2013 to 2018. To construct the prediction model, key features were first selected using ANOVA tests, chi-squared tests, and recursive feature elimination methods. The resultant features were fasting plasma glucose (FPG), HbA1c, triglycerides, BMI, gamma-GTP, age, uric acid, sex, smoking, drinking, physical activity, and family history. We then employed logistic regression, random forest, support vector machine, XGBoost, and ensemble machine learning algorithms based on these variables to predict the outcome as normal (non-diabetic), prediabetes, or diabetes. The cross-validation (CV) results showed that the ensemble models outperformed the single models, and CV performance improved when more medical history from the dataset was incorporated. Based on the experimental results, the prediction model performed reasonably well at forecasting the occurrence of T2D in the Korean population, and it can provide clinicians and patients with valuable predictive information on the likelihood of developing T2D.
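The workflow described in the abstract (filter features, refine them with recursive feature elimination, then compare single classifiers against an ensemble under cross-validation) can be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the authors' implementation. The data are synthetic, generated with `make_classification`, and the feature counts, hyperparameters, and the choice of soft voting are assumptions, since the abstract does not specify them. XGBoost is omitted to keep the sketch dependent on scikit-learn only, and the chi-squared test is left out because the synthetic features are all continuous; in a real dataset it would apply to categorical variables such as sex or smoking status.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the health-record data (NOT the study's dataset):
# 600 samples, 20 numeric features, 3 classes
# (0 = normal, 1 = prediabetes, 2 = diabetes).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)


def build_pipeline(classifier):
    """Scale, filter with the ANOVA F-test, refine with RFE, then classify."""
    return make_pipeline(
        StandardScaler(),
        SelectKBest(f_classif, k=15),  # ANOVA filter (k is an assumed value)
        RFE(LogisticRegression(max_iter=1000), n_features_to_select=12),
        classifier,
    )


single_models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm": SVC(probability=True, random_state=0),
}

# Soft voting averages the predicted class probabilities of the single models.
ensemble = VotingClassifier(estimators=list(single_models.items()), voting="soft")

# Feature selection lives inside the pipeline, so it is refit on each
# training fold and never sees the held-out fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_accuracy = {}
for name, model in {**single_models, "soft_voting_ensemble": ensemble}.items():
    scores = cross_val_score(build_pipeline(model), X, y, cv=cv, scoring="accuracy")
    cv_accuracy[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {cv_accuracy[name]:.3f}")
```

Placing the selection steps inside the pipeline is a deliberate choice: selecting features on the full dataset before cross-validation would leak information from the validation folds and inflate the scores.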