Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China.
Shanxi Centre for Disease Control and Prevention, Taiyuan, 030012, Shanxi, China.
BMC Med Inform Decis Mak. 2021 Mar 20;21(1):105. doi: 10.1186/s12911-021-01471-4.
Diabetes mellitus (DM) has become the third major chronic non-communicable disease threatening patients' health, after tumors and cardiovascular and cerebrovascular diseases, and is one of the major public health problems worldwide. It is therefore important to identify individuals at high risk of DM in order to establish prevention strategies.
Medical data often have a high-dimensional feature space, high feature redundancy, and imbalanced classes. To address these problems, this study explored different supervised classifiers, combined with SVM-SMOTE and two feature dimensionality reduction methods (logistic stepwise regression and LASSO), to classify diabetes survey sample data with imbalanced categories and complex related factors. The classification results of 4 supervised classifiers under 4 data processing methods were analyzed and discussed. Five indicators (Accuracy, Precision, Recall, F1-Score and AUC) were selected as the key measures of classification model performance.
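The five indicators named above have standard definitions for a binary classifier, and the following is a minimal pure-Python sketch of them. The function names, the toy labels and scores, and the 0.5 decision threshold are illustrative assumptions and are not taken from the study; AUC is computed here by the pairwise-ranking definition (the share of positive/negative pairs in which the positive case receives the higher score, ties counted as half).

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, Precision, Recall and F1-Score for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up data (not from the study): 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, scores)
print(metrics)
```

Accuracy alone can look favourable on imbalanced data, which is presumably why it is reported alongside Precision, Recall, F1-Score and the threshold-independent AUC.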
According to the results, the Random Forest classifier combined with SVM-SMOTE resampling and LASSO feature screening (Accuracy = 0.890, Precision = 0.869, Recall = 0.919, F1-Score = 0.893, AUC = 0.948) proved the best way to identify those at high risk of DM. The combined algorithm also improved classification performance in predicting people at high risk of DM. In addition, age, region, heart rate, hypertension, hyperlipidemia and BMI were the six most important feature variables affecting diabetes.
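As a quick arithmetic check, the reported F1-Score can be recomputed from the reported Precision and Recall, since F1 is their harmonic mean. The short sketch below uses only the figures quoted above; the function name is an illustrative assumption.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values as reported for Random Forest + SVM-SMOTE + LASSO.
reported_precision = 0.869
reported_recall = 0.919
reported_f1 = 0.893

recomputed_f1 = f1_from(reported_precision, reported_recall)
print(round(recomputed_f1, 3))  # agrees with the reported 0.893 to three decimals
```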
The Random Forest classifier combined with SVM-SMOTE and the LASSO feature reduction method performed best in identifying individuals at high risk of DM. The combined method proposed in the study would be a good tool for early screening of DM.