Noncommunicable Diseases Research Center, Fasa University of Medical Sciences, Fasa, Iran.
Student of Biostatistics, Department of Biostatistics and Epidemiology, School of Public Health, Kerman University of Medical Sciences, Kerman, Iran.
BMC Med Res Methodol. 2024 Sep 27;24(1):220. doi: 10.1186/s12874-024-02341-z.
Imbalanced datasets pose significant challenges in predictive modeling, leading to biased outcomes and reduced model reliability. This study addresses data imbalance in diabetes prediction using machine learning techniques. Utilizing data from the Fasa Adult Cohort Study (FACS) with a 5-year follow-up of 10,000 participants, we developed predictive models for Type 2 diabetes.
We employed various data-level and algorithm-level interventions, including SMOTE, ADASYN, SMOTEENN, Random Over Sampling, and KMeansSMOTE, paired with Random Forest, Gradient Boosting, Decision Tree, and Multi-Layer Perceptron (MLP) classifiers. We evaluated model performance using F1 score, AUC, and G-means, metrics chosen to provide a comprehensive assessment of model accuracy, discrimination ability, and overall balance in performance, particularly in the context of imbalanced datasets.
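As a rough illustration of two of the ideas named above, the following sketch shows random oversampling of the minority class and the computation of F1 and G-mean from binary predictions. It is a minimal, standard-library-only example and not the study's pipeline: the function names `random_oversample` and `f1_and_gmean` and the toy labels are hypothetical. G-mean is taken here as the square root of sensitivity times specificity, the usual definition for binary classification.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def f1_and_gmean(y_true, y_pred, positive=1):
    """Return (F1, G-mean) for binary labels (hypothetical helper).

    F1 is the harmonic mean of precision and recall.
    G-mean is sqrt(sensitivity * specificity).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Toy data, made up for illustration only (not from FACS).
X_toy = [[0.1], [0.2], [0.3], [0.9]]
y_toy = [0, 0, 0, 1]
X_bal, y_bal = random_oversample(X_toy, y_toy)

y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
f1, gmean = f1_and_gmean(y_true, y_pred)
```

In practice, resampling is normally applied only to the training portion of each split, so that evaluation metrics are computed on data with the original class distribution.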
Our study identified key factors influencing diabetes risk and evaluated the performance of various machine learning models. Feature importance analysis revealed that the most influential predictors of diabetes differ between males and females. For females, the most important factors are triglycerides (TG), basal metabolic rate (BMR), and total cholesterol (CHOL), whereas for males, the key predictors are body mass index (BMI), serum glutamic oxaloacetic transaminase (SGOT), and gamma-glutamyl transferase (GGT). Across the entire dataset, BMI remains the most important variable, followed by SGOT, BMR, and energy intake. These insights suggest that gender-specific risk profiles should be considered in diabetes prevention and management strategies. In terms of model performance, ADASYN with the MLP classifier achieved an F1 score of 82.17 ± 3.38, AUC of 89.61 ± 2.09, and G-means of 89.15 ± 2.31. SMOTE with MLP followed closely with an F1 score of 79.85 ± 3.91, AUC of 89.70 ± 2.54, and G-means of 89.31 ± 2.78. The SMOTEENN with Random Forest combination achieved an F1 score of 78.27 ± 1.54, AUC of 87.18 ± 1.12, and G-means of 86.47 ± 1.28.
These combinations effectively address class imbalance, improving the accuracy and reliability of diabetes predictions. The findings highlight the importance of using appropriate data-balancing techniques in medical data analysis.