Predicting diabetes in adults: identifying important features in unbalanced data over a 5-year cohort study using machine learning algorithm.

Affiliations

Noncommunicable Diseases Research Center, Fasa University of Medical Sciences, Fasa, Iran.

Student of Biostatistics, Department of Biostatistics and Epidemiology, School of Public Health, Kerman University of Medical Sciences, Kerman, Iran.

Publication information

BMC Med Res Methodol. 2024 Sep 27;24(1):220. doi: 10.1186/s12874-024-02341-z.

DOI:10.1186/s12874-024-02341-z
PMID:39333899
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11430121/
Abstract

BACKGROUND

Imbalanced datasets pose significant challenges in predictive modeling, leading to biased outcomes and reduced model reliability. This study addresses data imbalance in diabetes prediction using machine learning techniques. Utilizing data from the Fasa Adult Cohort Study (FACS) with a 5-year follow-up of 10,000 participants, we developed predictive models for Type 2 diabetes.

METHODS

We employed various data-level and algorithm-level interventions, including SMOTE, ADASYN, SMOTEENN, Random Over Sampling, and KMeansSMOTE, paired with Random Forest, Gradient Boosting, Decision Tree, and Multi-Layer Perceptron (MLP) classifiers. We evaluated model performance using F1 score, AUC, and G-mean. These metrics were chosen to provide a comprehensive assessment of model accuracy, discrimination ability, and overall balance in performance, particularly in the context of imbalanced datasets.
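To make the data-level idea concrete, the following is a minimal sketch of SMOTE-style oversampling: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. It is a simplified illustration, not the authors' implementation. The function name `smote_like_oversample`, its parameters, and the toy feature values are assumptions introduced here; a real analysis would more likely use a maintained library such as imbalanced-learn.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Simplified SMOTE-style sketch (hypothetical helper, not from the paper).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data: hypothetical (BMI, triglyceride) pairs, not values from the study.
minority = [(31.0, 210.0), (29.5, 190.0), (33.2, 240.0), (30.1, 205.0)]
new_points = smote_like_oversample(minority, n_new=6)
```

Because every synthetic point is a convex combination of two existing minority samples, it stays inside the range already covered by the minority class. Resampling of this kind is normally applied to the training folds only, so that evaluation data keeps the original class ratio.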

RESULTS

Our study uncovered key factors influencing diabetes risk and evaluated the performance of various machine learning models. Feature importance analysis revealed that the most influential predictors of diabetes differ between males and females. For females, the most important factors are triglyceride (TG), basal metabolic rate (BMR), and total cholesterol (CHOL), whereas for males, the key predictors are body mass index (BMI), serum glutamic-oxaloacetic transaminase (SGOT), and gamma-glutamyl transferase (GGT). Across the entire dataset, BMI remains the most important variable, followed by SGOT, BMR, and energy intake. These insights suggest that gender-specific risk profiles should be considered in diabetes prevention and management strategies. In terms of model performance, ADASYN with the MLP classifier achieved an F1 score of 82.17 ± 3.38, AUC of 89.61 ± 2.09, and G-mean of 89.15 ± 2.31. SMOTE with MLP followed closely with an F1 score of 79.85 ± 3.91, AUC of 89.70 ± 2.54, and G-mean of 89.31 ± 2.78. The SMOTEENN with Random Forest combination achieved an F1 score of 78.27 ± 1.54, AUC of 87.18 ± 1.12, and G-mean of 86.47 ± 1.28.
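For readers unfamiliar with the three reported metrics, the sketch below shows one common way to compute them for a binary classifier: F1 from the confusion counts, G-mean as the geometric mean of sensitivity and specificity, and AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. All function names, labels, scores, and fold values are illustrative assumptions, not data from the study, and the abstract does not state how its ± values were obtained; the sample standard deviation across folds is used here only as an example.

```python
import math
from statistics import mean, stdev


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def auc(y_true, scores):
    """Pairwise (Mann-Whitney) AUC; tied scores count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical predictions on an imbalanced toy set (3 positives, 5 negatives).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.35, 0.7]

f1 = f1_score(y_true, y_pred)
gm = g_mean(y_true, y_pred)
roc_auc = auc(y_true, scores)

# Hypothetical per-fold F1 values, summarised as mean and sample SD.
fold_f1 = [0.80, 0.84, 0.82]
f1_mean, f1_sd = mean(fold_f1), stdev(fold_f1)
```

G-mean drops to zero whenever either class is entirely misclassified, which is why it is often preferred over plain accuracy when one class is rare.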

CONCLUSION

These combinations effectively address class imbalance, improving the accuracy and reliability of diabetes predictions. The findings highlight the importance of using appropriate data-balancing techniques in medical data analysis.
