Diabetes mellitus risk prediction in the presence of class imbalance using flexible machine learning methods.

Affiliations

Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, P.O. Box 14155-6446, Tehran, Iran.

Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran.

Publication information

BMC Med Inform Decis Mak. 2022 Feb 10;22(1):36. doi: 10.1186/s12911-022-01775-z.

Abstract

BACKGROUND

Early detection and prediction of type 2 diabetes mellitus incidence from baseline measurements could reduce associated complications in the future. Because the incidence of diabetes is low relative to non-diabetes, accurate prediction of the minority diabetes class is more challenging.

METHODS

The performance of a deep neural network (DNN), extreme gradient boosting (XGBoost), and random forest (RF) was compared in predicting the minority diabetes class in Tehran Lipid and Glucose Study (TLGS) cohort data. Threshold changing, cost-sensitive learning, and over- and under-sampling strategies were compared as solutions to class imbalance for improving the algorithms' performance.
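As an illustration of the cost-sensitive idea mentioned above, the following minimal sketch computes per-class weights that are inversely proportional to class frequency. The abstract does not state which weighting scheme the authors used; the "balanced" heuristic, the function name `balanced_class_weights`, and the toy labels are all assumptions made for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    This is one common heuristic for cost-sensitive learning (assumed here,
    not taken from the study): errors on the rarer class cost more.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical toy labels: 1 = diabetes (minority), 0 = non-diabetes.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# Per-sample weights, which many learners accept when fitting a model.
sample_weights = [weights[y] for y in labels]
```

With weights of this kind, both classes contribute the same total weight to the training loss, so the learner is not rewarded for ignoring the minority class.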

RESULTS

The DNN, which had the highest accuracy in predicting diabetes (54.8%), outperformed XGBoost and RF in terms of AUROC, g-mean, and f1-measure on the original imbalanced data. Changing the threshold to the value that maximizes the f1-measure improved g-mean and f1-measure in all three algorithms. Repeated edited nearest neighbors (RENN) under-sampling for the DNN and cost-sensitive learning for the tree-based algorithms were the best solutions to the imbalance issue. In the DNN, RENN increased the ROC AUC, Precision-Recall AUC, g-mean, and f1-measure from 0.857, 0.603, 0.713, and 0.575 to 0.862, 0.608, 0.773, and 0.583, respectively. Weighting improved g-mean and f1-measure from 0.667 and 0.554 to 0.776 and 0.588 in XGBoost, and from 0.659 and 0.543 to 0.775 and 0.566 in RF, respectively. In RF, the ROC and Precision-Recall AUCs also increased from 0.840 and 0.578 to 0.846 and 0.591, respectively.
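The threshold-changing strategy and the two threshold-dependent metrics reported above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names and the toy scores are hypothetical, g-mean is taken as the square root of sensitivity times specificity, and candidate thresholds are simply the observed scores.

```python
import math


def f1_and_gmean(y_true, scores, threshold):
    """Compute (f1, g-mean) when predicting class 1 for scores >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0       # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


def best_f1_threshold(y_true, scores):
    """Pick the observed score that maximizes f1 when used as the threshold."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_and_gmean(y_true, scores, t)[0])


# Hypothetical predicted probabilities for an imbalanced toy sample.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.60]

default_f1, default_gmean = f1_and_gmean(y_true, scores, 0.5)
tuned = best_f1_threshold(y_true, scores)
tuned_f1, tuned_gmean = f1_and_gmean(y_true, scores, tuned)
```

On this toy sample the tuned threshold is lower than 0.5, so more minority cases are detected and both metrics rise. In practice the threshold would be chosen on validation data rather than on the data used for the final evaluation.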

CONCLUSION

Of all the metrics, g-mean showed the largest increase under every imbalance solution. Weighting and threshold changing are efficient strategies and are faster than resampling methods for handling class imbalance. Among the sampling strategies, under-sampling methods performed better than the others.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2819/8830137/a2e1ad8d307a/12911_2022_1775_Fig1_HTML.jpg
