Balancing Acts: Tackling Data Imbalance in Machine Learning for Predicting Myocardial Infarction in Type 2 Diabetes.

Affiliations

University of York, York, YO10 5GH, UK.

Bradford Teaching Hospitals NHS Foundation Trust, Bradford, BD9 6RJ, UK.

Publication information

Stud Health Technol Inform. 2024 Aug 22;316:626-630. doi: 10.3233/SHTI240491.

DOI:10.3233/SHTI240491

PMID:39176819

Abstract

Type 2 Diabetes (T2D) is a prevalent lifelong health condition, and over 500 million adults are predicted to be diagnosed with it by 2040. T2D can develop at any age and, if it progresses, may cause serious comorbidities. One of the most critical T2D-related comorbidities is Myocardial Infarction (MI), commonly known as a heart attack. MI is a life-threatening medical emergency, so it is important to predict it and intervene in a timely manner. The use of Machine Learning (ML) for clinical prediction is gaining pace, but class imbalance in predictive models is a key challenge to deploying the technology in a trustworthy way. Class imbalance may lead to bias and overfitting in ML models and to misleading interpretations of their outputs. In our study, we showed how the systematic use of Class Imbalance Handling (CIH) techniques may improve the performance of ML models. We used the Connected Bradford dataset, which consists of over one million real-world health records. Three commonly used CIH techniques, Oversampling, Undersampling, and Class Weighting (CW), were applied to Naive Bayes (NB), Neural Network (NN), Random Forest (RF), Support Vector Machine (SVM), and Ensemble models. CW outperformed the other techniques, achieving the highest Accuracy and F1 values of 0.9948 and 0.9556, respectively. Applying the most appropriate CIH technique to ML models that use real-world healthcare data provides promising results for helping to reduce the risk of MI in patients with T2D.
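
The three CIH techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, which the abstract does not describe. The toy labels (95 negatives, 5 positives), the function names, and the "balanced" weighting formula (n_samples / (n_classes * class_count), a common heuristic) are all assumptions chosen for illustration. The last function shows why Accuracy alone can mislead on imbalanced data: a model that always predicts the majority class scores high Accuracy but an F1 of zero.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Class weighting: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def random_oversample(rows, labels, seed=0):
    """Oversampling: duplicate random minority rows up to the largest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Undersampling: keep a random subset of each class at the smallest class size."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 for the positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical toy data: 95 records without MI (0) and 5 with MI (1).
labels = [0] * 95 + [1] * 5
rows = [[i] for i in range(100)]

weights = balanced_class_weights(labels)
_, over_labels = random_oversample(rows, labels)
_, under_labels = random_undersample(rows, labels)

# A baseline that always predicts "no MI" looks accurate but finds no cases.
baseline_acc, baseline_f1 = accuracy_and_f1(labels, [0] * 100)

print("class weights:", weights)
print("after oversampling:", Counter(over_labels))
print("after undersampling:", Counter(under_labels))
print("majority baseline accuracy / F1:", baseline_acc, baseline_f1)
```

In this sketch, oversampling and undersampling change the training set itself, whereas class weighting leaves the data untouched and instead scales each record's contribution to the training loss, which is one reason it is often tried first on large datasets.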
