A multiple combined method for rebalancing medical data with class imbalances.

Author information

Department of Information Management, National Yunlin University of Science & Technology, Touliou, Yunlin, 640, Taiwan.

Publication information

Comput Biol Med. 2021 Jul;134:104527. doi: 10.1016/j.compbiomed.2021.104527. Epub 2021 May 31.

Abstract

Most classification algorithms assume that classes are balanced, yet datasets with class imbalances are everywhere. Real medical datasets are imbalanced, which severely degrades identification models and sacrifices classification accuracy on the minority class, even though that class is the most influential and representative. Medical decisions are often irreversible: the tolerance for misjudgment is low, and errors may cause irreparable harm to patients. Therefore, this study proposes a multiple combined method to rebalance medical data with class imbalances. The combined methods include (1) resampling methods (synthetic minority oversampling technique [SMOTE] and undersampling [US]), (2) particle swarm optimization (PSO), and (3) MetaCost. Two experiments on nine medical datasets were conducted to verify the proposed method and compare it with the listed methods. A decision tree is used to generate decision rules so that the results are easy to understand. The results show that (1) the proposed method with ensemble learning can improve the area under the receiver operating characteristic curve (AUC), recall, precision, and F1; (2) MetaCost can increase sensitivity; (3) SMOTE can effectively enhance AUC; (4) US can improve sensitivity, F1, and misclassification cost in data with a high class imbalance ratio; and (5) PSO-based attribute selection can increase sensitivity and reduce data dimensionality. Finally, we suggest that for datasets with an imbalance ratio > 9, decisions should be based on the US results. When the imbalance ratio is < 9, the decision-maker can consider the SMOTE and US results together to identify the best decision.
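
The closing guideline can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the imbalance ratio, so the sketch assumes the common definition (majority-class count divided by minority-class count), and it uses plain random undersampling as a stand-in for US. The function names `imbalance_ratio`, `random_undersample`, and `resampling_to_consult` are hypothetical, and the handling of a ratio exactly equal to 9 is an assumption, since the abstract covers only > 9 and < 9.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes")
    return max(counts.values()) / min(counts.values())


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def resampling_to_consult(ratio, threshold=9.0):
    """Map an imbalance ratio to the resampling results the abstract says to consult."""
    if ratio > threshold:
        return ["US"]
    if ratio < threshold:
        return ["SMOTE", "US"]
    # The abstract does not cover ratio == threshold; considering both is an assumption.
    return ["SMOTE", "US"]


# Toy data: 95 majority samples (label 0) and 5 minority samples (label 1).
labels = [0] * 95 + [1] * 5
rows = [[i] for i in range(len(labels))]

ratio = imbalance_ratio(labels)
balanced_rows, balanced_labels = random_undersample(rows, labels)
print(ratio, resampling_to_consult(ratio), Counter(balanced_labels))
```

In a fuller pipeline, the undersampled (or SMOTE-oversampled) data would then be passed to attribute selection and a cost-sensitive decision tree; those steps are omitted here because the abstract does not give their parameters.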
