Yang Yuxuan, Khorshidi Hadi Akbarzadeh, Aickelin Uwe
School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, Australia.
Cancer Health Services Research, Melbourne School of Population and Global Health, The University of Melbourne, Parkville, VIC, Australia.
Front Digit Health. 2024 Jul 26;6:1430245. doi: 10.3389/fdgth.2024.1430245. eCollection 2024.
There has been growing attention to multi-class classification problems, particularly those involving imbalanced class distributions. To address these challenges, various strategies, including data-level re-sampling and ensemble methods, have been introduced to improve the performance of predictive models and Artificial Intelligence (AI) algorithms in scenarios where a high level of imbalance is present. While most research and algorithm development has focused on binary classification, there is increasing interest in health informatics in multi-class classification on imbalanced datasets. Multi-class imbalance poses more complex challenges, because synthetic data must be generated carefully while preserving the relationships among the multiple classes. This review examines over-sampling methods tailored for medical and other datasets with multi-class imbalance. Out of 2,076 peer-reviewed papers identified through searches, 197 eligible papers were chosen and thoroughly reviewed for inclusion, narrowing to 37 studies selected for in-depth analysis. These studies are grouped into four categories: metric, adaptive, structure-based, and hybrid approaches. The most significant finding is the emerging trend toward hybrid resampling methods that combine the strengths of various techniques to effectively address the problem of imbalanced data. This paper provides an extensive analysis of each selected study, discusses their findings, and outlines directions for future research.