A review on over-sampling techniques in classification of multi-class imbalanced datasets: insights for medical problems.

Author information

Yang Yuxuan, Khorshidi Hadi Akbarzadeh, Aickelin Uwe

Affiliations

School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, Australia.

Cancer Health Services Research, Melbourne School of Population and Global Health, The University of Melbourne, Parkville, VIC, Australia.

Publication information

Front Digit Health. 2024 Jul 26;6:1430245. doi: 10.3389/fdgth.2024.1430245. eCollection 2024.

DOI:10.3389/fdgth.2024.1430245

PMID:39131184

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11310152/

Abstract

Multi-class classification problems, particularly those with imbalanced class distributions, have received growing attention. To address these challenges, various strategies, including data-level re-sampling and ensemble methods, have been introduced to improve the performance of predictive models and Artificial Intelligence (AI) algorithms when the level of imbalance is severe. While most research and algorithm development has focused on binary classification, health informatics has seen increasing interest in multi-class classification on imbalanced datasets. Multi-class imbalance poses more complex challenges, because synthetic data must be generated carefully while preserving the relationships among the multiple classes. This review examines over-sampling methods tailored for medical and other datasets with multi-class imbalance. Of 2,076 peer-reviewed papers identified through searches, 197 eligible papers were reviewed in full for inclusion, and 37 studies were selected for in-depth analysis. These studies are grouped into four categories: metric, adaptive, structure-based, and hybrid approaches. The most significant finding is an emerging trend toward hybrid resampling methods that combine the strengths of several techniques to address imbalanced data. This paper analyses each selected study in detail, discusses their findings, and outlines directions for future research.
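To make the idea of data-level over-sampling concrete, the following is a minimal sketch of SMOTE-style interpolation applied to each non-majority class in turn. It is not taken from any of the reviewed studies: the function name `oversample_multiclass`, the parameter `k`, and the toy data are all assumptions made for illustration. It also treats each class in isolation, so it does not tackle the harder problem the abstract highlights, namely preserving the relationships among classes.

```python
import math
import random
from collections import defaultdict


def oversample_multiclass(X, y, k=3, seed=0):
    """Hypothetical helper: grow every class to the size of the largest one.

    X is a list of feature vectors (lists of floats), y the matching labels.
    New points are interpolated between a sample and one of its k nearest
    neighbours from the same class (the core SMOTE idea).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(list(features))
    target = max(len(rows) for rows in by_class.values())

    X_out, y_out = [list(f) for f in X], list(y)
    for label, rows in by_class.items():
        for _ in range(target - len(rows)):
            base = rng.choice(rows)
            others = [r for r in rows if r is not base]
            if not others:
                # A class with a single sample can only be duplicated.
                synthetic = list(base)
            else:
                neighbours = sorted(others, key=lambda r: math.dist(base, r))[:k]
                mate = rng.choice(neighbours)
                gap = rng.random()  # position on the segment base -> mate
                synthetic = [b + gap * (m - b) for b, m in zip(base, mate)]
            X_out.append(synthetic)
            y_out.append(label)
    return X_out, y_out


# Toy three-class dataset (made up): class sizes 4, 2 and 1.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
     [5.0, 5.0], [5.2, 5.1],
     [9.0, 0.0]]
y = ["a", "a", "a", "a", "b", "b", "c"]
X_bal, y_bal = oversample_multiclass(X, y)
```

After the call, each of the three classes has four samples. Because the synthetic points are placed only with reference to their own class, a sketch like this can push minority samples into regions occupied by other classes, which is one reason purpose-built multi-class methods exist.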

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1ed0/11310152/f9adf0a7cca7/fdgth-06-1430245-g001.jpg

Similar articles

A review on over-sampling techniques in classification of multi-class imbalanced datasets: insights for medical problems.

Front Digit Health. 2024 Jul 26;6:1430245. doi: 10.3389/fdgth.2024.1430245. eCollection 2024.

RSMOTE: improving classification performance over imbalanced medical datasets.

Health Inf Sci Syst. 2020 Jun 12;8(1):22. doi: 10.1007/s13755-020-00112-w. eCollection 2020 Dec.

Classifying adverse drug reactions from imbalanced twitter data.

Int J Med Inform. 2019 Sep;129:122-132. doi: 10.1016/j.ijmedinf.2019.05.017. Epub 2019 May 30.

Improved support vector machine classification for imbalanced medical datasets by novel hybrid sampling combining modified mega-trend-diffusion and bagging extreme learning machine model.

Math Biosci Eng. 2023 Sep 15;20(10):17672-17701. doi: 10.3934/mbe.2023786.

Structure-activity relationship-based chemical classification of highly imbalanced Tox21 datasets.

J Cheminform. 2020 Oct 27;12(1):66. doi: 10.1186/s13321-020-00468-x.

A Pareto-based Ensemble with Feature and Instance Selection for Learning from Multi-Class Imbalanced Datasets.

Int J Neural Syst. 2017 Sep;27(6):1750028. doi: 10.1142/S0129065717500289. Epub 2017 Apr 11.

Experimental Study and Comparison of Imbalance Ensemble Classifiers with Dynamic Selection Strategy.

Entropy (Basel). 2021 Jun 28;23(7):822. doi: 10.3390/e23070822.

AI and semantic ontology for personalized activity eCoaching in healthy lifestyle recommendations: a meta-heuristic approach.

BMC Med Inform Decis Mak. 2023 Dec 1;23(1):278. doi: 10.1186/s12911-023-02364-4.

Adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique algorithm for tackling binary imbalanced datasets in biomedical data classification.

BioData Min. 2016 Dec 1;9:37. doi: 10.1186/s13040-016-0117-1. eCollection 2016.

Inverse free reduced universum twin support vector machine for imbalanced data classification.

Neural Netw. 2023 Jan;157:125-135. doi: 10.1016/j.neunet.2022.10.003. Epub 2022 Oct 15.

Cited by

Radiomics-based classification of pediatric dental trauma in periapical radiographs: a preliminary study.

BMC Med Imaging. 2025 Aug 19;25(1):336. doi: 10.1186/s12880-025-01877-w.

A novel deep learning technique for multi classify Alzheimer disease: hyperparameter optimization technique.

Front Artif Intell. 2025 Apr 24;8:1558725. doi: 10.3389/frai.2025.1558725. eCollection 2025.

Physiological, Psychological, and Functional Health Determinants of Depressive Symptoms Among the Elderly in India: Evaluation of Classification Performance of XGBoost Models.

Indian J Psychol Med. 2025 Jan 25:02537176241311196. doi: 10.1177/02537176241311196.
