Challenges and limitations of synthetic minority oversampling techniques in machine learning.

Authors

Alkhawaldeh Ibraheem M, Albalkhi Ibrahem, Naswhan Abdulqadir Jeprel

Affiliations

Faculty of Medicine, Mutah University, Karak 61710, Jordan.

Department of Neuroradiology, Alfaisal University, Great Ormond Street Hospital NHS Foundation Trust, London WC1N 3JH, United Kingdom.

Publication information

World J Methodol. 2023 Dec 20;13(5):373-378. doi: 10.5662/wjm.v13.i5.373.

DOI: 10.5662/wjm.v13.i5.373

PMID: 38229946

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10789107/

Abstract

Oversampling is the most widely used approach for dealing with class-imbalanced datasets, as shown by the plethora of oversampling methods developed over the last two decades. In this editorial, we discuss the issues with oversampling that stem from the possibility of overfitting and from the generation of synthetic cases that might not accurately represent the minority class. These limitations should be considered when using oversampling techniques. We also propose several alternative strategies for dealing with imbalanced data, as well as a perspective on future work.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/116c/10789107/ec4d64e85b49/WJM-13-373-g001.jpg

Similar articles

Selective oversampling approach for strongly imbalanced data.

PeerJ Comput Sci. 2021 Jun 18;7:e604. doi: 10.7717/peerj-cs.604. eCollection 2021.

Predicting the Cochlear Dead Regions Using a Machine Learning-Based Approach with Oversampling Techniques.

Medicina (Kaunas). 2021 Nov 2;57(11):1192. doi: 10.3390/medicina57111192.

Imbalanced learning: Improving classification of diabetic neuropathy from magnetic resonance imaging.

PLoS One. 2020 Dec 15;15(12):e0243907. doi: 10.1371/journal.pone.0243907. eCollection 2020.

Deep Learning-Based Imbalanced Classification With Fuzzy Support Vector Machine.

Front Bioeng Biotechnol. 2022 Jan 21;9:802712. doi: 10.3389/fbioe.2021.802712. eCollection 2021.

RACOG and wRACOG: Two Probabilistic Oversampling Techniques.

IEEE Trans Knowl Data Eng. 2015 Jan 1;27(1):222-234. doi: 10.1109/TKDE.2014.2324567. Epub 2014 May 16.

Comparing Sampling Strategies for Tackling Imbalanced Data in Human Activity Recognition.

Sensors (Basel). 2022 Feb 11;22(4):1373. doi: 10.3390/s22041373.

Oversampling the Minority Class in the Feature Space.

IEEE Trans Neural Netw Learn Syst. 2016 Sep;27(9):1947-61. doi: 10.1109/TNNLS.2015.2461436. Epub 2015 Aug 25.

Efficient treatment of outliers and class imbalance for diabetes prediction.

Artif Intell Med. 2020 Apr;104:101815. doi: 10.1016/j.artmed.2020.101815. Epub 2020 Feb 10.

DeepSMOTE: Fusing Deep Learning and SMOTE for Imbalanced Data.

IEEE Trans Neural Netw Learn Syst. 2023 Sep;34(9):6390-6404. doi: 10.1109/TNNLS.2021.3136503. Epub 2023 Sep 1.

Cited by

Predicting suicidality in people living with HIV in Uganda: a machine learning approach.

Front Psychiatry. 2025 Aug 15;16:1584335. doi: 10.3389/fpsyt.2025.1584335. eCollection 2025.

FADEL: Ensemble Learning Enhanced by Feature Augmentation and Discretization.

Bioengineering (Basel). 2025 Jul 30;12(8):827. doi: 10.3390/bioengineering12080827.

Radiomics-based classification of pediatric dental trauma in periapical radiographs: a preliminary study.

BMC Med Imaging. 2025 Aug 19;25(1):336. doi: 10.1186/s12880-025-01877-w.

Deep Learning-Based Recurrence Prediction in HER2-Low Breast Cancer: Comparison of MRI-Alone, Clinicopathologic-Alone, and Combined Models.

Diagnostics (Basel). 2025 Jul 29;15(15):1895. doi: 10.3390/diagnostics15151895.

Predicting Emergency Severity Index (ESI) level, hospital admission, and admitting ward in an emergency department using data-driven machine learning.

BMC Med Inform Decis Mak. 2025 Jul 28;25(1):281. doi: 10.1186/s12911-025-02941-9.

Prediction of birthweight with early and mid-pregnancy antenatal markers utilising machine learning and explainable artificial intelligence.

Sci Rep. 2025 Jul 19;15(1):26223. doi: 10.1038/s41598-025-11837-7.

DrugProtAI: A machine learning-driven approach for predicting protein druggability through feature engineering and robust partition-based ensemble methods.

Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf330.

Translational approach for dementia subtype classification using convolutional neural network based on EEG connectome dynamics.

Sci Rep. 2025 May 19;15(1):17331. doi: 10.1038/s41598-025-02018-7.

Predicting sepsis treatment decisions in the paediatric emergency department using machine learning: the AiSEPTRON study.

BMJ Paediatr Open. 2025 May 14;9(1):e003273. doi: 10.1136/bmjpo-2024-003273.

Optimizing predictive features using machine learning for early miscarriage risk following single vitrified-warmed blastocyst transfer.

Front Endocrinol (Lausanne). 2025 Apr 16;16:1557667. doi: 10.3389/fendo.2025.1557667. eCollection 2025.

References

Bias and Class Imbalance in Oncologic Data-Towards Inclusive and Transferrable AI in Large Scale Oncology Data Sets.

Cancers (Basel). 2022 Jun 12;14(12):2897. doi: 10.3390/cancers14122897.

A comprehensive data level analysis for cancer diagnosis on imbalanced data.

J Biomed Inform. 2019 Feb;90:103089. doi: 10.1016/j.jbi.2018.12.003. Epub 2019 Jan 3.
