Challenges and limitations of synthetic minority oversampling techniques in machine learning.

Authors

Alkhawaldeh Ibraheem M, Albalkhi Ibrahem, Naswhan Abdulqadir Jeprel

Affiliations

Faculty of Medicine, Mutah University, Karak 61710, Jordan.

Department of Neuroradiology, Alfaisal University, Great Ormond Street Hospital NHS Foundation Trust, London WC1N 3JH, United Kingdom.

Publication information

World J Methodol. 2023 Dec 20;13(5):373-378. doi: 10.5662/wjm.v13.i5.373.

Abstract

Oversampling is the most widely used approach for dealing with class-imbalanced datasets, as shown by the many oversampling methods developed over the last two decades. In this editorial, we discuss problems with oversampling that stem from the risk of overfitting and from the generation of synthetic cases that may not accurately represent the minority class. These limitations should be considered when using oversampling techniques. We also propose several alternative strategies for dealing with imbalanced data and offer a perspective on future work.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/116c/10789107/ec4d64e85b49/WJM-13-373-g001.jpg
