García-Pedrajas Nicolás E, Cuevas-Muñoz José M, de Haro-García Aida
Department of Computer Science, University of Córdoba, 14071 Córdoba, Spain
Evol Comput. 2025 Apr 16:1-35. doi: 10.1162/evco_a_00374.
One of the most common problems in data mining applications is the uneven distribution of classes, which appears in many real-world scenarios. The class of interest is often highly underrepresented in the given dataset, which harms the performance of most classifiers. One of the most successful methods for addressing the class imbalance problem is to oversample the minority class using synthetic samples. Since the original algorithm, the synthetic minority oversampling technique (SMOTE), introduced this method, numerous versions have emerged, each of which is based on a specific hypothesis about where and how to generate new synthetic instances. In this paper, we propose a different approach based exclusively on evolutionary computation that imposes no constraints on the creation of new synthetic instances. Majority class undersampling is also incorporated into the evolutionary process. A thorough comparison involving three classification methods, 85 datasets, and more than 90 class-imbalance strategies shows the advantages of our proposal.
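For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of SMOTE-style oversampling: each synthetic instance is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbors. It is not the evolutionary method proposed in the paper, whose details the abstract does not give. The function name `smote_like_oversample`, its parameters, and the toy data are assumptions made for illustration only.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority neighbors.

    Hypothetical helper for illustration; needs at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy minority class with two features (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies between them. This is the kind of constraint on where new instances may appear that the abstract says the proposed evolutionary approach does not impose.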