IEEE Trans Cybern. 2017 Sep;47(9):2850-2861. doi: 10.1109/TCYB.2016.2579658. Epub 2016 Jun 21.
Class imbalance, where the number of samples differs across classes, is prevalent in numerous real-world machine learning applications. Traditional methods, which are biased toward the majority class, are ineffective because misclassifying rare events is comparatively costly. This paper proposes a novel evolutionary cluster-based oversampling ensemble framework, which combines a novel cluster-based synthetic data generation method with an evolutionary algorithm (EA) to create an ensemble. The proposed synthetic data generation method builds on contemporary ideas of using clusters to identify oversampling regions. The EA serves a twofold purpose: it optimizes the parameters of the data generation method and, by leveraging the characteristics of EAs, generates diverse examples, which reduces overall computational cost. The proposed method is evaluated on a set of 40 imbalanced datasets obtained from the University of California, Irvine, database and outperforms current state-of-the-art ensemble algorithms for class imbalance problems.
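To make the idea of cluster-based oversampling concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a binary problem, uses a plain k-means as a stand-in for whatever clustering the authors use, and generates synthetic minority samples by interpolating between two minority points in the same cluster. The function names `kmeans` and `cluster_oversample` and the parameters `n_clusters` and `seed` are hypothetical. In the framework described above, an EA would tune such parameters; here they are fixed by hand.

```python
import numpy as np


def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members (skip empty clusters).
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels


def cluster_oversample(X, y, minority_label, n_clusters=2, seed=0):
    """Balance a binary dataset by interpolating within minority clusters.

    Assumption: this is an illustrative stand-in, not the published method.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority = X[y == minority_label]
    # Number of synthetic samples needed to match the majority class size.
    n_needed = int((y != minority_label).sum() - len(minority))
    if n_needed <= 0 or len(minority) < 2:
        return X, y
    labels = kmeans(minority, min(n_clusters, len(minority)), rng)
    synthetic = []
    for _ in range(n_needed):
        # Pick a minority point, then a partner from the same cluster.
        i = rng.integers(len(minority))
        same_cluster = minority[labels == labels[i]]
        partner = same_cluster[rng.integers(len(same_cluster))]
        # The new sample lies on the segment between the two points.
        t = rng.random()
        synthetic.append(minority[i] + t * (partner - minority[i]))
    X_new = np.vstack([X, np.array(synthetic)])
    y_new = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_new, y_new
```

Calling `cluster_oversample(X, y, minority_label=1)` on a dataset with 20 majority and 5 minority samples returns the original rows followed by 15 synthetic minority rows, so both classes end up with 20 samples. Restricting interpolation to pairs within one cluster keeps synthetic points inside regions already occupied by the minority class, which is the motivation for identifying oversampling regions with clusters.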