IEEE Trans Cybern. 2015 Nov;45(11):2402-12. doi: 10.1109/TCYB.2014.2372060. Epub 2014 Dec 2.
Undersampling is a widely adopted method for dealing with imbalanced pattern classification problems. Current methods mainly rely on either random resampling of the majority class or resampling at the decision boundary. Random undersampling fails to take informative samples in the data into account, while resampling at the decision boundary is sensitive to class overlap. Both techniques ignore the distribution information of the training dataset. In this paper, we propose a diversified sensitivity-based undersampling method. Samples of the majority class are clustered to capture the distribution information and enhance the diversity of the resampling. A stochastic sensitivity measure is applied to select samples from clusters of both the majority class and the minority class. By iteratively clustering and sampling, a balanced set of samples yielding high classifier sensitivity is selected. The proposed method shows good generalization capability on 14 UCI datasets.
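The cluster-then-select loop described in the abstract can be illustrated with a short Python sketch. This is only one plausible reading of the abstract, not the authors' implementation. Everything named here is an assumption made for illustration: the function names (`kmeans`, `fit_scorer`, `sensitivity`, `undersample`), the nearest-class-mean scorer standing in for the classifier, the Monte Carlo estimate standing in for the stochastic sensitivity measure, and the parameters `q`, `rounds`, and `draws`. As a further simplification, the sketch keeps every minority sample and selects only from majority-class clusters, whereas the abstract selects from clusters of both classes.

```python
import numpy as np

def kmeans(X, k, rng, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old center
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def fit_scorer(X_maj, X_min):
    """Hypothetical stand-in classifier: output near 1 means 'minority-like'."""
    mu_maj, mu_min = X_maj.mean(axis=0), X_min.mean(axis=0)
    def score(X):
        d_maj = ((X - mu_maj) ** 2).sum(axis=-1)
        d_min = ((X - mu_min) ** 2).sum(axis=-1)
        # Numerically stable logistic function of the distance gap.
        return 0.5 * (1.0 + np.tanh(0.5 * (d_maj - d_min)))
    return score

def sensitivity(score, X, q, rng, draws=50):
    """Monte Carlo estimate of E[(f(x + dx) - f(x))^2], dx uniform in [-q, q]^d.
    Assumed proxy for the paper's stochastic sensitivity measure."""
    base = score(X)
    noise = rng.uniform(-q, q, size=(draws,) + X.shape)
    return ((score(X[None] + noise) - base[None]) ** 2).mean(axis=0)

def undersample(X_maj, X_min, q=0.1, rounds=3, seed=0):
    """Return indices of majority samples, at most one per cluster."""
    rng = np.random.default_rng(seed)
    X_maj = np.asarray(X_maj, dtype=float)
    X_min = np.asarray(X_min, dtype=float)
    k = min(len(X_min), len(X_maj))  # aim for a balanced set
    chosen = rng.choice(len(X_maj), size=k, replace=False)  # random start
    for _ in range(rounds):
        # Refit the stand-in classifier on the current balanced subset.
        score = fit_scorer(X_maj[chosen], X_min)
        sens = sensitivity(score, X_maj, q, rng)
        # Re-cluster the majority class and keep, per cluster,
        # the sample with the highest estimated sensitivity.
        labels = kmeans(X_maj, k, rng)
        groups = [np.flatnonzero(labels == j) for j in range(k)]
        chosen = np.array([g[sens[g].argmax()] for g in groups if len(g)])
    return chosen
```

Calling `undersample(X_maj, X_min)` with two NumPy arrays of shape `(n_samples, n_features)` returns row indices into `X_maj`. Because k-means can leave a cluster empty, the result may contain slightly fewer indices than there are minority samples, so the selected set is only approximately balanced in this sketch.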