Liu Xu-Ying, Wu Jianxin, Zhou Zhi-Hua
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China.
IEEE Trans Syst Man Cybern B Cybern. 2009 Apr;39(2):539-50. doi: 10.1109/TSMCB.2008.2007853. Epub 2008 Dec 16.
Undersampling is a popular method for dealing with class-imbalance problems: it uses only a subset of the majority class and is therefore very efficient. Its main deficiency is that many majority class examples are ignored. We propose two algorithms to overcome this deficiency. EasyEnsemble samples several subsets from the majority class, trains a learner on each of them, and combines the outputs of those learners. BalanceCascade trains the learners sequentially, where in each step the majority class examples that are correctly classified by the currently trained learners are removed from further consideration. Experimental results show that both methods achieve higher Area Under the ROC Curve, F-measure, and G-mean values than many existing class-imbalance learning methods. Moreover, when the same number of weak classifiers is used, their training time is approximately the same as that of undersampling, which is significantly faster than other methods.
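The two sampling strategies described above can be sketched as follows. This is a minimal illustration, not the authors' implementation (the paper builds on boosted ensembles of weak classifiers); the toy 1-D threshold learner, the function names, and all parameters here are assumptions made for the sake of a self-contained example.

```python
import random
import statistics

def train_threshold(pos, neg):
    """Toy base learner (an assumption, standing in for the paper's weak
    classifiers): label x positive when it lies on the positive side of
    the midpoint between the two class means."""
    t = (statistics.mean(pos) + statistics.mean(neg)) / 2
    sign = 1 if statistics.mean(pos) > statistics.mean(neg) else -1
    return lambda x: sign * (x - t) > 0

def easy_ensemble(minority, majority, n_subsets=4, seed=0):
    """EasyEnsemble: train one learner per balanced random subset of the
    majority class, then combine the learners by majority vote."""
    rng = random.Random(seed)
    learners = [
        train_threshold(minority, rng.sample(majority, len(minority)))
        for _ in range(n_subsets)
    ]
    return lambda x: sum(h(x) for h in learners) > n_subsets / 2

def balance_cascade(minority, majority, n_rounds=4, seed=0):
    """BalanceCascade: train learners sequentially; after each round,
    remove the majority examples the current learner already classifies
    correctly, so later rounds focus on the harder ones."""
    rng = random.Random(seed)
    pool, learners = list(majority), []
    for _ in range(n_rounds):
        if len(pool) < len(minority):
            break  # not enough majority examples left for a balanced subset
        h = train_threshold(minority, rng.sample(pool, len(minority)))
        learners.append(h)
        pool = [x for x in pool if h(x)]  # keep only misclassified negatives
    return lambda x: sum(h(x) for h in learners) > len(learners) / 2
```

Both functions return a combined classifier; the efficiency claim in the abstract follows because each base learner sees only a minority-sized slice of the majority class.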