Tang Yuchun, Zhang Yan-Qing, Chawla Nitesh V, Krasser Sven
IEEE Trans Syst Man Cybern B Cybern. 2009 Feb;39(1):281-8. doi: 10.1109/TSMCB.2008.2002909. Epub 2008 Dec 9.
Traditional classification algorithms can be limited in their performance on highly unbalanced data sets. A popular stream of work for countering the problem of class imbalance has been the application of a variety of sampling strategies. In this paper, we focus on designing modifications to support vector machines (SVMs) to appropriately tackle the problem of class imbalance. We incorporate different "rebalance" heuristics in SVM modeling, including cost-sensitive learning, oversampling, and undersampling. These SVM-based strategies are compared with various state-of-the-art approaches on a variety of data sets using several metrics, including G-mean, area under the receiver operating characteristic curve, F-measure, and area under the precision/recall curve. We show that we are able to surpass or match the previously known best algorithms on each data set. In particular, of the four SVM variations considered in this paper, the novel granular SVMs-repetitive undersampling algorithm (GSVM-RU) is the best in terms of both effectiveness and efficiency. GSVM-RU is effective because it can minimize the negative effect of information loss while maximizing the positive effect of data cleaning in the undersampling process. GSVM-RU is efficient because it extracts far fewer support vectors and hence greatly speeds up SVM prediction.
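As an illustration of why the abstract evaluates with G-mean and F-measure rather than plain accuracy, the following minimal sketch computes all three from predicted labels. It is not code from the paper: the function names, the 10-versus-990 class split, and the two hypothetical classifiers are assumptions chosen only for this example. It uses the standard definitions, with G-mean as the geometric mean of sensitivity and specificity and F-measure as the weighted harmonic mean of precision and recall.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the minority class label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (positive recall) and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def f_measure(y_true, y_pred, positive=1, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical imbalanced test set: 10 positives followed by 990 negatives.
y_true = [1] * 10 + [0] * 990

# Hypothetical classifier A: always predicts the majority (negative) class.
always_negative = [0] * 1000

# Hypothetical classifier B: finds 8 of 10 positives, raises 50 false alarms.
rebalanced = [1] * 8 + [0] * 2 + [1] * 50 + [0] * 940

for name, y_pred in [("always negative", always_negative),
                     ("rebalanced", rebalanced)]:
    print(name,
          "accuracy=%.3f" % accuracy(y_true, y_pred),
          "g_mean=%.3f" % g_mean(y_true, y_pred),
          "f1=%.3f" % f_measure(y_true, y_pred))
```

In this made-up setting the always-negative classifier reaches the higher accuracy even though it never detects a positive example, while its G-mean and F-measure are both zero. Metrics that account for minority-class performance expose that failure, which is presumably why the comparison relies on them.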