Wang Qi, Luo ZhiHao, Huang JinCai, Feng YangHe, Liu Zhong
Science and Technology on Information Systems Engineering Laboratory, College of Information System and Management, National University of Defense Technology, Changsha, Hunan, China.
Comput Intell Neurosci. 2017;2017:1827016. doi: 10.1155/2017/1827016. Epub 2017 Jan 30.
Class imbalance is ubiquitous in real-world data and has attracted much interest from various domains. Learning directly from an imbalanced dataset may yield unsatisfying results, because the learner overfocuses on overall identification accuracy and derives a suboptimal model. Various methodologies have been developed to tackle this problem, including sampling, cost-sensitive learning, and hybrids of the two. However, the samples near the decision boundary, which carry more discriminative information, deserve particular attention, and the skew of the boundary can be corrected by constructing synthetic samples. Guided by geometric intuition, we designed a new synthetic minority oversampling technique that incorporates this borderline information. Moreover, ensemble models tend to capture more complicated and more robust decision boundaries in practice. Taking these factors into consideration, we propose a novel ensemble method, called Bagging of Extrapolation Borderline-SMOTE SVM (BEBS), for imbalanced data learning (IDL) problems. Experiments on open-access datasets showed significantly superior performance for our model, and we give a persuasive and intuitive explanation of why the method works. As far as we know, this is the first model that combines an ensemble of SVMs with borderline information to address this problem.
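The oversampling idea summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact extrapolation rule, so everything below is an assumption rather than the authors' implementation: borderline samples are detected with the usual Borderline-SMOTE neighbour test (at least half, but not all, of the `m` nearest neighbours belong to the majority class), and each synthetic point is placed beyond a borderline sample, on the side away from one of its minority neighbours. The function names and the parameters `m`, `k`, and `max_step` are hypothetical.

```python
import math
import random


def nearest(idx, points, candidates, k):
    """Indices of the k candidates closest to points[idx], excluding idx itself."""
    ranked = sorted((j for j in candidates if j != idx),
                    key=lambda j: math.dist(points[idx], points[j]))
    return ranked[:k]


def borderline_indices(X, y, minority, m=5):
    """Minority samples whose m nearest neighbours are mostly, but not all, majority."""
    all_idx = range(len(X))
    danger = []
    for i in all_idx:
        if y[i] != minority:
            continue
        neighbours = nearest(i, X, all_idx, m)
        n_major = sum(1 for j in neighbours if y[j] != minority)
        if m / 2 <= n_major < m:
            danger.append(i)
    return danger


def extrapolate_oversample(X, y, minority, n_new, m=5, k=5, max_step=0.5, seed=0):
    """Create n_new synthetic minority points near the borderline samples.

    Assumed rule (not taken from the paper): start at a borderline sample and
    step away from one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    danger = borderline_indices(X, y, minority, m)
    if not danger or len(minority_idx) < 2:
        return []
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(danger)
        j = rng.choice(nearest(i, X, minority_idx, k))
        step = rng.uniform(0.0, max_step)
        # Extrapolate: move from the neighbour through the base point and beyond it.
        synthetic.append(tuple(a + step * (a - b) for a, b in zip(X[i], X[j])))
    return synthetic


if __name__ == "__main__":
    # Toy data on a line: six majority points (label 0), four minority points (label 1).
    X = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (4.8, 0.0),
         (5.0, 0.0), (5.5, 0.0), (6.8, 0.0), (9.0, 0.0)]
    y = [0] * 6 + [1] * 4
    print(borderline_indices(X, y, minority=1, m=3))
    print(extrapolate_oversample(X, y, minority=1, n_new=3, m=3, k=2))
```

In a bagging setup of the kind the abstract describes, each base SVM would presumably be trained on its own bootstrap sample augmented in this way, with the ensemble combining their votes. That step is omitted here because it needs an SVM implementation from an external library.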