IEEE Trans Neural Netw Learn Syst. 2016 Sep;27(9):1947-61. doi: 10.1109/TNNLS.2015.2461436. Epub 2015 Aug 25.
The imbalanced nature of some real-world data is one of the current challenges for machine learning researchers. One common approach oversamples the minority class through convex combinations of its patterns. We explore the general idea of synthetic oversampling in the feature space induced by a kernel function (as opposed to the input space). If the kernel function matches the underlying problem, the classes will be linearly separable and synthetically generated patterns will lie in the minority class region. Since the feature space is not directly accessible, we use the empirical feature space (EFS), a Euclidean space isomorphic to the feature space, for oversampling purposes. The proposed method is framed in the context of support vector machines, where imbalanced data sets can pose a serious hindrance. The idea is investigated in three scenarios: 1) oversampling in the full and reduced-rank EFSs; 2) a kernel learning technique maximizing the data class separation to study the influence of the feature space structure (implicitly defined by the kernel function); and 3) a unified framework for preferential oversampling that spans some of the previous approaches in the literature. We support our investigation with extensive experiments over 50 imbalanced data sets.
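The core idea can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not state: an RBF kernel, an EFS built from the eigendecomposition K = P Λ Pᵀ of the training kernel matrix (training patterns mapped to the rows of P Λ^{1/2}), and synthetic patterns drawn as convex combinations of randomly chosen minority pairs rather than nearest neighbours. The function names `rbf_kernel`, `empirical_feature_space` and `oversample_minority` are hypothetical.

```python
import numpy as np


def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Y (assumed kernel)."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def empirical_feature_space(K, rank=None, tol=1e-10):
    """Map the training patterns into an empirical feature space.

    With K = P diag(lam) P^T, the rows of Z = P * sqrt(lam) satisfy
    Z Z^T ~= K, so a linear kernel on Z reproduces the original kernel.
    Passing `rank` keeps only the leading components (reduced-rank EFS).
    """
    lam, P = np.linalg.eigh(K)            # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]         # reorder to descending
    lam, P = lam[order], P[:, order]
    keep = lam > tol * lam[0]             # drop numerically null directions
    lam, P = lam[keep], P[:, keep]
    if rank is not None:
        lam, P = lam[:rank], P[:, :rank]
    return P * np.sqrt(lam)


def oversample_minority(Z, y, minority_label, n_new, rng):
    """Draw synthetic minority patterns as convex combinations in the EFS.

    Simplification: pairs are chosen uniformly at random, not by
    nearest-neighbour search.
    """
    Zm = Z[y == minority_label]
    i = rng.integers(0, len(Zm), size=n_new)
    j = rng.integers(0, len(Zm), size=n_new)
    delta = rng.random((n_new, 1))        # mixing weights in [0, 1)
    return Zm[i] + delta * (Zm[j] - Zm[i])
```

Stacking the synthetic rows under `Z` and taking the Gram matrix of the augmented set gives a kernel matrix that could be passed to a kernel classifier accepting a precomputed kernel. Passing `rank` corresponds to the reduced-rank scenario mentioned in the abstract.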