Department of Computer Science, Edge Hill University, Ormskirk, United Kingdom.
Artif Intell Med. 2020 Apr;104:101815. doi: 10.1016/j.artmed.2020.101815. Epub 2020 Feb 10.
Learning from outliers and imbalanced data remains one of the major difficulties for machine learning classifiers. Among the numerous techniques dedicated to tackling this problem, data preprocessing solutions are known to be efficient and easy to implement. In this paper, we propose a selective data preprocessing approach that embeds knowledge of the outlier instances into an artificially generated subset to achieve an even distribution. The Synthetic Minority Oversampling Technique (SMOTE) was used to balance the training data by introducing artificial minority instances. Before balancing, however, the outliers were identified and oversampled, irrespective of class. The aim is to balance the training dataset while controlling the effect of outliers. The experiments show that such selective oversampling empowers SMOTE, ultimately leading to improved classification performance.
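The two-step pipeline the abstract describes (first oversample outliers irrespective of class, then balance classes with SMOTE-style interpolation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the outlier detector, so a simple z-score rule stands in for it, and `smote_like` is a simplified re-implementation of SMOTE's nearest-neighbour interpolation.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, rng=None):
    # Core SMOTE idea: create each synthetic sample by interpolating
    # between a randomly chosen minority point and one of its k nearest
    # minority-class neighbours (simplified sketch, not the imblearn API).
    rng = rng if rng is not None else np.random.default_rng(0)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours per point
    idx = rng.integers(0, len(X_min), n_new)    # base points
    nbr = nn[idx, rng.integers(0, k, n_new)]    # one random neighbour each
    lam = rng.random((n_new, 1))                # interpolation weights in [0, 1)
    return X_min[idx] + lam * (X_min[nbr] - X_min[idx])

def selective_oversample(X, y, z_thresh=2.5, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    # Step 1: flag outliers irrespective of class (here via a per-feature
    # z-score threshold, a stand-in detector) and duplicate them once, so
    # their information is embedded before the balancing step.
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    out = (z > z_thresh).any(axis=1)
    X = np.vstack([X, X[out]])
    y = np.concatenate([y, y[out]])
    # Step 2: balance the classes with SMOTE-style synthetic minority samples.
    cls, counts = np.unique(y, return_counts=True)
    minority = cls[counts.argmin()]
    n_new = counts.max() - counts.min()
    X_syn = smote_like(X[y == minority], n_new, rng=rng)
    return np.vstack([X, X_syn]), np.concatenate([y, np.full(n_new, minority)])
```

In practice the interpolation step would typically be delegated to an established library such as imbalanced-learn's `SMOTE`; the sketch above only makes the ordering of the two steps explicit.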