Xu Zhaozhao, Shen Derong, Kou Yue, Nie Tiezheng
IEEE Trans Neural Netw Learn Syst. 2024 Mar;35(3):3740-3753. doi: 10.1109/TNNLS.2022.3197156. Epub 2024 Feb 29.
Data imbalance is a common phenomenon in machine learning. In imbalanced data classification, minority-class samples are far fewer than majority-class samples, which makes it difficult for classifiers to learn the minority class effectively. The synthetic minority oversampling technique (SMOTE) improves the sensitivity of classifiers to the minority class by synthesizing new minority samples rather than duplicating existing ones. However, the process of synthesizing new samples in SMOTE may introduce problems such as "noisy samples" and "boundary samples." To address these problems, we propose a synthetic minority oversampling technique based on Gaussian mixture model filtering (GMF-SMOTE). GMF-SMOTE uses the expectation-maximization (EM) algorithm based on the Gaussian mixture model to group the imbalanced data. Then, an expectation-maximization filtering algorithm is used to filter out the "noisy samples" and "boundary samples" in the resulting subclasses. Finally, we design two dynamic oversampling ratios to synthesize majority and minority samples. Experimental results show that GMF-SMOTE performs better than traditional oversampling algorithms on 20 UCI datasets. Averaged over the UCI datasets synthesized by GMF-SMOTE, the sensitivity and specificity of random forest (RF) are 97.49% and 97.02%, respectively. We also record the G-mean and MCC of the RF, which are 97.32% and 94.80%, respectively, significantly better than those of the traditional oversampling algorithms. More importantly, two statistical tests show that GMF-SMOTE is significantly better than the traditional oversampling algorithms.