Department of Computer Science and Engineering, Graduate School, Soongsil University, Seoul, Korea.
PLoS One. 2022 Jul 28;17(7):e0271260. doi: 10.1371/journal.pone.0271260. eCollection 2022.
In numerous classification problems, class distribution is not balanced. For example, positive examples are rare in the fields of disease diagnosis and credit card fraud detection. General machine learning methods are known to be suboptimal for such imbalanced classification. One popular solution is to balance training data by oversampling the underrepresented (or undersampling the overrepresented) classes before applying machine learning algorithms. However, despite its popularity, the effectiveness of sampling has not been rigorously and comprehensively evaluated. This study assessed combinations of seven sampling methods and eight machine learning classifiers (56 combinations in total) using 31 datasets with varying degrees of imbalance. We used the areas under the precision-recall curve (AUPRC) and the receiver operating characteristic curve (AUROC) as the performance measures. The AUPRC is known to be more informative for imbalanced classification than the AUROC. We observed that sampling significantly changed the performance of the classifier (paired t-tests, P < 0.05) in only a few cases (12.2% for AUPRC and 10.0% for AUROC). Surprisingly, sampling was more likely to reduce rather than improve the classification performance. Moreover, the adverse effects of sampling were more pronounced in AUPRC than in AUROC. Among the sampling methods, undersampling performed worse than the others. In addition, sampling was more effective at improving linear classifiers. Most importantly, we did not need sampling to obtain the optimal classifier for most of the 31 datasets. We also found two interesting examples in which sampling significantly reduced AUPRC while significantly improving AUROC (paired t-tests, P < 0.05). In conclusion, the applicability of sampling is limited because it can be ineffective or even harmful. Furthermore, the choice of the performance measure is crucial for decision making. Our results provide valuable insights into the effect and characteristics of sampling for imbalanced classification.
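The following is a minimal sketch, not the authors' actual pipeline, of the kind of comparison the abstract describes: resample only the training split, train the same classifier with and without sampling, and compare AUPRC and AUROC on the untouched test split. It assumes scikit-learn and imbalanced-learn and uses a synthetic imbalanced dataset as a stand-in for one of the 31 real datasets.

```python
# Sketch: effect of random oversampling on AUPRC vs. AUROC (hypothetical setup).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from imblearn.over_sampling import RandomOverSampler

# Synthetic data with ~5% positives; a placeholder, not one of the study's datasets.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.3, random_state=0)

def evaluate(X_train, y_train):
    """Train a linear classifier and score the held-out test set."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores = clf.predict_proba(X_te)[:, 1]
    # average_precision_score is used here as the AUPRC estimate.
    return average_precision_score(y_te, scores), roc_auc_score(y_te, scores)

# Baseline: no sampling.
auprc_base, auroc_base = evaluate(X_tr, y_tr)

# Oversample the minority class in the training split only.
X_os, y_os = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)
auprc_os, auroc_os = evaluate(X_os, y_os)

print(f"no sampling   AUPRC={auprc_base:.3f}  AUROC={auroc_base:.3f}")
print(f"oversampling  AUPRC={auprc_os:.3f}  AUROC={auroc_os:.3f}")
```

In the study, such comparisons were repeated across sampling methods, classifiers, and datasets, with paired t-tests used to judge whether the change in AUPRC or AUROC was significant.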