SMOTE for high-dimensional class-imbalanced data.

Affiliation

Institute for Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia.

Publication information

BMC Bioinformatics. 2013 Mar 22;14:106. doi: 10.1186/1471-2105-14-106.

Abstract

BACKGROUND

Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve random oversampling but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data.
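
To make the oversampling step concrete: the core of SMOTE is to create synthetic minority samples by interpolating between an existing minority sample and one of its k nearest minority-class neighbors. The NumPy sketch below illustrates only that interpolation step; it is not the implementation evaluated in the paper, and the function name and defaults are illustrative.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Illustrative SMOTE step: synthesize n_new points by interpolating
    between minority samples and their k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    n, p = X_min.shape
    # Pairwise Euclidean distances within the minority class only
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude each sample itself
    nn = np.argsort(d, axis=1)[:, :k]         # indices of the k nearest neighbors
    synth = np.empty((n_new, p))
    for i in range(n_new):
        j = rng.integers(n)                   # pick a minority sample at random
        neighbor = X_min[rng.choice(nn[j])]   # and one of its k neighbors
        u = rng.random()                      # interpolation weight in [0, 1)
        synth[i] = X_min[j] + u * (neighbor - X_min[j])
    return synth
```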

RESULTS

While SMOTE seems beneficial in most cases with low-dimensional data, for most classifiers it does not attenuate the bias towards classification into the majority class when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers on high-dimensional data if the number of variables is first reduced by some type of variable selection; we explain why, without variable selection, the k-NN classification is instead biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases the data variability and introduces correlation between samples. We explain how these findings affect class prediction for high-dimensional data.
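
These distributional effects can be checked numerically: synthetic points are convex combinations of existing minority samples, so their per-variable means stay close to the original class means while their variances shrink. A small check using the smote_sketch helper above, on toy data chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 1000                                              # high-dimensional: many variables
X_min = rng.normal(loc=1.0, scale=1.0, size=(20, p))  # small minority class

X_syn = smote_sketch(X_min, n_new=200, k=5, rng=1)

# Class-specific means are (approximately) preserved ...
print(np.abs(X_syn.mean(axis=0) - X_min.mean(axis=0)).mean())
# ... while the per-variable variability is reduced (ratio well below 1)
print(X_syn.var(axis=0).mean() / X_min.var(axis=0).mean())
```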

CONCLUSIONS

In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
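
The recommendation translates into a simple workflow: reduce the number of variables first, then oversample with SMOTE, and then fit a Euclidean k-NN with a relatively large number of neighbors. The sketch below assumes scikit-learn and imbalanced-learn are available; the toy data, the number of retained variables (40), and the choice of k = 11 are placeholders, not values from the paper.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 1000))             # toy high-dimensional data
y = np.array([0] * 50 + [1] * 10)           # class-imbalanced labels

# 1) Variable selection BEFORE oversampling
X_sel = SelectKBest(f_classif, k=40).fit_transform(X, y)

# 2) SMOTE on the reduced data to balance the classes
X_bal, y_bal = SMOTE(k_neighbors=5, random_state=0).fit_resample(X_sel, y)

# 3) Euclidean k-NN; a larger number of neighbors tends to help
knn = KNeighborsClassifier(n_neighbors=11, metric="euclidean").fit(X_bal, y_bal)
```

In a real analysis, the variable selection and resampling steps would be applied to the training data only (e.g., within cross-validation folds) to avoid an optimistic performance estimate.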

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/88e8/3648438/b6d46a503ff0/1471-2105-14-106-1.jpg
