Huang Zhan Ao, Sang Yongsheng, Sun Yanan, Lv Jiancheng
IEEE Trans Neural Netw Learn Syst. 2024 Jul;35(7):9252-9266. doi: 10.1109/TNNLS.2022.3231917. Epub 2024 Jul 8.
Most real-world data are imbalanced. Neural networks are one of the classic models for handling imbalanced data, but the imbalance often causes a network to favor the negative class. One way to alleviate this is to reconstruct a balanced dataset by undersampling. However, most existing undersampling methods either focus more on the data or aim to preserve the overall structural characteristics of the negative class through potential energy estimation, while the problems of gradient inundation and insufficient empirical representation of positive samples have not been well considered. We therefore propose a new paradigm for solving the data imbalance problem. Specifically, to address gradient inundation, an informative undersampling strategy is derived from the performance degradation and used to restore the ability of neural networks to work under imbalanced data. In addition, to alleviate the insufficient empirical representation of positive samples, we use a boundary expansion strategy that combines linear interpolation with a prediction consistency constraint. We tested the proposed paradigm on 34 imbalanced datasets with imbalance ratios ranging from 16.90 to 100.14. The results show that our paradigm obtained the best area under the receiver operating characteristic curve (AUC) on 26 of these datasets.
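To make the two ingredients named in the abstract concrete, the following is a minimal sketch and not the authors' method. It assumes label 1 for the positive class and label 0 for the negative class, and it takes the imbalance ratio to be the number of negatives divided by the number of positives, a common convention that the abstract does not define. Plain random undersampling stands in for the informative undersampling strategy, which the abstract does not specify, and the interpolation step omits the prediction consistency constraint. The function names and the toy data are hypothetical.

```python
import numpy as np


def imbalance_ratio(y):
    """Negatives per positive (assumed definition; label 1 = positive)."""
    return np.sum(y == 0) / np.sum(y == 1)


def random_undersample(X, y, rng):
    """Keep every positive and an equally sized random subset of negatives.

    Plain random undersampling, used here only as a stand-in for an
    informative selection rule.
    """
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, keep_neg])
    return X[idx], y[idx]


def interpolate_positives(X_pos, n_new, rng):
    """Draw synthetic positives on segments between random pairs of positives."""
    i = rng.integers(0, len(X_pos), size=n_new)
    j = rng.integers(0, len(X_pos), size=n_new)
    lam = rng.uniform(0.0, 1.0, size=(n_new, 1))  # one mixing weight per sample
    return lam * X_pos[i] + (1.0 - lam) * X_pos[j]


# Toy data (hypothetical): 1000 negatives and 20 positives in two dimensions.
rng = np.random.default_rng(0)
X_neg = rng.normal(0.0, 1.0, size=(1000, 2))
X_pos = rng.normal(3.0, 1.0, size=(20, 2))
X = np.vstack([X_neg, X_pos])
y = np.concatenate([np.zeros(1000, dtype=int), np.ones(20, dtype=int)])

print("imbalance ratio:", imbalance_ratio(y))  # 1000 / 20 = 50.0

X_bal, y_bal = random_undersample(X, y, rng)
synthetic = interpolate_positives(X[y == 1], n_new=30, rng=rng)
print("balanced set:", X_bal.shape, "synthetic positives:", synthetic.shape)
```

Because each synthetic point is a convex combination of two existing positives, it always lies on the segment between them. This sketch therefore only fills in the region the positives already span. Pushing the positive region outward, as "boundary expansion" suggests, would need an additional rule that the abstract does not describe.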