Luo Wenshui, Chen Shuo, Liu Tongliang, Han Bo, Niu Gang, Sugiyama Masashi, Tao Dacheng, Gong Chen
IEEE Trans Pattern Anal Mach Intell. 2025 Jan;47(1):305-322. doi: 10.1109/TPAMI.2024.3466182. Epub 2024 Dec 4.
Real-world data may contain a considerable number of noisily labeled examples, which usually mislead the training algorithm and degrade classification performance on test data. Label Noise Learning (LNL) was proposed to address this problem, and one popular line of research focuses on estimating critical statistics (e.g., the sample mean and sample covariance) to recover the clean data distribution. However, existing methods may suffer from an unreliable sample selection process or can hardly be applied to multi-class cases. Inspired by centroid estimation theory, we propose Per-Class Statistic Estimation (PCSE), which establishes the quantitative relationship between the clean (first-order and second-order) statistics and the corresponding noisy statistics for every class. This relationship is further used to induce a generative classifier for model inference. Unlike existing methods, our approach does not require sample selection at the instance level. Moreover, PCSE can serve as a general post-processing strategy applicable to various popular networks pre-trained on the noisy dataset to boost their classification performance. Theoretically, we prove that the estimated statistics converge to their ground-truth values as the sample size increases, even if the estimated label transition matrix is biased. Empirically, we conducted extensive experiments on various binary and multi-class datasets, and the results demonstrate that PCSE achieves more precise statistic estimation as well as higher classification accuracy than state-of-the-art methods in LNL.
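The relationship between clean and noisy per-class statistics mentioned in the abstract can be illustrated for the first-order case with a minimal sketch. This is not the authors' PCSE implementation. It assumes class-conditional label noise (the noisy label depends only on the clean label), a known transition matrix T with T[i, j] = P(noisy = j | clean = i), and known clean class priors. Under those assumptions each noisy class mean is a mixture of the clean class means, so the clean means follow from solving a small linear system. The function names estimate_clean_means, predict_nearest_mean, and make_noisy_data are hypothetical, and the nearest-mean rule is only a stand-in for a generative classifier (a Gaussian model with shared isotropic covariance and equal priors). Second-order statistics are not covered here.

```python
import numpy as np


def estimate_clean_means(X, noisy_y, T, clean_prior):
    """Recover clean per-class means from noisily labeled data.

    Assumption (not from the paper): class-conditional noise with a known
    transition matrix T[i, j] = P(noisy = j | clean = i) and known priors.
    """
    num_classes = T.shape[0]
    # Per-class means computed with the noisy labels, one row per noisy class.
    noisy_means = np.stack(
        [X[noisy_y == j].mean(axis=0) for j in range(num_classes)]
    )
    # joint[i, j] = P(clean = i, noisy = j)
    joint = clean_prior[:, None] * T
    # Bayes rule: Q[j, i] = P(clean = i | noisy = j)
    Q = (joint / joint.sum(axis=0, keepdims=True)).T
    # noisy_means = Q @ clean_means, so solve for clean_means.
    return np.linalg.solve(Q, noisy_means)


def predict_nearest_mean(X, means):
    """Assign each row of X to the class with the closest estimated mean."""
    sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)


def make_noisy_data(rng, n_per_class, true_means, T):
    """Draw unit-variance Gaussian classes and flip labels according to T."""
    num_classes = true_means.shape[0]
    clean_y = np.repeat(np.arange(num_classes), n_per_class)
    X = true_means[clean_y] + rng.normal(size=(clean_y.size, true_means.shape[1]))
    noisy_y = np.array([rng.choice(num_classes, p=T[c]) for c in clean_y])
    return X, clean_y, noisy_y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
    T = np.array([[0.7, 0.15, 0.15], [0.15, 0.7, 0.15], [0.15, 0.15, 0.7]])
    X, clean_y, noisy_y = make_noisy_data(rng, 4000, true_means, T)
    est = estimate_clean_means(X, noisy_y, T, np.full(3, 1 / 3))
    print("estimated clean means:\n", est)
    print("accuracy vs. clean labels:", (predict_nearest_mean(X, est) == clean_y).mean())
```

In this toy setting the means computed directly from the noisy labels are pulled toward one another, while the corrected estimates should land close to the true class centers. In practice the transition matrix would itself have to be estimated, which is the situation the convergence result in the abstract addresses.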