IEEE Trans Pattern Anal Mach Intell. 2022 Jun;44(6):2841-2855. doi: 10.1109/TPAMI.2020.3044997. Epub 2022 May 5.
In this paper, we propose a general framework, termed centroid estimation with guaranteed efficiency (CEGE), for weakly supervised learning (WSL) with incomplete, inexact, and inaccurate supervision. The core of the framework is an unbiased and statistically efficient risk estimator that is applicable to various forms of weak supervision. Specifically, by decomposing the loss function (e.g., the squared loss and hinge loss) into a label-independent term and a label-dependent term, we find that only the latter is influenced by the weak supervision and that it is related to the centroid of the entire dataset. We therefore construct two auxiliary pseudo-labeled datasets with synthesized labels and derive an unbiased estimate of the centroid from each of them. These two estimates are then linearly combined with a properly chosen coefficient, which makes the final combined estimate not only unbiased but also statistically efficient. This improves on existing methods that attend only to the unbiasedness of the estimate and ignore its statistical efficiency. The statistical efficiency of the derived estimator is guaranteed: we prove theoretically that it attains the minimum variance when estimating the centroid. Extensive experiments on a large number of benchmark datasets demonstrate that CEGE generally outperforms existing approaches on typical WSL problems, including semi-supervised learning, positive-unlabeled learning, multiple instance learning, and label noise learning.
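To make the loss decomposition concrete, the following is a minimal worked sketch for the squared loss only. It rests on assumptions the abstract does not state: binary labels y_i in {-1, +1}, a linear model f(x) = w^T x, the margin form of the squared loss, and the centroid taken to be the mean of y_i x_i. The paper's own definitions may differ.

```latex
% Assumptions (not from the abstract): y_i \in \{-1,+1\}, f(x) = w^\top x,
% squared loss in margin form, so that y_i^2 = 1 can be used.
\begin{align*}
\frac{1}{n}\sum_{i=1}^{n} \bigl(1 - y_i\, w^\top x_i\bigr)^2
  &= \underbrace{\frac{1}{n}\sum_{i=1}^{n} \bigl(1 + (w^\top x_i)^2\bigr)}_{\text{label-independent}}
   \;-\; \underbrace{2\, w^\top \mu}_{\text{label-dependent}}, \\
\mu &:= \frac{1}{n}\sum_{i=1}^{n} y_i x_i .
\end{align*}
```

Under these assumptions the labels enter the empirical risk only through the vector \mu, so an estimate of \mu obtained from weak supervision is enough to evaluate the risk.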
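The idea of linearly combining two unbiased estimates with a variance-minimizing coefficient can be illustrated with a small generic sketch. This is the textbook minimum-variance combination, not the paper's derivation: the function name `combine_min_variance`, the use of scalar variances, the independence of the two simulated estimates, and all numeric values are assumptions chosen for illustration.

```python
import numpy as np


def combine_min_variance(est_a, est_b, var_a, var_b, cov_ab=0.0):
    """Return alpha*est_a + (1 - alpha)*est_b with the variance-minimizing alpha.

    var_a, var_b and cov_ab are scalar (co)variances of the two unbiased
    estimates. Any alpha keeps the combination unbiased; this alpha minimizes
    alpha^2*var_a + (1 - alpha)^2*var_b + 2*alpha*(1 - alpha)*cov_ab.
    """
    denom = var_a + var_b - 2.0 * cov_ab
    if denom <= 0.0:
        raise ValueError("var_a + var_b - 2*cov_ab must be positive")
    alpha = (var_b - cov_ab) / denom
    combined = alpha * np.asarray(est_a) + (1.0 - alpha) * np.asarray(est_b)
    combined_var = (alpha ** 2 * var_a
                    + (1.0 - alpha) ** 2 * var_b
                    + 2.0 * alpha * (1.0 - alpha) * cov_ab)
    return combined, alpha, combined_var


# Hypothetical setup: a scalar "centroid" and two independent noisy,
# unbiased estimates of it with known variances 1.0 and 4.0.
rng = np.random.default_rng(0)
true_centroid = 1.5
var_a, var_b = 1.0, 4.0
n_trials = 20000

est_a = true_centroid + rng.normal(0.0, np.sqrt(var_a), size=n_trials)
est_b = true_centroid + rng.normal(0.0, np.sqrt(var_b), size=n_trials)

combined, alpha, combined_var = combine_min_variance(est_a, est_b, var_a, var_b)

empirical_mean = combined.mean()  # stays close to true_centroid (unbiased)
empirical_var = combined.var()    # close to combined_var, below both inputs
```

With independent estimates the coefficient reduces to var_b / (var_a + var_b), so the more precise estimate receives the larger weight and the combined variance var_a * var_b / (var_a + var_b) is smaller than either input variance.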