IEEE Trans Pattern Anal Mach Intell. 2016 Mar;38(3):462-75. doi: 10.1109/TPAMI.2015.2452921.
Improvement guarantees for semi-supervised classifiers can currently be given only under restrictive conditions on the data. We propose a general way to perform semi-supervised parameter estimation for likelihood-based classifiers for which the estimates, on the full training set, are never worse than the supervised solution in terms of the log-likelihood. We argue, moreover, that these solutions can be expected to genuinely improve upon the supervised classifier in particular cases. In a worked-out example for LDA, we go a step further and essentially prove that its semi-supervised version is strictly better than its supervised counterpart. The two new concepts at the core of our estimation principle are contrast and pessimism. The former refers to the fact that our objective function takes the supervised estimates into account, enabling the semi-supervised solution to explicitly control its potential improvement over them. The latter refers to the fact that our estimates are conservative and therefore resilient to whatever form the true labeling of the unlabeled data takes. Experiments demonstrate improvements in both the log-likelihood and the classification error rate on independent test sets.
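The contrastive-pessimistic idea can be illustrated on a toy one-dimensional, two-class Gaussian model with shared variance, a simplified stand-in for LDA. This is a hedged sketch, not the paper's actual algorithm: the synthetic data, equal class priors, the alternating max-min scheme, and the fixed iteration count are all assumptions made for illustration. The objective maximized is the contrast, i.e., the soft-label log-likelihood of the candidate parameters minus that of the supervised estimates, evaluated at the most pessimistic (worst-case) labeling of the unlabeled points; because the supervised solution itself achieves contrast zero, the returned solution is never worse than it on the training set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two 1-D Gaussians with shared variance (an "LDA-like" model).
x_lab = np.concatenate([rng.normal(-2, 1, 5), rng.normal(2, 1, 5)])
y_lab = np.array([1] * 5 + [0] * 5, dtype=float)       # q = 1 means class 1
x_unl = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
x_all = np.concatenate([x_lab, x_unl])

def class_logps(x, mu1, mu2, s2):
    """Per-class joint log-densities, equal class priors assumed."""
    c = -0.5 * np.log(2 * np.pi * s2) + np.log(0.5)
    return c - (x - mu1) ** 2 / (2 * s2), c - (x - mu2) ** 2 / (2 * s2)

def soft_mle(x, q):
    """Maximum-likelihood estimates under soft labels q (responsibilities)."""
    mu1 = np.sum(q * x) / np.sum(q)
    mu2 = np.sum((1 - q) * x) / np.sum(1 - q)
    s2 = np.sum(q * (x - mu1) ** 2 + (1 - q) * (x - mu2) ** 2) / len(x)
    return mu1, mu2, s2

# Supervised estimates from the labeled data only.
theta_sup = soft_mle(x_lab, y_lab)

def min_contrast(theta):
    """Contrast at the pessimistic (worst-case) labeling of x_unl.

    The contrast is linear in each unlabeled q_i, so its minimizer over
    [0, 1] sits at 0 or 1: pick whichever class makes the contrast smaller.
    """
    l1, l2 = class_logps(x_unl, *theta)
    s1, s2 = class_logps(x_unl, *theta_sup)
    q_unl = ((l1 - s1) < (l2 - s2)).astype(float)
    q = np.concatenate([y_lab, q_unl])
    a1, a2 = class_logps(x_all, *theta)
    b1, b2 = class_logps(x_all, *theta_sup)
    return np.sum(q * (a1 - b1) + (1 - q) * (a2 - b2)), q

# Alternate: pessimistic labeling for the current theta, then soft MLE on it,
# keeping the best min-contrast seen. Starting from theta_sup (contrast 0)
# mirrors the paper's guarantee: never worse than supervised in training
# log-likelihood, whatever the true labeling of the unlabeled data.
theta, best, best_theta = theta_sup, 0.0, theta_sup
for _ in range(25):
    val, q = min_contrast(theta)
    if val > best:
        best, best_theta = val, theta
    theta = soft_mle(x_all, q)

print(best)   # worst-case log-likelihood gain; >= 0 by construction
```

With well-separated clusters like these, the pessimistic semi-supervised estimates typically achieve a strictly positive worst-case gain, matching the abstract's claim that improvement can occur while degradation cannot.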