IEEE Trans Pattern Anal Mach Intell. 2011 Oct;33(10):2093-103. doi: 10.1109/TPAMI.2011.45. Epub 2011 Mar 10.
Semi-supervised classification, which trains on both labeled and unlabeled observations, can yield improved performance compared with a classifier trained on the labeled observations alone. Unlabeled observations are always beneficial to classification if the assumed model is correct. However, they may degrade classifier performance when the model is misspecified. In the classical classification problem setting, many factors affect semi-supervised performance, including the training data, the model specification, the estimation method, and the classifier itself. For concreteness, we consider maximum likelihood estimation in finite mixture models together with the Bayes plug-in classifier, because both are ubiquitous and tractable. In this specific setting, we examine the effect of model misspecification on semi-supervised classification performance and shed some light on when and why performance degradation occurs.
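The setting named in the abstract (maximum likelihood estimation in a finite mixture model, followed by a Bayes plug-in classifier) can be illustrated with a minimal sketch. This is not the authors' code. It assumes a one-dimensional, two-class mixture with Gaussian components, a choice the abstract does not specify, and fits it with EM, keeping the class memberships of labeled points fixed. The names `fit_semi_supervised`, `plug_in_classify`, and `normal_pdf` are hypothetical and introduced only for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_semi_supervised(labeled, unlabeled, n_iter=100):
    """EM for a two-class 1-D Gaussian mixture (assumed model, for illustration).

    labeled:   list of (x, y) pairs with y in {0, 1}; each class needs >= 1 point
    unlabeled: list of x values
    Returns [[prior, mean, var], [prior, mean, var]] for classes 0 and 1.
    """
    # Initialise from the labeled data only (the supervised estimate).
    params = []
    for k in (0, 1):
        xs = [x for x, y in labeled if y == k]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        params.append([len(xs) / len(labeled), mean, var if var > 0 else 1.0])

    for _ in range(n_iter):
        # E-step: labeled points keep their known class; unlabeled points
        # receive posterior class probabilities under the current parameters.
        resp = [(x, (1.0, 0.0) if y == 0 else (0.0, 1.0)) for x, y in labeled]
        for x in unlabeled:
            w = [p * normal_pdf(x, m, v) for p, m, v in params]
            total = sum(w)
            if total == 0.0:  # numerical underflow: fall back to equal weights
                resp.append((x, (0.5, 0.5)))
            else:
                resp.append((x, (w[0] / total, w[1] / total)))

        # M-step: weighted maximum likelihood updates over all observations.
        n = len(resp)
        for k in (0, 1):
            nk = sum(r[k] for _, r in resp)
            mean = sum(r[k] * x for x, r in resp) / nk
            var = sum(r[k] * (x - mean) ** 2 for x, r in resp) / nk
            params[k] = [nk / n, mean, max(var, 1e-6)]  # floor avoids zero variance
    return params


def plug_in_classify(x, params):
    """Bayes plug-in rule: choose the class with the larger estimated prior * density."""
    scores = [p * normal_pdf(x, m, v) for p, m, v in params]
    return 0 if scores[0] >= scores[1] else 1


# Small hand-made data set (hypothetical values, not from the paper).
labeled = [(-2.5, 0), (-1.5, 0), (1.5, 1), (2.5, 1)]
unlabeled = [-3.0, -2.2, -2.0, -1.8, -1.0, 1.0, 1.8, 2.0, 2.2, 3.0]
params = fit_semi_supervised(labeled, unlabeled)
print(params)
print(plug_in_classify(-0.5, params), plug_in_classify(0.5, params))
```

In this sketch the unlabeled points influence the fit only through the assumed Gaussian form. If the true class-conditional densities were not Gaussian, the same updates could pull the estimates away from the values that give the best decision boundary, which is the kind of degradation under misspecification that the abstract refers to.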