Fujino Akinori, Ueda Naonori, Saito Kazumi
NTT Communication Science Laboratories, NTT Corporation, Soraku-Gun, Kyoto, Japan.
IEEE Trans Pattern Anal Mach Intell. 2008 Mar;30(3):424-37. doi: 10.1109/TPAMI.2007.70710.
This paper presents a method for designing semi-supervised classifiers trained on labeled and unlabeled samples. We focus on probabilistic semi-supervised classifier design for multi-class, single-labeled classification problems and propose a hybrid approach that combines the advantages of generative and discriminative approaches. We first consider a generative model trained on labeled samples and introduce a bias correction model; the two models belong to the same model family but have different parameters. We then construct a hybrid classifier by combining these models based on the maximum entropy principle. To apply the hybrid approach to text classification problems, we employ naive Bayes models as the generative and bias correction models. Experimental results on four text data sets confirmed that using a large number of unlabeled samples for training substantially improved the generalization ability of the hybrid classifier when there were too few labeled samples to obtain good performance. They also confirmed that the hybrid approach significantly outperformed the generative and discriminative approaches when the performance of those two approaches was comparable. Moreover, we examined the performance of the hybrid classifier when the labeled and unlabeled data distributions were different.
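The combination step can be illustrated with a minimal sketch. It assumes the log-linear form that a maximum entropy combination typically yields, in which the class posterior is proportional to the generative joint probability raised to a weight, times the bias correction joint probability raised to a second weight, times a per-class bias term. The function names, the weights `lam`, the biases `mu`, and the toy data are hypothetical and not taken from the paper. The bias correction parameters are set to a uniform placeholder here, whereas the abstract indicates they are learned with the help of unlabeled samples.

```python
import numpy as np

def fit_naive_bayes(X, y, n_classes, alpha=1.0):
    """Multinomial naive Bayes with Laplace smoothing (assumed estimator)."""
    counts = np.zeros((n_classes, X.shape[1]))
    for k in range(n_classes):
        counts[k] = X[y == k].sum(axis=0)
    smoothed = counts + alpha
    log_word_prob = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
    class_counts = np.bincount(y, minlength=n_classes) + alpha
    log_prior = np.log(class_counts / class_counts.sum())
    return log_prior, log_word_prob

def log_joint(X, log_prior, log_word_prob):
    """log p(x, y) for every document (row) and class (column)."""
    return X @ log_word_prob.T + log_prior

def hybrid_posterior(log_joint_gen, log_joint_bias, lam, mu):
    """Assumed log-linear combination, normalized over classes."""
    scores = lam[0] * log_joint_gen + lam[1] * log_joint_bias + mu
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(scores)
    return p / p.sum(axis=1, keepdims=True)

# Toy labeled data: two documents, three vocabulary words, two classes.
X_lab = np.array([[3, 0, 1], [0, 2, 1]])
y_lab = np.array([0, 1])

# Generative model fitted on the labeled samples.
gen_prior, gen_words = fit_naive_bayes(X_lab, y_lab, n_classes=2)

# Placeholder bias correction model: same family, different parameters.
# Uniform values are used only to keep the sketch self-contained.
bias_prior = np.log(np.full(2, 0.5))
bias_words = np.log(np.full((2, 3), 1.0 / 3.0))

# Hypothetical combination weights and per-class biases.
lam = np.array([1.0, 1.0])
mu = np.zeros(2)

P = hybrid_posterior(
    log_joint(X_lab, gen_prior, gen_words),
    log_joint(X_lab, bias_prior, bias_words),
    lam,
    mu,
)
```

In this sketch, setting `lam = [1, 0]` and `mu = 0` reduces the hybrid to the plain naive Bayes posterior, which makes the role of the combination weights easy to inspect.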