IEEE Trans Neural Netw Learn Syst. 2015 Mar;26(3):444-57. doi: 10.1109/TNNLS.2014.2315526.
When the amount of labeled data is limited, semisupervised learning can improve the learner's performance by also using the often easily available unlabeled data. In particular, a popular approach requires the learned function to be smooth on the underlying data manifold. By approximating this manifold as a weighted graph, such graph-based techniques can often achieve state-of-the-art performance. However, their high time and space complexities make them less attractive on large data sets. In this paper, we propose to scale up graph-based semisupervised learning using a set of sparse prototypes derived from the data. These prototypes serve as a small set of data representatives, which can be used to approximate the graph-based regularizer and to control model complexity. Consequently, both training and testing become much more efficient. Moreover, when the Gaussian kernel is used to define the graph affinity, a simple and principled method to select the prototypes can be obtained. Experiments on a number of real-world data sets demonstrate encouraging performance and scaling properties of the proposed approach. It also compares favorably with models learned via l1-regularization at the same level of model sparsity. These results demonstrate the efficacy of the proposed approach in producing highly parsimonious and accurate models for semisupervised learning.
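The core idea of the abstract — replacing the full n x n Gaussian-kernel affinity matrix with affinities to a small set of m prototypes — can be illustrated with a minimal sketch. This is not the paper's algorithm: the prototypes here are chosen by random subsampling purely for illustration (the paper derives a principled selection method), and the names `gaussian_affinity`, `U`, and `Z` are assumptions introduced for this example.

```python
import numpy as np

def gaussian_affinity(X, Y, sigma=1.0):
    # Pairwise Gaussian-kernel affinities between rows of X and rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                   # n = 1000 data points
U = X[rng.choice(1000, 20, replace=False)]       # m = 20 prototypes (illustrative: random subsample)

Z = gaussian_affinity(X, U)                      # n x m data-to-prototype affinities
W_approx = Z @ Z.T                               # low-rank surrogate for the n x n affinity matrix
```

Storing and manipulating `Z` costs O(nm) rather than the O(n^2) of the full graph affinity matrix, which is the source of the training and testing speedups the abstract describes.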