Wei Chao, Luo Senlin, Ma Xincheng, Ren Hao, Zhang Ji, Pan Limin
Beijing Institute of Technology, Beijing, 100081, China.
PLoS One. 2016 Jan 19;11(1):e0146672. doi: 10.1371/journal.pone.0146672. eCollection 2016.
Topic models and neural networks can discover meaningful low-dimensional latent representations of text corpora; as such, they have become a key technology for document representation. However, such models treat all documents as undifferentiated, so each document's latent representation depends on the entire corpus and fails to be discriminative. To address this problem, we propose a semi-supervised, manifold-inspired autoencoder that extracts meaningful latent representations of documents, adopting the local perspective that the latent representations of nearby documents should be correlated. We first determine the discriminative neighbor set of each document using Euclidean distance in the observation space. The autoencoder is then trained by jointly minimizing the Bernoulli cross-entropy error between input and output and the sum of squared errors between the output and the neighbors of the input. Results on two widely used corpora show that our method yields at least a 15% improvement in document clustering and a nearly 7% improvement in classification tasks over competing methods. The evidence demonstrates that our method readily captures more discriminative latent representations of new documents. Moreover, some meaningful combinations of words can be efficiently discovered by activating features, which promotes the comprehensibility of the latent representation.
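The training objective described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the network size, learning rate, the weight `lam` on the neighbor penalty, and the choice of same-label k-nearest neighbors as the "discriminative neighbor set" are all assumptions made for the sketch. It trains a one-hidden-layer sigmoid autoencoder by jointly minimizing the Bernoulli cross-entropy between input and reconstruction and the squared error between the reconstruction and the input's neighbors.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def same_label_knn(X, y, k):
    """Assumed neighbor rule: k nearest same-label rows by Euclidean distance."""
    nbrs = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf              # exclude the document itself
        d[y != y[i]] = np.inf      # exclude differently labeled documents
        nbrs.append(np.argsort(d)[:k])
    return nbrs

class ManifoldAE:
    def __init__(self, n_vis, n_hid, lam=0.05, lr=0.3):
        self.W1 = rng.normal(0, 0.1, (n_hid, n_vis))
        self.b1 = np.zeros(n_hid)
        self.W2 = rng.normal(0, 0.1, (n_vis, n_hid))
        self.b2 = np.zeros(n_vis)
        self.lam, self.lr = lam, lr

    def forward(self, x):
        h = sigmoid(self.W1 @ x + self.b1)
        return h, sigmoid(self.W2 @ h + self.b2)

    def loss(self, x, xhat, nbr_rows):
        eps = 1e-9
        bce = -np.sum(x * np.log(xhat + eps) + (1 - x) * np.log(1 - xhat + eps))
        manifold = sum(np.sum((xhat - n) ** 2) for n in nbr_rows)
        return bce + self.lam * manifold

    def step(self, x, nbr_rows):
        h, xhat = self.forward(x)
        # BCE with a sigmoid output: gradient w.r.t. pre-activation is xhat - x
        d_out = xhat - x
        # neighbor squared-error penalty: 2*lam*(xhat - n), through the sigmoid
        for n in nbr_rows:
            d_out = d_out + 2 * self.lam * (xhat - n) * xhat * (1 - xhat)
        d_hid = (self.W2.T @ d_out) * h * (1 - h)
        self.W2 -= self.lr * np.outer(d_out, h); self.b2 -= self.lr * d_out
        self.W1 -= self.lr * np.outer(d_hid, x); self.b1 -= self.lr * d_hid

# toy binary bag-of-words: two classes with distinct active terms
X = np.array([[1, 1, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 0, 1, 1, 1],
              [0, 0, 1, 1, 1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])
nbrs = same_label_knn(X, y, k=1)

ae = ManifoldAE(n_vis=6, n_hid=3)
before = sum(ae.loss(x, ae.forward(x)[1], X[nbrs[i]]) for i, x in enumerate(X))
for _ in range(200):
    for i, x in enumerate(X):
        ae.step(x, X[nbrs[i]])
after = sum(ae.loss(x, ae.forward(x)[1], X[nbrs[i]]) for i, x in enumerate(X))
```

After training, the joint objective (`after`) should be lower than at initialization (`before`), and same-class documents are pulled toward shared reconstructions, which is the mechanism the abstract credits for the more discriminative latent representations.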