Locally Embedding Autoencoders: A Semi-Supervised Manifold Learning Approach of Document Representation.

Authors

Wei Chao, Luo Senlin, Ma Xincheng, Ren Hao, Zhang Ji, Pan Limin

Affiliation

Beijing Institute of Technology, Beijing, 100081, China.

Publication information

PLoS One. 2016 Jan 19;11(1):e0146672. doi: 10.1371/journal.pone.0146672. eCollection 2016.

Abstract

Topic models and neural networks can discover meaningful low-dimensional latent representations of text corpora, and they have therefore become key technologies for document representation. However, such models treat all documents as non-discriminatory, so the latent representation of each document depends on all other documents and the models cannot provide discriminative document representations. To address this problem, we propose a semi-supervised, manifold-inspired autoencoder that extracts meaningful latent representations of documents, taking the local perspective that the latent representations of nearby documents should be correlated. We first determine the set of discriminative neighbors using Euclidean distance in the observation space. The autoencoder is then trained by jointly minimizing the Bernoulli cross-entropy error between input and output and the sum of squared errors between the neighbors of the input and the output. Results on two widely used corpora show that our method yields at least a 15% improvement in document clustering and a nearly 7% improvement in classification compared with the baseline methods. The evidence demonstrates that our method can readily capture more discriminative latent representations of new documents. Moreover, some meaningful combinations of words can be efficiently discovered by activating features, which promotes the comprehensibility of the latent representation.
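
The two steps named in the abstract (neighbor selection by Euclidean distance, then a joint objective) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The single-layer, tied-weight sigmoid autoencoder, the weighting factor `lam`, the function names, and the reading of the neighbor term as "each reconstruction should also be close to the inputs of its neighbors" are all assumptions made for illustration. The abstract does not say how labels enter the neighbor selection, so this sketch ignores labels entirely.

```python
import numpy as np

def nearest_neighbors(X, k):
    """Indices of the k nearest rows (Euclidean distance) for each row of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                      # a document is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_loss(X, W, b, c, neighbors, lam=1.0, eps=1e-9):
    """Bernoulli cross-entropy reconstruction error plus a neighbor squared-error term.

    X         : (n, d) inputs scaled to [0, 1]
    W, b, c   : tied weights (d, h), hidden bias (h,), output bias (d,)  [assumed architecture]
    neighbors : (n, k) neighbor indices, e.g. from nearest_neighbors
    lam       : weight of the neighbor term  [hypothetical parameter]
    """
    H = sigmoid(X @ W + b)        # latent codes
    X_hat = sigmoid(H @ W.T + c)  # reconstructions in (0, 1)
    # Cross-entropy between each input and its own reconstruction.
    ce = -np.sum(X * np.log(X_hat + eps) + (1.0 - X) * np.log(1.0 - X_hat + eps))
    # Squared error between each reconstruction and the inputs of its neighbors.
    diff = X_hat[:, None, :] - X[neighbors]  # shape (n, k, d)
    se = np.sum(diff ** 2)
    return ce + lam * se
```

Setting `lam=0.0` reduces the objective to an ordinary autoencoder loss, which makes it easy to compare the effect of the neighbor term under these assumptions.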

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9b1/4718658/1aeb9ba6e8c1/pone.0146672.g001.jpg
