Department of Computer Science and Technology, School of Mechanical Electronic and Information Engineering, China University of Mining and Technology, Beijing, 100083, China.
Department of Computer Science and Technology, School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China.
PLoS One. 2018 Jul 26;13(7):e0197933. doi: 10.1371/journal.pone.0197933. eCollection 2018.
Deep learning techniques such as Convolutional Neural Networks (CNNs) have been widely applied in information retrieval and natural language processing research. However, few research efforts have addressed semantic indexing with deep learning. The use of semantic indexing in the biomedical literature has been limited for several reasons. For instance, MEDLINE citations carry a large number of semantic labels from automatically annotated MeSH terms, and for much of the literature only the title and the abstract are readily available. In this paper, we propose a Boltzmann Convolutional neural network framework (B-CNN) for biomedical semantic indexing. In our hybrid learning framework, the CNN adaptively handles document features that have sequence relationships and captures context information accordingly; the Deep Boltzmann Machine (DBM) merges global information (the entities in each document) with local information through training with undirected connections. Additionally, we have designed a hierarchical coarse-to-fine indexing structure for learning and classifying documents, and a novel feature extension approach based on word sequence embedding and Wikipedia categorization. Comparative experiments on semantic indexing of biomedical abstracts showed encouraging performance for our B-CNN model.
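The coarse-to-fine indexing idea can be illustrated with a minimal sketch. It is not the B-CNN model: the two-level label tree, the cue words, and the keyword-overlap scorer (`LABEL_TREE`, `overlap_score`, `coarse_to_fine`) are hypothetical stand-ins for the learned CNN and DBM components, chosen only to show how a coarse decision narrows the set of fine labels that are scored.

```python
from typing import Dict, List, Set

# Hypothetical two-level label tree: coarse category -> fine label -> cue words.
# Labels and cue words are illustrative; they are not taken from the paper or MeSH.
LABEL_TREE: Dict[str, Dict[str, Set[str]]] = {
    "Diseases": {
        "Neoplasms": {"tumor", "cancer", "carcinoma"},
        "Infections": {"virus", "bacterial", "infection"},
    },
    "Chemicals": {
        "Proteins": {"protein", "enzyme", "kinase"},
        "Antibiotics": {"antibiotic", "penicillin", "resistance"},
    },
}


def overlap_score(tokens: Set[str], cues: Set[str]) -> int:
    """Toy scorer (an assumption): count cue words present in the document."""
    return len(tokens & cues)


def coarse_to_fine(text: str, top_k: int = 1) -> List[str]:
    """Pick one coarse category, then rank only the fine labels beneath it."""
    tokens = set(text.lower().split())

    # Coarse step: score each category by the pooled cue words of its children.
    coarse_scores = {
        coarse: overlap_score(tokens, set().union(*fine.values()))
        for coarse, fine in LABEL_TREE.items()
    }
    best_coarse = max(coarse_scores, key=coarse_scores.get)

    # Fine step: score only the labels under the selected coarse category.
    fine_scores = {
        label: overlap_score(tokens, cues)
        for label, cues in LABEL_TREE[best_coarse].items()
    }
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(fine_scores, key=lambda label: (-fine_scores[label], label))
    return [f"{best_coarse}/{label}" for label in ranked[:top_k]]


if __name__ == "__main__":
    print(coarse_to_fine("Kinase inhibitors shrink tumor growth in carcinoma patients"))
```

In a real system, each step would presumably be a trained classifier rather than a keyword count; the point of the sketch is only that the fine-grained step searches a much smaller label set once the coarse category is fixed.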