Cho Hyejin, Choi Wonjun, Lee Hyunju
School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, 123 Cheomdangwagi-ro, Buk-gu, Gwangju, Republic of Korea.
BMC Bioinformatics. 2017 Oct 13;18(1):451. doi: 10.1186/s12859-017-1857-8.
In biomedical text mining, named entity recognition (NER), which identifies entity names in text, is an important step in extracting biological knowledge from articles. After NER is applied, the next step is to normalize the identified names to standard concepts (e.g., disease names are mapped to the National Library of Medicine's Medical Subject Headings disease terms). Many entity normalization methods rely on domain-specific dictionaries to resolve synonyms and abbreviations. However, these dictionaries are not comprehensive except for a few entity types, such as genes. In recent years, biomedical articles have accumulated rapidly, and neural network-based algorithms that incorporate large amounts of unlabeled data have shown considerable success in several natural language processing problems.
In this study, we propose an approach for normalizing biological entities, such as disease names and plant names, that uses word embeddings to represent semantic spaces. For diseases, training data from the National Center for Biotechnology Information (NCBI) disease corpus and unlabeled data from PubMed abstracts were used to construct word representations. For plants, a training corpus that we manually constructed and unlabeled PubMed abstracts were used to construct word vectors. The proposed approach performed better than using only the training corpus or only the unlabeled data, and it improved normalization accuracy even when the dictionaries were not comprehensive. We obtained F-scores of 0.808 and 0.690 for normalizing the NCBI disease corpus and the manually constructed plant corpus, respectively. We further evaluated our approach on a data set from the disease normalization task of the BioCreative V challenge. When only the disease corpus was used as a dictionary, our approach significantly outperformed the best system of the task.
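The general idea of normalizing a mention through a semantic space can be illustrated with a minimal sketch. This is not the authors' model: it assumes a simple nearest-neighbor lookup in which a mention and each concept name are represented by the average of their word vectors and compared by cosine similarity. The vectors, the concept identifiers `C1` and `C2`, and the function names are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical 3-dimensional word vectors; the values are made up for illustration.
# In practice such vectors would be learned from a large corpus.
TOY_VECTORS = {
    "breast": [0.9, 0.1, 0.0],
    "carcinoma": [0.2, 0.8, 0.1],
    "neoplasms": [0.15, 0.85, 0.05],
    "heart": [0.0, 0.1, 0.9],
    "diseases": [0.1, 0.3, 0.3],
}

# Hypothetical concept dictionary: identifier -> preferred name.
CONCEPTS = {
    "C1": "breast neoplasms",
    "C2": "heart diseases",
}


def embed(phrase):
    """Average the vectors of the known tokens; return None if no token is known."""
    vectors = [TOY_VECTORS[t] for t in phrase.lower().split() if t in TOY_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def normalize(mention):
    """Return (concept_id, similarity) for the concept name closest to the mention."""
    mention_vec = embed(mention)
    if mention_vec is None:
        return None, 0.0
    best_id, best_score = None, -1.0
    for concept_id, name in CONCEPTS.items():
        name_vec = embed(name)
        if name_vec is None:
            continue
        score = cosine(mention_vec, name_vec)
        if score > best_score:
            best_id, best_score = concept_id, score
    return best_id, best_score


if __name__ == "__main__":
    print(normalize("breast carcinoma"))
```

In this toy setup, "breast carcinoma" does not appear in the dictionary, yet it is mapped to the concept named "breast neoplasms" because the two phrases lie close together in the vector space. This is the kind of synonym handling that an exact dictionary lookup would miss.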
The proposed approach shows robust performance for normalizing biological entities. The manually constructed plant corpus and the proposed model are available at http://gcancer.org/plant and http://gcancer.org/normalization, respectively.