Yonsei University, Department of Computer Science, Republic of Korea.
J Biomed Inform. 2020 Mar;103:103381. doi: 10.1016/j.jbi.2020.103381. Epub 2020 Jan 28.
With the rapid advancement of technology and the need to process large amounts of data, biomedical Named Entity Recognition (NER) has become an essential technique for information extraction in the biomedical field. NER, a sequence-labeling task, has been addressed with a variety of techniques, including dictionary-, rule-, machine learning-, and deep learning-based methods. However, because existing biomedical NER models are insufficient to handle new and unseen entity types in the growing body of biomedical data, more effective and accurate biomedical NER models are being widely researched. Among biomedical NER models that use deep learning, only a few studies have addressed the design of high-level features in the embedding layer. Here, we propose a deep learning NER model that effectively represents biomedical word tokens through a combinatorial feature embedding. The proposed model is based on Bidirectional Long Short-Term Memory (bi-LSTM) with a Conditional Random Field (CRF) and is enhanced by integrating two different character-level representations, one extracted from a Convolutional Neural Network (CNN) and one from a bi-LSTM. Additionally, an attention mechanism is applied so that the model focuses on the relevant tokens in the sentence, which alleviates the long-term dependency problem of the LSTM and allows effective recognition of entities. The proposed model was evaluated on two benchmark datasets, JNLPBA and NCBI-Disease, and a comparative analysis with existing models was performed. The proposed model achieved relatively high performance on NCBI-Disease, with an F1-score of 86.93%, and competitive performance on JNLPBA, with an F1-score of 75.31%.
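To make the sequence-labeling formulation and the reported F1-scores concrete, the following is a minimal sketch of entity-level scoring. It assumes BIO-style tags and exact-match span scoring, neither of which is stated in the abstract; the official JNLPBA and NCBI-Disease evaluation scripts may differ in detail. The function names `bio_to_spans` and `entity_f1` and the example sentence are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Set, Tuple

# (start index, end index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence (assumed tagging scheme)."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    etype: Optional[str] = None
    # A trailing sentinel "O" flushes a span that runs to the end of the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.add((start, i, etype))
        start, etype = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # Lenient choice: a stray I- tag without a matching B- opens a new span.
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching (boundaries and type) spans."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence: "Mutations in BRCA1 cause breast cancer ."
gold_tags = [["O", "O", "O", "O", "B-Disease", "I-Disease", "O"]]
pred_tags = [["O", "O", "B-Disease", "O", "B-Disease", "I-Disease", "O"]]
score = entity_f1(gold_tags, pred_tags)
```

In this example the prediction recovers the one gold span and adds one spurious span, so precision is 0.5, recall is 1.0, and `score` is 2/3. Under exact-match scoring, a span with a wrong boundary or wrong type counts as both a false positive and a false negative.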