Graduate School of Knowledge Service Engineering, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, South Korea.
Department of Industrial & Systems Engineering, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, South Korea.
BMC Bioinformatics. 2020 Feb 11;21(1):53. doi: 10.1186/s12859-020-3393-1.
Biomedical named-entity recognition (BioNER) is widely modeled with conditional random fields (CRFs) by treating it as a sequence labeling problem. CRF-based methods yield structured label outputs by imposing connectivity between labels. Recent BioNER studies have reported state-of-the-art performance by combining deep learning-based models (e.g., bidirectional Long Short-Term Memory) with a CRF. In these methods, the deep learning-based models are dedicated to estimating individual labels, whereas the relationships between connected labels are described as static numbers; consequently, the context of a given input sentence cannot be reflected when generating the most plausible label-label transitions. Moreover, correctly segmenting entity mentions in biomedical texts is challenging because biomedical terms are often descriptive and long compared with general terms. Therefore, restricting the label-label transitions to static numbers is a bottleneck for improving BioNER performance.
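For reference, the static-transition setup described above can be written in the standard linear-chain CRF form. The notation is assumed here for illustration and is not taken from the paper: $x$ is the input sentence of $n$ tokens, $y$ a label sequence, $U_i(y_i; x)$ the per-token label score produced by the deep learning-based model, and $A$ a single learned transition matrix shared across all positions and sentences.

```latex
s(x, y) = \sum_{i=1}^{n} U_i(y_i; x) + \sum_{i=2}^{n} A_{y_{i-1},\, y_i},
\qquad
p(y \mid x) = \frac{\exp s(x, y)}{\sum_{y'} \exp s(x, y')}
```

Only the first sum depends on the input sentence; the second sum uses the same entries of $A$ regardless of context, which is the limitation the abstract refers to as "static numbers".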
We introduce DTranNER, a novel CRF-based framework that incorporates a deep learning-based label-label transition model into BioNER. DTranNER uses two separate deep learning-based networks: Unary-Network and Pairwise-Network. The former models the input to determine individual labels, and the latter explores the context of the input to describe the label-label transitions. We performed experiments on five benchmark BioNER corpora. Compared with current state-of-the-art methods, DTranNER achieves the best F1-score of 84.56% (surpassing 84.40%) on the BioCreative II gene mention (BC2GM) corpus; the best F1-score of 91.99% (surpassing 91.41%) on the BioCreative IV chemical and drug (BC4CHEMD) corpus; the best F1-scores of 94.16% (surpassing 93.44%) for chemical NER and 87.22% (surpassing 86.56%) for disease NER on the BioCreative V chemical disease relation (BC5CDR) corpus; and a near-best F1-score of 88.62% on the NCBI-Disease corpus.
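The difference between static and context-dependent transitions can be illustrated with a minimal decoding sketch. This is not DTranNER's implementation: the function name `viterbi`, the three-label scheme (O, B, I), and all score values are hypothetical, and the hand-written score tables stand in for what the Unary-Network and Pairwise-Network would produce. The sketch only assumes that unary scores are given per token and pairwise scores are given per adjacent token pair, so the transition table is allowed to change from one position to the next.

```python
from typing import List, Sequence


def viterbi(unary: Sequence[Sequence[float]],
            pairwise: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Return the highest-scoring label sequence.

    unary[t][k]       : score of label k at token t
    pairwise[t][j][k] : score of label j at token t followed by label k at
                        token t + 1 (one table per adjacent token pair, so
                        transitions may differ by position)
    """
    n = len(unary)
    if n == 0:
        return []
    num_labels = len(unary[0])
    best = list(unary[0])           # best score of any path ending in each label
    back: List[List[int]] = []      # back-pointers, one list per token after the first
    for t in range(1, n):
        trans = pairwise[t - 1]
        new_best, pointers = [], []
        for k in range(num_labels):
            j_star = max(range(num_labels), key=lambda j: best[j] + trans[j][k])
            new_best.append(best[j_star] + trans[j_star][k] + unary[t][k])
            pointers.append(j_star)
        best = new_best
        back.append(pointers)
    # Follow back-pointers from the best final label.
    path = [max(range(num_labels), key=lambda k: best[k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical label indices: O = 0, B = 1, I = 2.
O, B, I = 0, 1, 2
# Hypothetical unary scores for a 3-token sentence; token 1 is ambiguous (O vs. I).
unary = [[0.0, 2.0, 0.0],
         [1.0, 0.0, 0.9],
         [1.0, 0.0, 0.0]]

zeros = [[0.0] * 3 for _ in range(3)]

# Static case: the same (here all-zero) transition table at every position.
static_pairwise = [zeros, zeros]

# Context-dependent case: only between tokens 0 and 1 is B -> I favoured.
first_pair = [row[:] for row in zeros]
first_pair[B][I] = 0.5
dynamic_pairwise = [first_pair, zeros]

print(viterbi(unary, static_pairwise))   # [1, 0, 0]  i.e. B O O
print(viterbi(unary, dynamic_pairwise))  # [1, 2, 0]  i.e. B I O
```

In this toy example the position-specific pairwise score extends the entity mention by one token, which is the kind of segmentation decision the abstract argues a static transition table cannot adapt to per sentence.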
Our results indicate that incorporating the deep learning-based label-label transition model provides distinctive contextual clues that enhance BioNER over the static transition model. We demonstrate that the proposed framework enables the dynamic transition model to adaptively explore the contextual relations between adjacent labels in a fine-grained way. We expect that our study can be a stepping stone for further advances in biomedical literature mining.