Hung Yen University of Technology and Education, Viet Nam; University of Engineering and Technology, Vietnam National University, Hanoi, Viet Nam.
Hung Yen University of Technology and Education, Viet Nam.
J Biomed Inform. 2024 Aug;156:104674. doi: 10.1016/j.jbi.2024.104674. Epub 2024 Jun 11.
Biomedical named entity recognition (bio NER) is the task of recognizing named entities in biomedical texts. This paper introduces a new model that addresses bio NER by drawing on additional external contexts. Unlike prior methods, which mainly use the original input sequences for sequence labeling, the model uses additional contexts to enhance the representation of entities in the original sequences, because such contexts can supply information that helps explain the concepts behind biomedical entities.
To exploit an additional context, the model takes an original input sequence, first retrieves relevant sentences from PubMed, and then ranks the retrieved sentences to form the context. It next combines the context with the original input sequence to form a new enhanced sequence. Both the original and the enhanced sequences are fed into PubMedBERT to learn feature representations. To obtain more fine-grained features, the model stacks a BiLSTM layer on top of PubMedBERT. A CRF layer makes the final named entity label prediction. The model is trained jointly, end to end, so that the additional context benefits NER on the original sequence.
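The rank-and-combine step can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the candidate sentences have already been retrieved (an in-memory list stands in for PubMed), it ranks them by Jaccard token overlap because the abstract does not name a ranking function, and it joins the sequences with a `[SEP]` string as an assumed separator. The names `rank_sentences` and `build_enhanced_sequence` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_sentences(query, candidates, top_k=2):
    """Return the top_k candidates by Jaccard overlap with the query.

    Jaccard overlap is an assumption made for illustration; the abstract
    does not specify how retrieved sentences are ranked.
    """
    query_tokens = tokenize(query)

    def score(sentence):
        sentence_tokens = tokenize(sentence)
        union = query_tokens | sentence_tokens
        if not union:
            return 0.0
        return len(query_tokens & sentence_tokens) / len(union)

    # sorted() is stable, so ties keep their original retrieval order.
    return sorted(candidates, key=score, reverse=True)[:top_k]


def build_enhanced_sequence(query, candidates, top_k=2, sep=" [SEP] "):
    """Append the top-ranked context sentences to the original sequence.

    The separator string is an assumption, chosen to resemble the
    sentence-pair input format of BERT-style encoders.
    """
    context = " ".join(rank_sentences(query, candidates, top_k))
    return query + sep + context if context else query


if __name__ == "__main__":
    original = "BRCA1 mutations increase breast cancer risk."
    # Stand-in for sentences retrieved from PubMed.
    retrieved = [
        "The weather was mild in spring.",
        "BRCA1 is a tumor suppressor gene linked to breast cancer.",
        "Mutations in BRCA1 raise the risk of ovarian cancer.",
    ]
    print(build_enhanced_sequence(original, retrieved))
```

In the described model, both the original sequence and an enhanced sequence of this kind would then be encoded by PubMedBERT; the encoder, BiLSTM, and CRF layers are outside the scope of this sketch.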
Experimental results on six biomedical datasets show that the proposed model achieves promising performance compared to strong baselines and confirm the contribution of additional contexts to bio NER.
The promising results confirm three important points. First, the additional context from PubMed improves the recognition of biomedical entities. Second, PubMed is more appropriate than the Google search engine for providing relevant information for bio NER. Finally, more relevant sentences from the context are more beneficial than irrelevant ones in providing enhanced information for the original input sequences. The model is flexible enough to integrate any type of additional context for the NER task.