Institute of Artificial Intelligence, Beihang University, Beijing 100191, China.
SKLSDE, School of Computer Science, Beihang University, Beijing 100191, China.
Bioinformatics. 2022 Aug 10;38(16):3976-3983. doi: 10.1093/bioinformatics/btac422.
Biomedical Named Entity Recognition (BioNER) aims to identify biomedical domain-specific entities (e.g. gene, chemical and disease) from unstructured texts. Although deep learning-based methods for BioNER achieve satisfactory results, there is still much room for improvement. First, most existing methods use independent sentences as training units and ignore inter-sentence context, which often leads to the labeling inconsistency problem. Second, previous document-level BioNER works have shown that inter-sentence information is essential, but what information should be regarded as context remains ambiguous. Moreover, few pre-training-based BioNER models incorporate inter-sentence information. Hence, we propose a cache-based inter-sentence model called BioNER-Cache to alleviate these problems.
We propose a simple but effective dynamic caching module to capture inter-sentence information for BioNER. Specifically, the cache stores recent hidden representations under predefined caching rules, and the model uses a query-and-read mechanism to retrieve similar historical records from the cache as local context. An attention-based gated network then combines this context with BioBERT features to generate context-aware representations. To update the cache dynamically, we design a scoring function and adopt a multi-task approach to jointly train the model. We build a comprehensive benchmark on four biomedical datasets to evaluate model performance fairly. Extensive experiments validate the superiority of BioNER-Cache over various state-of-the-art intra-sentence and inter-sentence baselines.
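The caching mechanism described above can be illustrated with a minimal sketch. All names here (`DynamicCache`, `gated_fusion`, the FIFO eviction, cosine-similarity retrieval and mean pooling) are illustrative assumptions, not the paper's actual implementation, which uses learned scoring and an attention-based gate over BioBERT features.

```python
import numpy as np

class DynamicCache:
    """Hypothetical fixed-size cache of recent hidden representations.

    The paper's cache additionally applies predefined caching rules and a
    learned scoring function before admitting or evicting entries; this
    sketch uses simple FIFO eviction instead.
    """

    def __init__(self, capacity, dim):
        self.capacity = capacity
        self.dim = dim
        self.entries = []  # recent hidden vectors, oldest first

    def write(self, hidden):
        if len(self.entries) >= self.capacity:
            self.entries.pop(0)  # evict the oldest record
        self.entries.append(hidden)

    def read(self, query, top_k=2):
        """Query-and-read: return the top-k cached records most similar
        to the query vector (cosine similarity stands in for the paper's
        retrieval scoring)."""
        if not self.entries:
            return np.zeros((0, self.dim))
        keys = np.stack(self.entries)
        sims = keys @ query / (
            np.linalg.norm(keys, axis=1) * np.linalg.norm(query) + 1e-8
        )
        idx = np.argsort(-sims)[:top_k]
        return keys[idx]

def gated_fusion(hidden, context, w_gate):
    """Toy gate blending a sentence representation with retrieved context
    (the paper uses an attention-based gated network instead)."""
    if context.shape[0] == 0:
        return hidden
    ctx = context.mean(axis=0)  # simple pooling over retrieved records
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ np.concatenate([hidden, ctx]))))
    return gate * hidden + (1.0 - gate) * ctx

# Usage: write six vectors (only the four most recent survive),
# then retrieve context for a new sentence representation and fuse.
rng = np.random.default_rng(0)
cache = DynamicCache(capacity=4, dim=8)
for _ in range(6):
    cache.write(rng.standard_normal(8))
h = rng.standard_normal(8)
ctx = cache.read(h, top_k=2)
fused = gated_fusion(h, ctx, rng.standard_normal(16))
print(fused.shape)  # (8,)
```

The fixed capacity bounds memory while keeping the most recent document context available; retrieval by similarity means only records relevant to the current sentence influence its labels.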
Code will be available at https://github.com/zgzjdx/BioNER-Cache.
Supplementary data are available at Bioinformatics online.