Li Yongbin, Hui Linhu, Zou Liping, Li Huyang, Xu Luo, Wang Xiaohua, Chua Stephanie
School of Medical Information Engineering, Zunyi Medical University, Zunyi, China.
Faculty of Computer Science and Information Technology, University Malaysia Sarawak, Sarawak, Malaysia.
JMIR Med Inform. 2022 Oct 20;10(10):e41136. doi: 10.2196/41136.
With the rapid expansion of the biomedical literature, biomedical information extraction has attracted increasing attention from researchers. In particular, relation extraction between 2 entities is a long-standing research topic.
This study aimed to address 2 multiclass relation extraction tasks from the Biomedical Natural Language Processing Workshop 2019 Open Shared Tasks: the Bacteria-Biotope relation extraction (BB-rel) task and the binary relation extraction of plant seed development (SeeDev-binary) task. In essence, both tasks require extracting the relation between annotated entity pairs from biomedical texts, which is a challenging problem.
Traditional approaches relied on feature- or kernel-based methods and achieved good performance. For these tasks, we propose a deep learning model that combines several distributed features, such as domain-specific word embedding, part-of-speech embedding, entity-type embedding, distance embedding, and position embedding. A multi-head attention mechanism is used to extract the global semantic features of the entire sentence. In addition, we introduce a dependency-type feature and the shortest dependency path connecting the 2 candidate entities in the syntactic dependency graph to enrich the feature representation.
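The shortest dependency path between 2 candidate entities can be illustrated with a small sketch. This is not the authors' implementation: it assumes the dependency parse is already available as (head, dependent, dependency type) triples, treats the graph as undirected, and uses a plain breadth-first search. The function name `shortest_dependency_path`, the example sentence, and its parse are hypothetical.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return (token indices, dependency types) along the shortest path.

    edges: iterable of (head, dependent, dep_type) triples from a parser.
    The graph is treated as undirected, a common simplification that is
    assumed here rather than taken from the paper.
    """
    graph = {}
    for head, dependent, dep_type in edges:
        graph.setdefault(head, []).append((dependent, dep_type))
        graph.setdefault(dependent, []).append((head, dep_type))

    # Breadth-first search, remembering how each token was reached.
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbour, dep_type in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = (node, dep_type)
                queue.append(neighbour)

    if target not in previous:
        return [], []  # the two tokens are not connected

    # Walk back from the target to the source, then reverse.
    tokens, dep_types = [target], []
    node = target
    while previous[node] is not None:
        node, dep_type = previous[node]
        tokens.append(node)
        dep_types.append(dep_type)
    return tokens[::-1], dep_types[::-1]


# Hypothetical sentence and parse, for illustration only.
words = ["Lactobacillus", "was", "isolated", "from", "fermented", "milk"]
edges = [
    (2, 0, "nsubj:pass"),
    (2, 1, "aux:pass"),
    (2, 5, "obl"),
    (5, 3, "case"),
    (5, 4, "amod"),
]
path, types = shortest_dependency_path(edges, source=0, target=5)
path_words = [words[i] for i in path]
```

In a model of the kind described, the tokens on this path and the dependency types along it would each be mapped to embeddings and fed to the network alongside the sentence-level features.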
Experiments show that our proposed model performs well in biomedical relation extraction, achieving F scores of 65.56% and 38.04% on the test sets of the BB-rel and SeeDev-binary tasks, respectively. On the SeeDev-binary task in particular, the F score of our model exceeds that of other existing models, achieving state-of-the-art performance.
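For readers unfamiliar with the metric, the sketch below shows how an F score is computed from prediction counts. The abstract does not state which variant was used; this sketch assumes the balanced F1 score (the harmonic mean of precision and recall). The function name `f_score` and the counts are hypothetical and are not the study's data.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Balanced F1 score computed from raw counts (assumed variant)."""
    predicted = true_positives + false_positives
    gold = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / gold if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 50 correct relations, 25 spurious, 50 missed.
example = f_score(true_positives=50, false_positives=25, false_negatives=50)
```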
We demonstrated that the multi-head attention mechanism can learn relevant syntactic and semantic features in different representation subspaces and at different positions to extract a comprehensive feature representation. Moreover, syntactic dependency features can improve the performance of the model by learning the dependency relations between entities in biomedical texts.
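The idea of attending in several representation subspaces can be sketched with standard scaled dot-product multi-head attention. This is a minimal, dependency-free illustration rather than the authors' model: the dimensions are arbitrary, random matrices stand in for learned projection weights, and every name here (`attention_head`, `multi_head_attention`, `random_matrix`) is hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v):
    """One head: project into a subspace, then scaled dot-product attention."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(x, heads, w_o):
    """Run every head, concatenate per token, apply the output projection."""
    outputs = [attention_head(x, *head)[0] for head in heads]
    concatenated = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(concatenated, w_o)


def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, n_heads = 5, 8, 2  # arbitrary sizes for illustration
d_head = d_model // n_heads
x = random_matrix(seq_len, d_model, rng)  # stand-in for token feature vectors
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(n_heads)
]
w_o = random_matrix(n_heads * d_head, d_model, rng)
output = multi_head_attention(x, heads, w_o)
```

Each head has its own query, key, and value projections, so it scores token pairs in a different subspace; concatenating the heads lets the output mix those views for every position in the sentence.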