School of Information Science and Engineering, Shandong Normal University, Jinan, China.
School of Data and Computer Science, Shandong Women's University, Jinan, China.
Neural Netw. 2023 Nov;168:531-538. doi: 10.1016/j.neunet.2023.09.043. Epub 2023 Sep 29.
Advances in biomedical technology have recently produced a large amount of textual data in the biomedical domain. Distant supervision makes it possible to obtain large-scale biomedical data automatically. However, the noisy data introduced by distant supervision makes relation extraction more difficult. Previous work has focused mainly on restoring mislabeled relations and has paid little attention to the importance of labeled entity positions for relation extraction. In this paper, we present a "four-stage" model based on BioBERT and multi-instance learning that uses entity position markers. First, entity positions are marked in each sentence. Second, BioBERT, a biomedical pre-trained language model, encodes the sentence, and the final sentence feature vector is built not only from the global position marker but also from the start and end markers of both the head and tail entities. Third, the sentence vectors in each bag are aggregated into a bag-level feature vector using three aggregation methods, and the performance of different sentence feature vectors combined with different bag encoding methods is discussed. Finally, relation classification is performed at the bag level. Experimental results show that the presented model significantly outperforms all baseline models and contributes to noise reduction. In addition, each bag encoding method must be matched with a corresponding sentence encoding representation to achieve the best performance.
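The marking and bag-aggregation stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The marker tokens `[E1]`, `[/E1]`, `[E2]`, `[/E2]` and the function names `mark_entities` and `aggregate_bag` are hypothetical. The abstract does not name its three aggregation methods, so mean, max, and attention-weighted pooling are assumed here as common choices in multi-instance learning. Sentence vectors are plain Python lists standing in for BioBERT outputs.

```python
import math


def mark_entities(tokens, head_span, tail_span):
    """Wrap the head and tail entity spans with start/end marker tokens.

    Spans are (start, end) token indices with end exclusive and are
    assumed not to overlap. Marker names are illustrative only.
    """
    out = []
    for i, tok in enumerate(tokens):
        if i == head_span[0]:
            out.append("[E1]")
        if i == tail_span[0]:
            out.append("[E2]")
        out.append(tok)
        if i == head_span[1] - 1:
            out.append("[/E1]")
        if i == tail_span[1] - 1:
            out.append("[/E2]")
    return out


def aggregate_bag(sentence_vectors, method="mean", query=None):
    """Combine the sentence vectors of one bag into a single bag vector.

    The three methods are assumptions, not taken from the paper:
    "mean", "max", and "attention" (softmax over dot products with query).
    """
    n = len(sentence_vectors)
    dim = len(sentence_vectors[0])
    if method == "mean":
        return [sum(v[d] for v in sentence_vectors) / n for d in range(dim)]
    if method == "max":
        return [max(v[d] for v in sentence_vectors) for d in range(dim)]
    if method == "attention":
        if query is None:
            raise ValueError("attention pooling needs a query vector")
        scores = [sum(a * b for a, b in zip(v, query)) for v in sentence_vectors]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        return [
            sum(w * v[d] for w, v in zip(weights, sentence_vectors))
            for d in range(dim)
        ]
    raise ValueError(f"unknown aggregation method: {method}")


if __name__ == "__main__":
    tokens = ["aspirin", "inhibits", "COX-1", "activity"]
    print(mark_entities(tokens, head_span=(0, 1), tail_span=(2, 3)))

    bag = [[1.0, 0.0], [3.0, 4.0]]  # two toy sentence vectors in one bag
    print(aggregate_bag(bag, "mean"))
    print(aggregate_bag(bag, "max"))
    print(aggregate_bag(bag, "attention", query=[0.0, 0.0]))
```

In a full pipeline, the marked token sequence would be passed to the encoder, and the bag vector would feed a bag-level relation classifier. Both steps are omitted here.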