Lai Po-Ting, Wei Chih-Hsuan, Tian Shubo, Leaman Robert, Lu Zhiyong
ArXiv. 2025 Jan 23:arXiv:2501.14079v1.
Biological relation networks contain rich information for understanding the biological mechanisms behind the relationships among entities such as genes, proteins, diseases, and chemicals. The rapid growth of biomedical literature makes it difficult to keep this network knowledge up to date. The recent Biomedical Relation Extraction Dataset (BioRED) provides valuable manual annotations, facilitating the development of machine-learning and pre-trained language model approaches for automatically identifying novel document-level (inter-sentence context) relationships. However, its annotations lack directionality (subject/object) for the entity roles, which is essential for studying complex biological networks. Here we annotate the entity roles of the relationships in the BioRED corpus and then propose a novel multi-task language model with soft-prompt learning to jointly identify the relationship, novel findings, and entity roles. Our results include an enriched BioRED corpus with 10,864 directionality annotations. Moreover, our proposed method outperforms existing large language models, such as the state-of-the-art GPT-4 and Llama-3, on two benchmarking tasks. Our source code and dataset are available at https://github.com/ncbi-nlp/BioREDirect.