SyRACT: zero-shot biomedical document-level relation extraction with synergistic RAG and CoT.

Authors

Dong Xin, Zhao Di, Meng Jiana, Guo Bocheng, Lin Hongfei

Affiliations

School of Computer Science and Engineering, Dalian Minzu University, Liaoning 116600, China.

School of Computer Science and Technology, Dalian University of Technology, Liaoning 116024, China.

Publication information

Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf356.

DOI: 10.1093/bioinformatics/btaf356

PMID: 40577808

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12237500/

Abstract

MOTIVATION

With the advancement of large language models (LLMs), the field of biomedical document-level relation extraction (BioDocRE) has encountered new opportunities. However, LLMs often face challenges such as hallucinated generation, insufficient reasoning capabilities, and a lack of interpretability when performing relation extraction tasks.

RESULTS

To address these issues, we propose the SyRACT (Synergistic Retrieval Augmented Generation and Chain of Thought) framework for high-precision relation extraction in biomedical documents. The framework is built around three core strategies: (i) reframing the relation extraction task as a question answering problem to better align with the processing logic of LLMs; (ii) leveraging an external database constructed from PubMed to provide LLMs with rich and reliable contextual information, thus mitigating hallucinated generation; and (iii) constructing a specific Chain of Thought for BioDocRE tasks, thereby enhancing the model's reasoning ability and the interpretability of its output. We validated this approach on three biomedical relation extraction datasets: CDR, GDA, and ADE. Experimental results show that the SyRACT model improves F1 scores by 11.04%, 9.10%, and 41.00% on the three datasets, respectively, compared to the DocRE method, which uses standard prompts for LLMs.
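
To make the three strategies concrete, the following is a minimal sketch of how a zero-shot prompt might combine them. It is not taken from the SyRACT repository: the `Passage` structure, the function names, the wording of the question, and the reasoning steps in `COT_STEPS` are all illustrative assumptions, and the retrieval step itself (querying a PubMed-derived database) is left out and represented only by its output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Passage:
    """One retrieved snippet (hypothetical structure, not from the paper)."""
    source_id: str  # identifier of the record the snippet came from
    text: str


# Hypothetical reasoning steps; the paper's actual chain of thought may differ.
COT_STEPS = [
    "Locate every mention of both entities in the document.",
    "Check whether the retrieved evidence links the two entities.",
    "Decide whether the document itself states or implies the relation.",
    "Answer 'yes' or 'no' and cite the sentences that justify the answer.",
]


def relation_to_question(head: str, tail: str, relation: str) -> str:
    """Strategy (i): recast a candidate triple as a yes/no question."""
    return (
        f"Does the document express the relation '{relation}' "
        f"between '{head}' and '{tail}'?"
    )


def build_prompt(document: str, head: str, tail: str, relation: str,
                 passages: List[Passage]) -> str:
    """Combine the question with retrieved evidence (ii) and reasoning steps (iii)."""
    evidence = "\n".join(
        f"[{i}] ({p.source_id}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(COT_STEPS, start=1))
    # Zero-shot: the prompt contains no labeled examples, only instructions.
    return (
        f"Document:\n{document}\n\n"
        f"Retrieved evidence:\n{evidence or '(none)'}\n\n"
        f"Question: {relation_to_question(head, tail, relation)}\n\n"
        f"Think step by step:\n{steps}\n"
    )


if __name__ == "__main__":
    # Placeholder inputs for illustration only; not drawn from CDR, GDA, or ADE.
    demo = build_prompt(
        document="Placeholder abstract mentioning chemical X and disease Y.",
        head="chemical X",
        tail="disease Y",
        relation="induces",
        passages=[Passage("record-1", "Placeholder evidence about X and Y.")],
    )
    print(demo)
```

The resulting string would be sent to an LLM of choice. Asking for cited sentences in the final step is one way to make the answer easier to inspect, in line with the interpretability goal stated above.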

AVAILABILITY AND IMPLEMENTATION

Our source code and data are available at https://github.com/donggggxin/SyRACT.
