Large language model based framework for automated extraction of genetic interactions from unstructured data.

Affiliations

Health Innovation and Transformation Centre, Federation University, Ballarat, Victoria, Australia.

BioThink, Brisbane, Queensland, Australia.

Publication information

PLoS One. 2024 May 21;19(5):e0303231. doi: 10.1371/journal.pone.0303231. eCollection 2024.

DOI:10.1371/journal.pone.0303231

PMID:38771886

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11108146/

Abstract

Extracting biological interactions from published literature helps us understand complex biological systems, accelerate research, and support decision-making in drug or treatment development. Despite efforts to automate the extraction of biological relations using text mining tools and machine learning pipelines, manual curation continues to serve as the gold standard. However, the rapidly increasing volume of literature pertaining to biological relations poses challenges for its manual curation and refinement. These challenges are compounded because only a small fraction of the published literature is relevant to biological relation extraction, and the sentences in relevant sections have complex, embedded structures, which can lead to incorrect inference of relationships. To overcome these challenges, we propose GIX, an automated and robust Gene Interaction Extraction framework based on pre-trained large language models, fine-tuned through extensive evaluations on various gene/protein interaction corpora including LLL and RegulonDB. GIX identifies relevant publications with minimal keywords, optimises sentence selection to reduce computational overhead, simplifies sentence structure while preserving meaning, and provides a confidence factor indicating the reliability of extracted relations. GIX's Stage-2 relation extraction method performed well on benchmark protein/gene interaction datasets, assessed using 10-fold cross-validation, surpassing state-of-the-art approaches. We demonstrated that the proposed method, although fully automated, performs as well as manual relation extraction, with enhanced robustness. We also observed GIX's capability to augment existing datasets with new sentences, incorporating newly discovered biological terms and processes. Further, we demonstrated GIX's real-world applicability in inferring E. coli gene circuits.
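
The abstract does not say how GIX selects sentences, so the following is only an illustrative sketch of one common pre-filter: keep a sentence only if it names at least two known genes, so that the more expensive relation-extraction model sees fewer inputs. The function name `candidate_pairs`, the naive sentence splitter, and the gene names `geneA`, `geneB`, and `geneC` are all hypothetical and are not taken from the paper.

```python
import re
from itertools import combinations


def candidate_pairs(text, gene_names):
    """Return (sentence, gene_a, gene_b) for each pair of known genes
    that co-occur in one sentence. Hypothetical helper, not GIX's method."""
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    results = []
    for sentence in sentences:
        # Exact, case-sensitive token match against the gene dictionary.
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        found = sorted(g for g in gene_names if g in tokens)
        # Sentences with fewer than two gene mentions yield no pairs
        # and are therefore dropped.
        for gene_a, gene_b in combinations(found, 2):
            results.append((sentence, gene_a, gene_b))
    return results


# Made-up example text with placeholder gene names.
text = (
    "Expression of geneA is activated by geneB. "
    "Cells were grown for 24 h. "
    "The geneC product represses geneA."
)
pairs = candidate_pairs(text, {"geneA", "geneB", "geneC"})
for sentence, gene_a, gene_b in pairs:
    print(f"{gene_a} / {gene_b}: {sentence}")
```

In this sketch the middle sentence is discarded because it mentions no gene, which is the kind of reduction in downstream workload the abstract alludes to. A real pipeline would need proper sentence segmentation and gene-name normalisation.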
