FuncFetch: an LLM-assisted workflow enables mining thousands of enzyme-substrate interactions from published manuscripts.

Author information

Smith Nathaniel, Yuan Xinyu, Melissinos Chesney, Moghe Gaurav

Affiliation

Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, United States.

Publication information

Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btae756.

DOI:10.1093/bioinformatics/btae756

PMID:39718779

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11734755/

Abstract

MOTIVATION

Thousands of genomes are publicly available; however, most genes in those genomes have poorly defined functions. This is partly due to a gap between previously published, experimentally characterized protein activities and the activities deposited in databases. Deposition is bottlenecked by the time-consuming biocuration process. The emergence of large language models presents an opportunity to speed up the text-mining of protein activities for biocuration.

RESULTS

We developed FuncFetch, a workflow that integrates NCBI E-Utilities, OpenAI's GPT-4, and Zotero, to screen thousands of manuscripts and extract enzyme activities. Extensive validation revealed high precision and recall of GPT-4 in determining whether the abstract of a given paper indicates the presence of a characterized enzyme activity in that paper. Given the manuscript, FuncFetch extracted data such as species information, enzyme names, sequence identifiers, substrates, and products, which were subjected to extensive quality analyses. Comparison of this workflow against a manually curated dataset of BAHD acyltransferase activities demonstrated a precision/recall of 0.86/0.64 in extracting substrates. We further deployed FuncFetch on nine large plant enzyme families. Screening 26,543 papers, FuncFetch retrieved 32,605 entries from 5,459 selected papers. We also identified multiple extraction errors, including incorrect associations, nontarget enzymes, and hallucinations, which highlight the need for further manual curation. The BAHD activities were verified, resulting in a comprehensive functional fingerprint of this family and revealing that ~70% of the experimentally characterized enzymes are uncurated in the public domain. FuncFetch represents an advance in biocuration and lays the groundwork for predicting the functions of uncharacterized enzymes.
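As a rough illustration of how a precision/recall figure such as the one reported for substrate extraction can be computed, the sketch below compares a set of extracted (enzyme, substrate) pairs against a curated reference set. This is not the authors' evaluation code: the function name `precision_recall`, the pairing of enzyme with substrate as the unit of comparison, and all enzyme and substrate entries are assumptions invented for the example.

```python
def precision_recall(extracted, curated):
    """Return (precision, recall) of extracted items against a curated reference.

    precision = correct extractions / all extractions
    recall    = correct extractions / all curated items
    """
    extracted, curated = set(extracted), set(curated)
    true_pos = len(extracted & curated)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(curated) if curated else 0.0
    return precision, recall


# Hypothetical (enzyme, substrate) pairs, invented for illustration only;
# they are not taken from the FuncFetch dataset.
curated_pairs = {
    ("EnzA", "malonyl-CoA"),
    ("EnzA", "cyanidin 3-glucoside"),
    ("EnzB", "acetyl-CoA"),
    ("EnzB", "benzyl alcohol"),
    ("EnzC", "coumaroyl-CoA"),
}
extracted_pairs = {
    ("EnzA", "malonyl-CoA"),
    ("EnzB", "acetyl-CoA"),
    ("EnzB", "benzyl alcohol"),
    ("EnzB", "ethanol"),  # spurious extraction, absent from the curated set
}

precision, recall = precision_recall(extracted_pairs, curated_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case three of the four extracted pairs are correct (precision 0.75) and three of the five curated pairs are recovered (recall 0.60). Exact set matching is the simplest possible criterion; a real comparison would likely need to normalize chemical names and synonyms before matching.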

AVAILABILITY AND IMPLEMENTATION

Code and minimally curated activities are available at: https://github.com/moghelab/funcfetch and https://tools.moghelab.org/funczymedb.
