FuncFetch: an LLM-assisted workflow enables mining thousands of enzyme-substrate interactions from published manuscripts.

Authors

Smith Nathaniel, Yuan Xinyu, Melissinos Chesney, Moghe Gaurav

Affiliation

Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, United States.

Publication

Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btae756.

DOI:10.1093/bioinformatics/btae756
PMID:39718779
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11734755/
Abstract

MOTIVATION

Thousands of genomes are publicly available; however, most genes in those genomes have poorly defined functions. This is partly due to a gap between previously published, experimentally characterized protein activities and activities deposited in databases. This activity deposition is bottlenecked by the time-consuming biocuration process. The emergence of large language models presents an opportunity to speed up the text-mining of protein activities for biocuration.

RESULTS

We developed FuncFetch, a workflow that integrates NCBI E-Utilities, OpenAI's GPT-4, and Zotero, to screen thousands of manuscripts and extract enzyme activities. Extensive validation revealed high precision and recall of GPT-4 in determining whether the abstract of a given paper indicates the presence of a characterized enzyme activity in that paper. Provided the manuscript, FuncFetch extracted data such as species information, enzyme names, sequence identifiers, substrates, and products, which were subjected to extensive quality analyses. Comparison of this workflow against a manually curated dataset of BAHD acyltransferase activities demonstrated a precision/recall of 0.86/0.64 in extracting substrates. We further deployed FuncFetch on nine large plant enzyme families. Screening 26 543 papers, FuncFetch retrieved 32 605 entries from 5459 selected papers. We also identified multiple extraction errors including incorrect associations, nontarget enzymes, and hallucinations, which highlight the need for further manual curation. The BAHD activities were verified, resulting in a comprehensive functional fingerprint of this family and revealing that ∼70% of the experimentally characterized enzymes are uncurated in the public domain. FuncFetch represents an advance in biocuration and lays the groundwork for predicting the functions of uncharacterized enzymes.
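To make the reported precision/recall figures concrete, here is a minimal sketch of how such scores can be computed when extracted substrates are compared against a curated reference set. The abstract does not say how the authors matched substrate names, so exact string matching, the function name `precision_recall`, and the substrate lists below are all assumptions for illustration, not data or code from the paper.

```python
def precision_recall(extracted: set[str], curated: set[str]) -> tuple[float, float]:
    """Return (precision, recall) of extracted items against a curated set.

    Assumption: items match only when their names are identical strings.
    """
    true_positives = len(extracted & curated)
    # Precision: fraction of extracted items that are in the curated set.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: fraction of curated items that were extracted.
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall


# Hypothetical substrate names, for illustration only (not from the paper).
curated = {"malonyl-CoA", "acetyl-CoA", "benzoyl-CoA", "shikimate", "quinate"}
extracted = {"malonyl-CoA", "acetyl-CoA", "shikimate", "sucrose"}

precision, recall = precision_recall(extracted, curated)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints "precision=0.75 recall=0.60"
```

In this toy case three of the four extracted names are correct (precision 0.75) and three of the five curated names are recovered (recall 0.60). A reported pair such as 0.86/0.64 reads the same way: most extracted substrates are correct, while roughly a third of the curated ones are missed.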

AVAILABILITY AND IMPLEMENTATION

Code and minimally curated activities are available at: https://github.com/moghelab/funcfetch and https://tools.moghelab.org/funczymedb.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1a0d/11734755/bbda34db83da/btae756f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1a0d/11734755/6946b76fb2cf/btae756f1.jpg
