Advancing plant metabolic research by using large language models to expand databases and extract labeled data.

Author information

Knapp Rachel, Johnson Braidon, Busta Lucas

Affiliations

Department of Chemistry and Biochemistry, University of Minnesota Duluth, Duluth, Minnesota, USA.

Department of Chemical Engineering, University of Minnesota Duluth, Duluth, Minnesota, USA.

Publication information

Appl Plant Sci. 2025 May 14;13(4):e70007. doi: 10.1002/aps3.70007. eCollection 2025 Jul-Aug.

DOI:10.1002/aps3.70007

PMID:40766897

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12319720/

Abstract

PREMISE

Recently, plant science has seen transformative advances in scalable data collection for sequence and chemical data. These large datasets, combined with machine learning, have demonstrated that conducting plant metabolic research on large scales yields remarkable insights. A key next step in increasing scale has been revealed with the advent of accessible large language models, which, even in their early stages, can distill structured data from the literature. This brings us closer to creating specialized databases that consolidate virtually all published knowledge on a topic.

METHODS

Here, we first test different combinations of prompt engineering techniques and language models in the identification of validated enzyme-product pairs. Next, we evaluate the application of automated prompt engineering and retrieval-augmented generation to identify compound-species associations. Finally, we build and determine the accuracy of a multimodal language model-based pipeline that transcribes images of tables into machine-readable formats.
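As a rough illustration of what testing combinations of prompts and models can look like, the sketch below scores every prompt-model pairing against a small labeled set. Everything in it is an assumption made for illustration: the passages, the two prompt templates, and the two stand-in "models" (plain keyword functions, not real language models) are hypothetical and are not taken from the paper.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical labeled passages: (text, reports_validated_enzyme_product_pair).
GOLD: List[Tuple[str, bool]] = [
    ("Enzyme A converted substrate X to product Y in an in vitro assay.", True),
    ("Product Y accumulates in leaves.", False),
    ("Enzyme B is predicted by homology to produce product Z.", False),
    ("Recombinant enzyme C produced product W when expressed in yeast.", True),
]

# Hypothetical prompt variants (instruction text only).
PROMPTS: Dict[str, str] = {
    "zero_shot": "Does this passage report an enzyme-product pair?",
    "with_definition": (
        "A validated pair requires experimental evidence. "
        "Does this passage report a validated enzyme-product pair?"
    ),
}

EVIDENCE_WORDS = ("assay", "recombinant", "in vitro")


def keyword_model(instruction: str, passage: str) -> bool:
    """Stand-in model that ignores the instruction and looks for 'enzyme'."""
    return "enzyme" in passage.lower()


def instruction_sensitive_model(instruction: str, passage: str) -> bool:
    """Stand-in model that asks for evidence only when the instruction does."""
    text = passage.lower()
    mentions_enzyme = "enzyme" in text
    if "experimental" in instruction.lower():
        return mentions_enzyme and any(word in text for word in EVIDENCE_WORDS)
    return mentions_enzyme


MODELS: Dict[str, Callable[[str, str], bool]] = {
    "keyword": keyword_model,
    "instruction_sensitive": instruction_sensitive_model,
}


def accuracy(model: Callable[[str, str], bool], instruction: str) -> float:
    """Fraction of labeled passages the model classifies correctly."""
    correct = sum(model(instruction, text) == label for text, label in GOLD)
    return correct / len(GOLD)


# Score every (prompt, model) combination.
results = {
    (prompt_name, model_name): accuracy(MODELS[model_name], PROMPTS[prompt_name])
    for prompt_name, model_name in product(PROMPTS, MODELS)
}

for combo, score in sorted(results.items()):
    print(combo, f"{score:.2f}")
```

In practice each stand-in function would be replaced by a call to an actual language model, and the labeled set would need to be much larger for the comparison to mean anything.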

RESULTS

When tuned for each specific task, these methods perform with high (80-90%) or modest (50%) accuracies for enzyme-product pair identification and table image transcription, but with lower false-negative rates than previous methods (decreasing from 55% to 40%) for compound-species pair identification.
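For readers less familiar with the metric, the following minimal sketch shows one common way to compute a false-negative rate: the share of known (gold-standard) associations that the extraction step failed to return. The abstract does not say exactly how the authors computed their figures, so this standard definition is an assumption, and the compound-species pairs below are hypothetical placeholders rather than data from the paper.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (compound, species)


def false_negative_rate(gold: Set[Pair], predicted: Set[Pair]) -> float:
    """Return FN / (FN + TP): the fraction of gold pairs that were missed."""
    if not gold:
        raise ValueError("gold set must not be empty")
    missed = gold - predicted  # gold pairs absent from the predictions
    return len(missed) / len(gold)


# Hypothetical gold-standard associations.
gold_pairs = {
    ("compound_1", "species_a"),
    ("compound_2", "species_a"),
    ("compound_3", "species_b"),
    ("compound_4", "species_c"),
    ("compound_5", "species_c"),
}

# Hypothetical pipeline output: three correct pairs plus one spurious pair.
predicted_pairs = {
    ("compound_1", "species_a"),
    ("compound_3", "species_b"),
    ("compound_5", "species_c"),
    ("compound_9", "species_z"),
}

fnr = false_negative_rate(gold_pairs, predicted_pairs)
print(f"False-negative rate: {fnr:.0%}")
```

Note that the spurious pair does not affect this metric; it would count against precision instead, which is why a false-negative rate is usually reported alongside a measure of false positives.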

DISCUSSION

We enumerate several suggestions for researchers working with language models, among which is the importance of the user's domain-specific expertise and knowledge.
