Advancing plant metabolic research by using large language models to expand databases and extract labeled data.

Author information

Knapp Rachel, Johnson Braidon, Busta Lucas

Affiliations

Department of Chemistry and Biochemistry, University of Minnesota Duluth, Duluth, Minnesota, USA.

Department of Chemical Engineering, University of Minnesota Duluth, Duluth, Minnesota, USA.

Publication information

Appl Plant Sci. 2025 May 14;13(4):e70007. doi: 10.1002/aps3.70007. eCollection 2025 Jul-Aug.

Abstract

PREMISE

Recently, plant science has seen transformative advances in scalable data collection for sequence and chemical data. These large datasets, combined with machine learning, have demonstrated that conducting plant metabolic research on large scales yields remarkable insights. A key next step in increasing scale has been revealed with the advent of accessible large language models, which, even in their early stages, can distill structured data from the literature. This brings us closer to creating specialized databases that consolidate virtually all published knowledge on a topic.

METHODS

Here, we first test different combinations of prompt engineering techniques and language models in the identification of validated enzyme-product pairs. Next, we evaluate the application of automated prompt engineering and retrieval-augmented generation to identify compound-species associations. Finally, we build and determine the accuracy of a multimodal language model-based pipeline that transcribes images of tables into machine-readable formats.

RESULTS

When tuned for each specific task, these methods perform with high (80-90%) or modest (50%) accuracies for enzyme-product pair identification and table image transcription, but with lower false-negative rates than previous methods (decreasing from 55% to 40%) for compound-species pair identification.
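
To make the false-negative figures above concrete, the following is a minimal sketch of how a false-negative rate for compound-species pair identification could be computed from a curated reference set. The function name, the pair representation, and the toy data are illustrative assumptions, not the authors' evaluation code or results.

```python
def false_negative_rate(true_pairs, predicted_pairs):
    """Return the fraction of true pairs that were missed: FN / (FN + TP).

    Both arguments are iterables of (compound, species) tuples.
    This helper is hypothetical and only illustrates the metric.
    """
    true_set = set(true_pairs)
    if not true_set:
        raise ValueError("true_pairs must contain at least one pair")
    missed = true_set - set(predicted_pairs)  # pairs present in the reference but not extracted
    return len(missed) / len(true_set)


# Toy reference set of five curated pairs (placeholder names, not real data).
reference = [
    ("compound_a", "species_1"),
    ("compound_b", "species_1"),
    ("compound_c", "species_2"),
    ("compound_d", "species_3"),
    ("compound_e", "species_3"),
]

# Toy extraction output: three correct pairs and one spurious pair.
extracted = [
    ("compound_a", "species_1"),
    ("compound_c", "species_2"),
    ("compound_e", "species_3"),
    ("compound_x", "species_9"),  # false positive; does not affect the false-negative rate
]

fnr = false_negative_rate(reference, extracted)
print(f"false-negative rate: {fnr:.0%}")
```

In this toy case two of the five reference pairs are missed. Spurious extractions do not change the false-negative rate, so a complete evaluation would also need to report precision or a false-positive count.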

DISCUSSION

We enumerate several suggestions for researchers working with language models, among which is the importance of the user's domain-specific expertise and knowledge.
