Yang Fan, Kong Huanjun, Ying Jie, Chen Zihong, Luo Tao, Jiang Wanli, Yuan Zhonghang, Wang Zhefan, Ma Zhaona, Wang Shikuan, Ma Wanfeng, Wang Xiaoyi, Li Xiaoying, Hu Zhengyin, Ma Xiaodong, Liu Minguo, Wang Xiqing, Chen Fan, Dong Nanqing
Yazhouwan National Laboratory, Sanya 572025, China.
Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China.
Mol Plant. 2025 Jul 7;18(7):1118-1129. doi: 10.1016/j.molp.2025.05.013. Epub 2025 May 28.
Rice biology research involves complex decision-making, requiring researchers to navigate a rapidly expanding body of knowledge encompassing extensive literature and multiomics data. The exponential increase in biological data and scientific publications presents significant challenges for efficiently extracting meaningful insights. Although large language models (LLMs) show promise for knowledge retrieval, their application to rice-specific research has been limited by the absence of specialized models and the challenge of synthesizing multimodal data integral to the field. Moreover, the lack of standardized evaluation frameworks for domain-specific tasks impedes the effective assessment of model performance. To address these challenges, we introduce SeedLLM·Rice (SeedLLM), a 7-billion-parameter model trained on 1.4 million rice-related publications, representing approximately 98.24% of global rice research output. Additionally, we present a novel human-centric evaluation framework designed to assess LLM performance in rice biology tasks. Initial evaluations demonstrate that SeedLLM outperforms general-purpose models, including OpenAI GPT-4o1 and DeepSeek-R1, achieving win rates of 57% to 88% on rice-specific tasks. Furthermore, SeedLLM is integrated with the Rice Biological Knowledge Graph (RBKG), which consolidates genome annotations for Nipponbare with a large-scale synthesis of transcriptomic and proteomic information from more than 1,800 studies. This integration enhances the ability of SeedLLM to address complex research questions requiring the fusion of textual and multiomics data. To facilitate global collaboration, we provide free access to SeedLLM and the RBKG via an interactive web portal (https://seedllm.org.cn/). SeedLLM represents a transformative tool for rice biology research, enabling unprecedented discoveries in crop improvement and climate adaptation through advanced reasoning and comprehensive data integration.