Xu Feng, Wu Tianhao, Cheng Qian, Wang Xiangfeng, Yan Jun
Frontiers Science Center for Molecular Design Breeding, State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, College of Agronomy and Biotechnology, China Agricultural University, Beijing, China.
Front Plant Sci. 2025 Jun 3;16:1611992. doi: 10.3389/fpls.2025.1611992. eCollection 2025.
A foundation model (FM) is a neural network trained on large-scale data using unsupervised or self-supervised learning, capable of adapting to a wide range of downstream tasks. This review provides a comprehensive overview of FMs in plant molecular biology, emphasizing recent advances and future directions. It begins by tracing the evolution of biological FMs across the DNA, RNA, protein, and single-cell levels, from tools inspired by natural language processing (NLP) to transformative models for decoding complex biological sequences. The review then focuses on plant-specific FMs such as GPN, AgroNT, PDLLMs, PlantCaduceus, and PlantRNA-FM, which address challenges that are widespread among plant genomes, including polyploidy, high repetitive sequence content, and environment-responsive regulatory elements, alongside universal FMs like GENERator and Evo 2, which leverage extensive cross-species training data for sequence design and prediction of mutation effects. Key opportunities and challenges in plant molecular biology FM development are further outlined, such as data heterogeneity, biologically informed architectures, cross-species generalization, and computational efficiency. Future research should prioritize improvements in model generalization, multi-modal data integration, and computational optimization to overcome existing limitations and unlock the potential of FMs in plant science. This review serves as an essential resource for plant molecular biologists and offers a clear snapshot of the current state and future potential of FMs in the field.