PlantCAD2: A Long-Context DNA Language Model for Cross-Species Functional Annotation in Angiosperms.

Author information

Zhai Jingjing, Gokaslan Aaron, Hsu Sheng-Kai, Chen Szu-Ping, Liu Zong-Yan, Marroquin Edgar, Czech Eric, Cannon Betsy, Berthel Ana, Cinta Romay M, Pennell Matt, Kuleshov Volodymyr, Buckler Edward S

Affiliations

Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853.

Department of Computer Science, Cornell University, Ithaca, NY, USA 14853.

Publication information

bioRxiv. 2025 Sep 1:2025.08.27.672609. doi: 10.1101/2025.08.27.672609.

Abstract

Understanding how DNA sequence encodes biological function remains a fundamental challenge in biology. Flowering plants (angiosperms), the dominant terrestrial clade, exhibit maximal biochemical complexity, extraordinary species diversity (over 100,000 species), relatively recent origins (~160 million years), ~200-fold variation in genome size, and relatively compact coding regions compared with other eukaryotes. These features present both a unique challenge and an opportunity for pre-training DNA language models to understand plant-specific evolutionary conservation, regulatory architectures, and genomic functions. Here, we introduce PlantCAD2, a long-context, plant-specific DNA language model with single-nucleotide resolution, pre-trained on 65 angiosperm genomes, together with a series of public benchmarks for evaluation. Comprehensive zero-shot testing shows that PlantCAD2 (676 million parameters) efficiently captures evolutionary conservation, surpassing the 7-billion-parameter Evo2 model in 10 of 12 tasks. With parameter-efficient fine-tuning, PlantCAD2 also outperforms the 1-billion-parameter AgroNT across seven cross-species tasks. Moreover, its 8 kb context window substantially improves accessible chromatin prediction in large genomes such as maize (AUPRC increasing from 0.587 to 0.711), underscoring the importance of long-range context for modeling distal regulation. Together, these results establish PlantCAD2 as a powerful, efficient, and versatile foundation model for plant genomics, enabling accurate genome annotation across diverse species.
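For readers unfamiliar with the AUPRC metric quoted above, the following is a minimal sketch of one common definition, step-wise average precision. The abstract does not state how the authors computed AUPRC, so the function `average_precision`, the toy labels, and the toy scores are illustrative assumptions, not the paper's evaluation code. Only the two values 0.587 and 0.711 are taken from the abstract.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: iterable of 0/1 ground-truth values (1 = positive, e.g. accessible chromatin).
    scores: iterable of model scores, higher meaning more likely positive.
    Assumes no tied scores; ties would need grouped thresholds.
    """
    labels = list(labels)
    scores = list(scores)
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            # Precision at the rank where recall increases by 1 / total_pos.
            precision_sum += true_pos / rank
    return precision_sum / total_pos


# Hypothetical toy data, not taken from the paper.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)

# Relative gain implied by the two AUPRC values reported in the abstract.
relative_gain = (0.711 - 0.587) / 0.587
```

Because AUPRC depends on the fraction of positives in the test set, the roughly 21% relative gain computed here is meaningful only for comparisons made on the same benchmark.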

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8dee/12425018/8e610dd5808c/nihpp-2025.08.27.672609v1-f0001.jpg
