Arc Institute, Palo Alto, CA, USA.
Department of Bioengineering, Stanford University, Stanford, CA, USA.
Science. 2024 Nov 15;386(6723):eado9336. doi: 10.1126/science.ado9336.
The genome is a sequence that encodes the DNA, RNA, and proteins that orchestrate an organism's function. We present Evo, a long-context genomic foundation model with a frontier architecture trained on millions of prokaryotic and phage genomes, and report scaling laws on DNA to complement observations in language and vision. Evo generalizes across DNA, RNA, and proteins, enabling zero-shot function prediction competitive with domain-specific language models and the generation of functional CRISPR-Cas and transposon systems, representing the first examples of protein-RNA and protein-DNA codesign with a language model. Evo also learns how small mutations affect whole-organism fitness and generates megabase-scale sequences with plausible genomic architecture. These prediction and generation capabilities span molecular to genomic scales of complexity, advancing our understanding and control of biology.