Nucleotide Transformer: building and evaluating robust foundation models for human genomics.
Author information
Dalla-Torre Hugo, Gonzalez Liam, Mendoza-Revilla Javier, Lopez Carranza Nicolas, Grzywaczewski Adam Henryk, Oteri Francesco, Dallago Christian, Trop Evan, de Almeida Bernardo P, Sirelkhatim Hassan, Richard Guillaume, Skwark Marcin, Beguir Karim, Lopez Marie, Pierrot Thomas
Affiliations
InstaDeep, London, UK.
Nvidia, Santa Clara, CA, USA.
Publication information
Nat Methods. 2025 Feb;22(2):287-297. doi: 10.1038/s41592-024-02523-z. Epub 2024 Nov 28.
The prediction of molecular phenotypes from DNA sequences remains a longstanding challenge in genomics, often driven by limited annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named Nucleotide Transformer, ranging from 50 million up to 2.5 billion parameters and integrating information from 3,202 human genomes and 850 genomes from diverse species. These transformer models yield context-specific representations of nucleotide sequences, which allow for accurate predictions even in low-data settings. We show that the developed models can be fine-tuned at low cost to solve a variety of genomics applications. Despite no supervision, the models learned to focus attention on key genomic elements and can be used to improve the prioritization of genetic variants. The training and application of foundational models in genomics provides a widely applicable approach for accurate molecular phenotype prediction from DNA sequence.
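To make "pre-trained on DNA sequences" with "no supervision" concrete, here is a minimal sketch of how unlabeled DNA can be turned into a self-supervised training example. The abstract does not state the tokenization scheme or the pre-training objective, so this sketch assumes non-overlapping k-mer tokens and BERT-style random masking. The function names `kmer_tokenize` and `mask_tokens`, the `<mask>` symbol, and the default values `k=6` and `mask_rate=0.15` are all hypothetical choices made for illustration, not details taken from the record above.

```python
import random


def kmer_tokenize(seq, k=6):
    """Split a DNA string into non-overlapping k-mers.

    Trailing bases that do not fill a whole k-mer are emitted as
    single-base tokens (an assumption made for this sketch).
    """
    seq = seq.upper()
    cut = len(seq) - len(seq) % k
    tokens = [seq[i:i + k] for i in range(0, cut, k)]
    tokens.extend(seq[cut:])  # leftover bases, one token each
    return tokens


def mask_tokens(tokens, mask_rate=0.15, mask_token="<mask>", seed=0):
    """Hide a random subset of tokens.

    Returns the masked token list and a dict {position: original token}.
    A model would be trained to predict the dict values from the masked list.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets


if __name__ == "__main__":
    tokens = kmer_tokenize("ATGCGTACGTTAGCCTAGGA")
    masked, targets = mask_tokens(tokens)
    print(tokens)
    print(masked)
    print(targets)
```

No labels are needed at any point: the hidden tokens themselves serve as prediction targets, which is why this kind of objective can scale to thousands of genomes.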