Nucleotide Transformer: building and evaluating robust foundation models for human genomics.

Authors

Dalla-Torre Hugo, Gonzalez Liam, Mendoza-Revilla Javier, Lopez Carranza Nicolas, Grzywaczewski Adam Henryk, Oteri Francesco, Dallago Christian, Trop Evan, de Almeida Bernardo P, Sirelkhatim Hassan, Richard Guillaume, Skwark Marcin, Beguir Karim, Lopez Marie, Pierrot Thomas

Affiliations

InstaDeep, London, UK.

Nvidia, Santa Clara, CA, USA.

Publication information

Nat Methods. 2025 Feb;22(2):287-297. doi: 10.1038/s41592-024-02523-z. Epub 2024 Nov 28.

DOI: 10.1038/s41592-024-02523-z
PMID: 39609566
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11810778/
Abstract

The prediction of molecular phenotypes from DNA sequences remains a longstanding challenge in genomics, often driven by limited annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named Nucleotide Transformer, ranging from 50 million up to 2.5 billion parameters and integrating information from 3,202 human genomes and 850 genomes from diverse species. These transformer models yield context-specific representations of nucleotide sequences, which allow for accurate predictions even in low-data settings. We show that the developed models can be fine-tuned at low cost to solve a variety of genomics applications. Despite no supervision, the models learned to focus attention on key genomic elements and can be used to improve the prioritization of genetic variants. The training and application of foundational models in genomics provides a widely applicable approach for accurate molecular phenotype prediction from DNA sequence.
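
The abstract does not say how sequence representations are turned into a variant ranking. One plausible reading, sketched here in Python under loudly stated assumptions, is to embed the reference sequence and each mutated sequence, then rank variants by how far the mutation moves the representation. Everything named below (`toy_embed`, `apply_snv`, `rank_variants`, the k-mer size `K`) is hypothetical, and the k-mer count vector is only a stand-in for a learned embedding; it is not the Nucleotide Transformer and not the authors' scoring method.

```python
from collections import Counter
from math import sqrt

K = 3  # hypothetical k-mer size chosen for this toy stand-in


def toy_embed(seq: str, k: int = K) -> Counter:
    """Stand-in for a model embedding: counts of overlapping k-mers.

    ASSUMPTION: a real foundation model would return a learned,
    context-specific vector here instead of raw k-mer counts.
    """
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)  # missing keys count as 0
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def apply_snv(seq: str, pos: int, alt: str) -> str:
    """Return seq with the base at 0-based position pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]


def rank_variants(ref_seq: str, variants: list[tuple[int, str]]):
    """Score each (pos, alt) variant by embedding distance to the reference.

    Returns (variant, score) pairs sorted from largest to smallest shift.
    """
    ref_vec = toy_embed(ref_seq)
    scored = [
        ((pos, alt), cosine_distance(ref_vec, toy_embed(apply_snv(ref_seq, pos, alt))))
        for pos, alt in variants
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    for variant, score in rank_variants(reference, [(0, "A"), (4, "T")]):
        print(variant, round(score, 3))
```

In this sketch a variant that leaves the sequence unchanged scores approximately zero, while a real substitution scores higher and is ranked first. Swapping `toy_embed` for embeddings from a pre-trained model would keep the ranking logic the same.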

Similar articles

1. Nucleotide Transformer: building and evaluating robust foundation models for human genomics.
   Nat Methods. 2025 Feb;22(2):287-297. doi: 10.1038/s41592-024-02523-z. Epub 2024 Nov 28.
2. Enhancing personalized gene expression prediction from DNA sequences using genomic foundation models.
   HGG Adv. 2024 Oct 10;5(4):100347. doi: 10.1016/j.xhgg.2024.100347. Epub 2024 Aug 27.
3. Enhancing recognition and interpretation of functional phenotypic sequences through fine-tuning pre-trained genomic models.
   J Transl Med. 2024 Aug 12;22(1):756. doi: 10.1186/s12967-024-05567-z.
4. Embed-Search-Align: DNA sequence alignment using Transformer models.
   Bioinformatics. 2025 Mar 4;41(3). doi: 10.1093/bioinformatics/btaf041.
5. Distinguishing word identity and sequence context in DNA language models.
   BMC Bioinformatics. 2024 Sep 13;25(1):301. doi: 10.1186/s12859-024-05869-5.
6. STICI: Split-Transformer with integrated convolutions for genotype imputation.
   Nat Commun. 2025 Jan 31;16(1):1218. doi: 10.1038/s41467-025-56273-3.
7. Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing.
   Brief Bioinform. 2010 Mar;11(2):181-97. doi: 10.1093/bib/bbp046. Epub 2009 Oct 27.
8. A dictionary based informational genome analysis.
   BMC Genomics. 2012 Sep 17;13:485. doi: 10.1186/1471-2164-13-485.
9. Floating Search Methodology for Combining Classification Models for Site Recognition in DNA Sequences.
   IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2471-2482. doi: 10.1109/TCBB.2020.2974221. Epub 2021 Dec 8.
10. Do it yourself guide to genome assembly.
    Brief Funct Genomics. 2016 Jan;15(1):1-9. doi: 10.1093/bfgp/elu042. Epub 2014 Nov 11.

Cited by

1. Language Modelling Techniques for Analysing the Impact of Human Genetic Variation.
   Bioinform Biol Insights. 2025 Sep 2;19:11779322251358314. doi: 10.1177/11779322251358314. eCollection 2025.
2. ARCADE: Controllable Codon Design from Foundation Models via Activation Engineering.
   bioRxiv. 2025 Aug 23:2025.08.19.668819. doi: 10.1101/2025.08.19.668819.
3. Pre-training Genomic Language Model with Variants for Better Modeling Functional Genomics.
   bioRxiv. 2025 Aug 23:2025.02.26.640468. doi: 10.1101/2025.02.26.640468.
4. Machine learning tools for deciphering the regulatory logic of enhancers in health and disease.
   Front Genet. 2025 Aug 13;16:1603687. doi: 10.3389/fgene.2025.1603687. eCollection 2025.
5. NextVir: Enabling classification of tumor-causing viruses with genomic foundation models.
   PLoS Comput Biol. 2025 Aug 21;21(8):e1013360. doi: 10.1371/journal.pcbi.1013360. eCollection 2025 Aug.
6. NetStart 2.0: prediction of eukaryotic translation initiation sites using a protein language model.
   BMC Bioinformatics. 2025 Aug 19;26(1):216. doi: 10.1186/s12859-025-06220-2.
7. Tokenization and deep learning architectures in genomics: A comprehensive review.
   Comput Struct Biotechnol J. 2025 Jul 28;27:3547-3555. doi: 10.1016/j.csbj.2025.07.038. eCollection 2025.
8. MKFGO: integrating multi-source knowledge fusion with pretrained language model for high-accuracy protein function prediction.
   Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf420.
9. De novo prediction of functional effects of genetic variants from DNA sequences based on context-specific molecular information.
   Front Syst Biol. 2024 Jun 3;4:1402664. doi: 10.3389/fsysb.2024.1402664. eCollection 2024.
10. Computational methods for alternative polyadenylation and splicing in post-transcriptional gene regulation.
    Exp Mol Med. 2025 Aug 14. doi: 10.1038/s12276-025-01496-z.

References

1. A foundational large language model for edible plant genomes.
   Commun Biol. 2024 Jul 9;7(1):835. doi: 10.1038/s42003-024-06465-2.
2. DNA language models are powerful predictors of genome-wide variant effects.
   Proc Natl Acad Sci U S A. 2023 Oct 31;120(44):e2311219120. doi: 10.1073/pnas.2311219120. Epub 2023 Oct 26.
3. Light attention predicts protein location from the language of life.
   Bioinform Adv. 2021 Nov 19;1(1):vbab035. doi: 10.1093/bioadv/vbab035. eCollection 2021.
4. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios.
   Cell. 2022 Sep 1;185(18):3426-3440.e19. doi: 10.1016/j.cell.2022.08.004.
5. A sequence-based global map of regulatory activity for deciphering human genetics.
   Nat Genet. 2022 Jul;54(7):940-949. doi: 10.1038/s41588-022-01102-2. Epub 2022 Jul 11.
6. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers.
   Nat Genet. 2022 May;54(5):613-624. doi: 10.1038/s41588-022-01048-5. Epub 2022 May 12.
7. Embeddings from protein language models predict conservation and variant effects.
   Hum Genet. 2022 Oct;141(10):1629-1647. doi: 10.1007/s00439-021-02411-y. Epub 2021 Dec 30.
8. Protein embeddings and deep learning predict binding residues for various ligand classes.
   Sci Rep. 2021 Dec 13;11(1):23916. doi: 10.1038/s41598-021-03431-4.
9. Effective gene expression prediction from sequence by integrating long-range interactions.
   Nat Methods. 2021 Oct;18(10):1196-1203. doi: 10.1038/s41592-021-01252-x. Epub 2021 Oct 4.
10. Large-scale cis- and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression.
    Nat Genet. 2021 Sep;53(9):1300-1310. doi: 10.1038/s41588-021-00913-z. Epub 2021 Sep 2.