DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome.

Authors

Ji Yanrong, Zhou Zhihan, Liu Han, Davuluri Ramana V

Affiliations

Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA.

Department of Computer Science, Northwestern University, Evanston, IL 60208, USA.

Publication

Bioinformatics. 2021 Aug 9;37(15):2112-2120. doi: 10.1093/bioinformatics/btab083.

DOI: 10.1093/bioinformatics/btab083
PMID: 33538820
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11025658/
Abstract

MOTIVATION

Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex due to polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios.

RESULTS

To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrated its ease of use, accuracy and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after straightforward fine-tuning with small task-specific labeled datasets. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks.
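DNABERT represents a DNA sequence as overlapping k-mer tokens (the paper uses k = 3 to 6) before feeding them to the BERT-style encoder. A minimal sketch of this tokenization step; `seq2kmer` is used here as an illustrative name, not necessarily the repository's exact API:

```python
def seq2kmer(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mer tokens.

    DNABERT-style tokenization takes every length-k substring,
    so adjacent tokens share k-1 nucleotides and local context
    is preserved across token boundaries.
    """
    if len(seq) < k:
        raise ValueError(f"sequence shorter than k={k}")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# A sequence of length n yields n - k + 1 tokens.
tokens = seq2kmer("ATGCGTACGT", k=6)
print(tokens)  # ['ATGCGT', 'TGCGTA', 'GCGTAC', 'CGTACG', 'GTACGT']
```

The token vocabulary is therefore all 4^k possible k-mers plus the usual special tokens ([CLS], [SEP], [MASK]), which keeps the embedding table small for k up to 6.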

AVAILABILITY AND IMPLEMENTATION

The source code, pretrained and finetuned model for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT).
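The nucleotide-level importance visualization described in the abstract is derived from the model's attention: each k-mer token spans k nucleotides, so per-token attention scores can be projected back onto individual positions. A simplified NumPy sketch of that projection, assuming attention has already been averaged over heads and layers (this is an illustration, not the repository's actual implementation):

```python
import numpy as np

def kmer_attention_to_nucleotides(att: np.ndarray, k: int, seq_len: int) -> np.ndarray:
    """Project per-k-mer attention scores onto nucleotide positions.

    att: shape (n_tokens,), attention received by each k-mer token.
    Token i covers positions i..i+k-1, so each position accumulates
    the scores of every token overlapping it, normalized by how many
    tokens cover that position.
    """
    scores = np.zeros(seq_len)
    coverage = np.zeros(seq_len)
    for i, a in enumerate(att):
        scores[i:i + k] += a
        coverage[i:i + k] += 1
    return scores / np.maximum(coverage, 1)

# Positions covered by the high-attention middle token score highest.
importance = kmer_attention_to_nucleotides(np.array([0.2, 0.6, 0.2]), k=2, seq_len=4)
print(importance)  # [0.2 0.4 0.4 0.2]
```

Normalizing by coverage avoids inflating interior positions, which are overlapped by more tokens than positions near the sequence ends.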

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.


Similar Articles

1. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021 Aug 9;37(15):2112-2120. doi: 10.1093/bioinformatics/btab083.
2. 4mC site recognition algorithm based on pruned pre-trained DNABert-Pruning model and fused artificial feature encoding. Anal Biochem. 2024 Jun;689:115492. doi: 10.1016/j.ab.2024.115492.
3. DNABERT-S: Pioneering Species Differentiation with Species-Aware DNA Embeddings. ArXiv. 2024 Oct 22:arXiv:2402.08777v3.
4. BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning. Brief Bioinform. 2024 Mar 27;25(3). doi: 10.1093/bib/bbae195.
5. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.
6. Investigation of the BERT model on nucleotide sequences with non-standard pre-training and evaluation of different k-mer embeddings. Bioinformatics. 2023 Oct 3;39(10). doi: 10.1093/bioinformatics/btad617.
7. MRM-BERT: a novel deep neural network predictor of multiple RNA modifications by fusing BERT representation and sequence features. RNA Biol. 2024 Jan;21(1):1-10. doi: 10.1080/15476286.2024.2315384.
8. Highly accurate classification of chest radiographic reports using a deep learning natural language model pre-trained on 3.8 million text reports. Bioinformatics. 2021 Jan 29;36(21):5255-5261. doi: 10.1093/bioinformatics/btaa668.
9. A Fine-Tuned Bidirectional Encoder Representations From Transformers Model for Food Named-Entity Recognition: Algorithm Development and Validation. J Med Internet Res. 2021 Aug 9;23(8):e28229. doi: 10.2196/28229.
10. Prediction of RNA-protein interactions using a nucleotide language model. Bioinform Adv. 2022 Apr 7;2(1):vbac023. doi: 10.1093/bioadv/vbac023.

Cited By

1. ARCADE: Controllable Codon Design from Foundation Models via Activation Engineering. bioRxiv. 2025 Aug 23:2025.08.19.668819. doi: 10.1101/2025.08.19.668819.
2. Pre-training Genomic Language Model with Variants for Better Modeling Functional Genomics. bioRxiv. 2025 Aug 23:2025.02.26.640468. doi: 10.1101/2025.02.26.640468.
3. Machine learning tools for deciphering the regulatory logic of enhancers in health and disease. Front Genet. 2025 Aug 13;16:1603687. doi: 10.3389/fgene.2025.1603687.
4. WaveSeekerNet: accurate prediction of influenza A virus subtypes and host source using attention-based deep learning. Gigascience. 2025 Jan 6;14. doi: 10.1093/gigascience/giaf089.
5. Thrifty wide-context models of B cell receptor somatic hypermutation. Elife. 2025 Aug 29;14:RP105471. doi: 10.7554/eLife.105471.
6. BBANsh: a deep learning architecture based on BERT and bilinear attention networks to identify potent shRNA. Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf443.
7. Prediction of exosomal miRNA-based biomarkers for liquid biopsy. Sci Rep. 2025 Aug 25;15(1):31191. doi: 10.1038/s41598-025-15814-y.
8. NextVir: Enabling classification of tumor-causing viruses with genomic foundation models. PLoS Comput Biol. 2025 Aug 21;21(8):e1013360. doi: 10.1371/journal.pcbi.1013360.
9. LocPro: A deep learning-based prediction of protein subcellular localization for promoting multi-directional pharmaceutical research. J Pharm Anal. 2025 Aug;15(8):101255. doi: 10.1016/j.jpha.2025.101255.
10. Population health management through human phenotype ontology with policy for ecosystem improvement. Front Artif Intell. 2025 Aug 1;8:1496937. doi: 10.3389/frai.2025.1496937.
