

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome.

Authors

Ji Yanrong, Zhou Zhihan, Liu Han, Davuluri Ramana V

Affiliations

Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA.

Department of Computer Science, Northwestern University, Evanston, IL 60208, USA.

Publication

Bioinformatics. 2021 Aug 9;37(15):2112-2120. doi: 10.1093/bioinformatics/btab083.

Abstract

MOTIVATION

Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex owing to polysemy and long-range semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios.

RESULTS

To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, that captures a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT with the most widely used programs for genome-wide prediction of regulatory elements and demonstrated its ease of use, accuracy and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites after straightforward fine-tuning on small task-specific labeled datasets. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences, giving better interpretability and accurate identification of conserved sequence motifs and candidate functional genetic variants. Finally, we demonstrate that DNABERT pre-trained on the human genome can be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks.
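Before BERT-style pre-training and fine-tuning, DNABERT represents a DNA sequence as a series of overlapping k-mer tokens rather than single nucleotides. A minimal sketch of that tokenization step (the function name is illustrative; the paper's released models use k between 3 and 6):

```python
def seq_to_kmers(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1,
    the token representation DNABERT feeds to the transformer."""
    seq = seq.upper()
    if len(seq) < k:
        return []
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(seq_to_kmers("ATGCGTA", k=3))
# ['ATG', 'TGC', 'GCG', 'CGT', 'GTA']
```

Because consecutive tokens share k-1 nucleotides, each position appears in up to k tokens, which is what lets the model attribute importance scores back to individual nucleotides during visualization.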

AVAILABILITY AND IMPLEMENTATION

The source code and the pre-trained and fine-tuned models for DNABERT are available on GitHub (https://github.com/jerryji1993/DNABERT).

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.


