Ji Yanrong, Zhou Zhihan, Liu Han, Davuluri Ramana V
Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA.
Department of Computer Science, Northwestern University, Evanston, IL 60208, USA.
Bioinformatics. 2021 Aug 9;37(15):2112-2120. doi: 10.1093/bioinformatics/btab083.
Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios.
To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrated its ease of use, accuracy and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on the prediction of promoters, splice sites and transcription factor binding sites after simple fine-tuning on small task-specific labeled datasets. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences, for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks.
The source code and the pre-trained and fine-tuned models for DNABERT are available on GitHub (https://github.com/jerryji1993/DNABERT).
Supplementary data are available at Bioinformatics online.