Wang Kai, Zeng Xuan, Zhou Jingwen, Liu Fei, Luan Xiaoli, Wang Xinglong
Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), School of Internet of Things Engineering, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China.
Science Center for Future Foods, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China.
Brief Bioinform. 2024 Mar 27;25(3). doi: 10.1093/bib/bbae195.
Transcription factors (TFs) are proteins essential for regulating gene transcription by binding to transcription factor binding sites (TFBSs) in DNA sequences. Accurate prediction of TFBSs can contribute to the design and construction of TF-based metabolic regulatory systems. Although various deep-learning algorithms have been developed for predicting TFBSs, their prediction performance still leaves room for improvement. This paper proposes a bidirectional encoder representations from transformers (BERT)-based model, called BERT-TFBS, to predict TFBSs solely from DNA sequences. The model consists of a pre-trained BERT module (DNABERT-2), a convolutional neural network (CNN) module, a convolutional block attention module (CBAM) and an output module. BERT-TFBS uses the pre-trained DNABERT-2 module, via transfer learning, to capture complex long-term dependencies in DNA sequences, and applies the CNN module and the CBAM to extract high-order local features. The proposed model is trained and tested on 165 ENCODE ChIP-seq datasets. We conducted experiments with model variants, cross-cell-line validations and comparisons with other models. The experimental results demonstrate the effectiveness and generalization capability of BERT-TFBS in predicting TFBSs, and show that the proposed model outperforms other deep-learning models. The source code for BERT-TFBS is available at https://github.com/ZX1998-12/BERT-TFBS.
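To make the described pipeline concrete, the sketch below shows how the four modules could be chained in PyTorch. This is not the authors' implementation (see their GitHub repository for that): the DNABERT-2 encoder is replaced here by a plain embedding layer as a hypothetical stand-in, and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional block attention module: channel attention followed by
    spatial attention, applied to a (B, C, H, W) feature map."""
    def __init__(self, channels, reduction=4, kernel_size=7):
        super().__init__()
        # Channel attention: shared MLP over avg- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: conv over stacked channel-wise avg and max maps
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))          # (B, C)
        mx = self.mlp(x.amax(dim=(2, 3)))           # (B, C)
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

class BERTTFBSSketch(nn.Module):
    """Illustrative BERT-TFBS-style pipeline. The Embedding layer is a
    placeholder for the pre-trained DNABERT-2 encoder (an assumption, not
    the paper's actual module); vocab/hidden/filter sizes are made up."""
    def __init__(self, vocab_size=4096, hidden=128, n_filters=32):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, hidden)   # stand-in for DNABERT-2
        self.cnn = nn.Sequential(                         # CNN module
            nn.Conv2d(1, n_filters, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.cbam = CBAM(n_filters)                       # attention module
        self.head = nn.Sequential(                        # output module
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(n_filters, 2),                      # binding vs. non-binding
        )

    def forward(self, tokens):                # tokens: (B, L) integer token ids
        h = self.encoder(tokens)              # (B, L, hidden) contextual features
        h = h.unsqueeze(1)                    # treat as a 1-channel 2D map
        h = self.cbam(self.cnn(h))            # high-order local features + attention
        return self.head(h)                   # (B, 2) class logits

model = BERTTFBSSketch()
logits = model(torch.randint(0, 4096, (2, 50)))  # batch of 2 tokenized sequences
print(tuple(logits.shape))                       # (2, 2)
```

In the actual model the encoder output would come from DNABERT-2 (e.g. loaded through a transfer-learning checkpoint) rather than a trainable embedding, and the classifier would be trained with a binary cross-entropy-style objective on the ENCODE ChIP-seq labels.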