He Jia, Zhang Yupeng, Liu Yuhang, Zhou Zhigan, Li Tianhao, Zhang Yongqing, Xie Boqia
School of Computer Science, Chengdu University of Information Technology, Chengdu 610225, China.
Department of Cardiology, Cardiovascular Imaging Center, Beijing Chaoyang Hospital, Capital Medical University, Beijing, China.
Methods. 2025 Feb;234:141-151. doi: 10.1016/j.ymeth.2024.12.006. Epub 2024 Dec 17.
Transcription factor binding sites (TFBSs) are critical in regulating gene expression. Precisely locating TFBSs can reveal how different transcription factors act in gene transcription. Various deep learning methods have been proposed to predict TFBSs; however, these models often struggle to perform well when training data are limited. Furthermore, they typically have complex structures, which makes their decision-making processes difficult to interpret. To address these issues, we developed a framework named BCDB. The framework integrates multi-scale DNA information and employs a dual-branch output strategy. Combining DNABERT, convolutional neural networks (CNNs), and multi-head attention mechanisms enhances feature extraction and significantly improves prediction accuracy. The method aims to balance the extraction of global and local information, improving predictive performance while using attention mechanisms to provide an intuitive explanation of the model's predictions, thereby strengthening its overall interpretability. Results on 165 ChIP-seq datasets show that BCDB significantly outperforms existing deep learning methods. Additionally, because BCDB uses transfer learning, it can transfer knowledge learned from large amounts of unlabeled data to prediction tasks for specific cell lines, enabling cross-cell-line TFBS prediction. The source code for BCDB is available at https://github.com/ZhangLab312/BCDB.
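To make the idea of multi-scale DNA input concrete, the following is a minimal sketch of two common sequence encodings: overlapping k-mer tokens, which is how the original DNABERT tokenizes DNA, and one-hot vectors, which are widely used as CNN input. The abstract does not specify how BCDB builds its inputs, so the function names (`kmer_tokens`, `one_hot`, `multi_scale_views`), the choice of k values, and the handling of ambiguous bases are all assumptions made for illustration and are not taken from the BCDB source code.

```python
from typing import Dict, List, Sequence

# Hypothetical helper names; this is an illustration, not BCDB's actual preprocessing.

def kmer_tokens(seq: str, k: int = 6) -> List[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = seq.upper()
    # A sequence shorter than k yields an empty token list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# One-hot codes in A, C, G, T order.
_ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def one_hot(seq: str) -> List[List[int]]:
    """Encode each base as a 4-dimensional one-hot vector.

    Assumption: ambiguous bases such as N map to an all-zero vector.
    """
    return [_ONE_HOT.get(base, [0, 0, 0, 0]) for base in seq.upper()]


def multi_scale_views(seq: str, ks: Sequence[int] = (3, 6)) -> Dict[int, List[str]]:
    """Tokenize the same sequence at several k-mer sizes (assumed scales)."""
    return {k: kmer_tokens(seq, k) for k in ks}


if __name__ == "__main__":
    example = "ACGTAC"
    print(kmer_tokens(example, 3))
    print(one_hot(example)[:2])
    print(multi_scale_views(example))
```

In a model of the kind described above, the k-mer tokens would feed a pretrained language-model branch and the one-hot matrix a convolutional branch; how BCDB actually combines the two is documented in its repository rather than here.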