School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China.
School of Computer Science, University of South China, Hengyang, Hunan 421001, China.
Bioinformatics. 2024 Aug 2;40(8). doi: 10.1093/bioinformatics/btae461.
Transcription factors are pivotal in regulating gene expression, and accurately identifying transcription factor binding sites (TFBSs) at high resolution is crucial for understanding the mechanisms of gene regulation. Identifying TFBSs from DNA sequences remains a significant challenge in computational biology. A variety of computational approaches have been developed to address it, but these methods are limited in their ability to achieve high-resolution identification and often lack interpretability.
We propose BertSNR, an interpretable deep learning framework for identifying TFBSs at single-nucleotide resolution. BertSNR integrates sequence-level and token-level information through multi-task learning built on pre-trained DNA language models. Benchmarking comparisons show that BertSNR outperforms existing state-of-the-art methods in TFBS prediction. Importantly, we enhance the interpretability of the model through attention weight visualization and motif analysis, and uncover a subtle relationship between attention weights and motifs. Moreover, BertSNR effectively identifies TFBSs in promoter regions, facilitating the study of intricate gene regulation.
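As a rough illustration of what combining sequence-level and token-level objectives in multi-task learning can look like, the sketch below computes a weighted sum of two binary cross-entropy terms: one for whether a sequence contains a binding site, and one averaged over per-token binding labels. The abstract does not specify BertSNR's actual loss, so the function names, the weighting parameter `alpha`, and the choice of binary cross-entropy are all assumptions made for illustration.

```python
import math


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def multitask_loss(seq_prob, seq_label, token_probs, token_labels, alpha=0.5):
    """Hypothetical joint loss: alpha * sequence-level BCE + (1 - alpha) * mean token-level BCE.

    seq_prob     -- predicted probability that the sequence contains a TFBS
    seq_label    -- 1 if the sequence contains a TFBS, else 0
    token_probs  -- predicted per-token probabilities of lying inside a TFBS
    token_labels -- per-token 0/1 labels, same length as token_probs
    alpha        -- assumed task weight in [0, 1]; not taken from the paper
    """
    if not token_probs or len(token_probs) != len(token_labels):
        raise ValueError("token_probs and token_labels must be non-empty and equal in length")
    seq_loss = bce(seq_prob, seq_label)
    tok_loss = sum(bce(p, y) for p, y in zip(token_probs, token_labels)) / len(token_probs)
    return alpha * seq_loss + (1.0 - alpha) * tok_loss


# Example: one sequence-level prediction and four token-level predictions.
loss = multitask_loss(0.9, 1, [0.1, 0.8, 0.7, 0.2], [0, 1, 1, 0], alpha=0.5)
print(round(loss, 4))
```

In a setup like this, the sequence-level term encourages the shared encoder to recognize whether any binding site is present, while the token-level term supplies the positional signal needed for fine-grained localization; how the two are actually balanced in BertSNR would need to be checked against the paper or the source code.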
The BertSNR source code can be found at https://github.com/lhy0322/BertSNR.