Department of Biochemical Engineering & Biotechnology, Indian Institute of Technology (IIT) Delhi, New Delhi 110016, India.
Department of Biochemical Engineering & Biotechnology, Indian Institute of Technology (IIT) Delhi, New Delhi 110016, India; Regional Centre for Biotechnology (RCB), Faridabad, Haryana 121001, India.
J Mol Biol. 2023 Jul 1;435(13):168121. doi: 10.1016/j.jmb.2023.168121. Epub 2023 Apr 24.
Transcription factors (TFs) recognize specific genomic motifs, typically 6-12 bp long, to regulate various aspects of the cellular machinery. The presence of a binding motif and favorable genome accessibility are key drivers of a consistent TF-DNA interaction. Although these prerequisites may be met thousands of times in the genome, there seems to be a high degree of selectivity for the sites that are actually bound. Here, we present a deep-learning framework that identifies and characterizes the genetic elements upstream and downstream of the binding motif for their role in enforcing this selectivity. The framework is based on an interpretable recurrent neural network architecture that enables relative analysis of sequence context features. We apply the framework to model twenty-six transcription factors and score TF-DNA binding at base-pair resolution. We find significant differences in the activations of DNA context features between bound and unbound sequences. In addition to evaluation under standardized protocols, the framework's interpretability enables us to identify and annotate DNA sequences with possible elements that modulate TF-DNA binding. We also find that differences in data processing strongly influence overall model performance. Overall, the proposed framework offers novel insights into non-coding genetic elements and their role in facilitating a stable TF-DNA interaction.
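The abstract does not specify the network's layers, training data, or interpretation method, so the following is only a minimal sketch of what "scoring TF-DNA binding at base-pair resolution with a recurrent network" could look like. Everything in it is an assumption made for illustration: the function names (`one_hot`, `init_weights`, `per_base_scores`), the plain Elman-style recurrence, the hidden size, and the random untrained weights. It is not the authors' model and its scores carry no biological meaning.

```python
import math
import random

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot vectors (A, C, G, T).

    Any character outside ACGT (for example N) becomes an all-zero vector.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def init_weights(hidden_size, seed=0):
    """Random, untrained weights; a hypothetical stand-in for a trained model."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_in": rand_matrix(hidden_size, len(BASES)),   # input -> hidden
        "W_rec": rand_matrix(hidden_size, hidden_size),  # hidden -> hidden
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)],
    }


def per_base_scores(seq, weights):
    """Run a simple Elman-style recurrence and emit one score per base.

    At each position the hidden state is updated from the current base and
    the previous hidden state, then squashed to (0, 1) with a sigmoid.
    The state only sees upstream context; a bidirectional pass would be
    needed to include downstream context as well.
    """
    hidden_size = len(weights["w_out"])
    h = [0.0] * hidden_size
    scores = []
    for x in one_hot(seq):
        # The comprehension reads the previous h before h is reassigned.
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(weights["W_in"][j], x))
                + sum(w * hi for w, hi in zip(weights["W_rec"][j], h))
            )
            for j in range(hidden_size)
        ]
        logit = sum(w * hi for w, hi in zip(weights["w_out"], h))
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


if __name__ == "__main__":
    weights = init_weights(hidden_size=8, seed=1)
    for base, score in zip("ACGTGCA", per_base_scores("ACGTGCA", weights)):
        print(base, round(score, 3))
```

Because each position receives its own score, the output can be laid alongside the sequence to compare the motif with its flanking context, which is the kind of per-base view the abstract describes.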