Alexandari Amr M, Horton Connor A, Shrikumar Avanti, Shah Nilay, Li Eileen, Weilert Melanie, Pufall Miles A, Zeitlinger Julia, Fordyce Polly M, Kundaje Anshul
Department of Computer Science, Stanford University, Stanford, CA 94305.
Department of Genetics, Stanford University, Stanford, CA 94305.
bioRxiv. 2023 May 11:2023.05.11.540401. doi: 10.1101/2023.05.11.540401.
Transcription factors (TFs) are proteins that bind DNA in a sequence-specific manner to regulate gene transcription. Although each TF has its own intrinsic sequence preferences, genomic occupancy profiles of TFs differ across cellular contexts. Hence, deciphering the sequence determinants of TF binding, both intrinsic and context-specific, is essential to understand gene regulation and the impact of regulatory, non-coding genetic variation. Biophysical models trained on TF binding assays can estimate intrinsic affinity landscapes and predict occupancy based on TF concentration and affinity. However, these models cannot adequately explain context-specific binding profiles. Conversely, deep learning models trained on TF binding assays effectively predict and explain genomic occupancy profiles as a function of complex regulatory sequence syntax, albeit without a clear biophysical interpretation. To reconcile these complementary models of TF binding, we developed Affinity Distillation (AD), a method that extracts thermodynamic affinities from deep learning models of TF chromatin immunoprecipitation (ChIP) experiments by marginalizing away the influence of genomic sequence context. Applied to neural networks modeling diverse classes of yeast and mammalian TFs, AD predicts the energetic impacts of sequence variation within and surrounding motifs on TF binding, as measured by diverse assays, with superior dynamic range and accuracy compared to motif-based methods. Furthermore, AD can accurately discern affinities of TF paralogs. Our results highlight thermodynamic affinity as a key determinant of binding, suggest that deep learning models of binding implicitly learn high-resolution affinity landscapes, and show that these affinities can be successfully distilled using AD. This new biophysical interpretation of deep learning models enables high-throughput experiments to explore the influence of sequence context and variation on both intrinsic affinity and occupancy.
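The idea of marginalizing away genomic sequence context can be illustrated with a minimal sketch. The abstract does not give the procedure in detail, so everything here is an assumption made for illustration: the function names (`marginal_effect`, `insert_center`, `toy_predict`), the use of uniformly random backgrounds, and the toy scoring function that stands in for a trained neural network. The sketch inserts a candidate site into many background sequences, takes the change in predicted binding for each, and averages over backgrounds so that context-specific contributions cancel out.

```python
import random

CONSENSUS = "CACGTG"  # hypothetical preferred site used only by the toy model


def random_background(length, rng):
    """Draw a uniformly random DNA sequence (an assumed background model)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def insert_center(background, site):
    """Overwrite the central bases of `background` with `site`."""
    start = (len(background) - len(site)) // 2
    return background[:start] + site + background[start + len(site):]


def toy_predict(seq):
    """Stand-in for a trained model: best consensus match plus a context term.

    This is NOT a real binding model. It only mimics a predictor whose
    output depends on both the site and its surrounding sequence.
    """
    k = len(CONSENSUS)
    best = max(
        sum(a == b for a, b in zip(seq[i:i + k], CONSENSUS))
        for i in range(len(seq) - k + 1)
    )
    gc_fraction = sum(base in "GC" for base in seq) / len(seq)
    return best + 0.5 * gc_fraction


def marginal_effect(predict, site, n_backgrounds=200, length=100, seed=0):
    """Average predicted change from inserting `site` across backgrounds."""
    rng = random.Random(seed)  # fixed seed: the same backgrounds for every site
    total = 0.0
    for _ in range(n_backgrounds):
        background = random_background(length, rng)
        total += predict(insert_center(background, site)) - predict(background)
    return total / n_backgrounds


if __name__ == "__main__":
    for site in ("CACGTG", "CACGTT", "AAAAAA"):
        print(site, round(marginal_effect(toy_predict, site), 3))
```

In this toy setting the consensus site receives the largest marginal score, the single-mismatch variant a smaller one, and the unrelated site the smallest. With a real model, `toy_predict` would be replaced by the network's predicted binding signal, and the resulting scores would still need to be calibrated against measured affinities before being read as energies.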