Interpretable protein-DNA interactions captured by structure-sequence optimization.

Author information

Zhang Yafan, Silvernail Irene, Lin Zhuyang, Lin Xingcheng

Affiliations

Bioinformatics Research Center, North Carolina State University, Raleigh, United States.

Department of Physics, North Carolina State University, Raleigh, United States.

Publication information

eLife. 2025 Jul 17;14:RP105565. doi: 10.7554/eLife.105565.

DOI:10.7554/eLife.105565

PMID:40673435

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12270484/

Abstract

Sequence-specific DNA recognition underlies essential processes in gene regulation, yet methods for simultaneous predictions of genomic DNA recognition sites and their binding affinity remain lacking. Here, we present the Interpretable protein-DNA Energy Associative (IDEA) model, a residue-level, interpretable biophysical model capable of predicting binding sites and affinities of DNA-binding proteins. By fusing structures and sequences of known protein-DNA complexes into an optimized energy model, IDEA enables direct interpretation of physicochemical interactions among individual amino acids and nucleotides. We demonstrate that this energy model can accurately predict DNA recognition sites and their binding strengths across various protein families. Additionally, the IDEA model is integrated into a coarse-grained simulation framework that quantitatively captures the absolute protein-DNA binding free energies. Overall, IDEA provides an integrated computational platform that alleviates experimental costs and biases in assessing DNA recognition and can be utilized for mechanistic studies of various DNA-recognition processes.
