Benos Panayiotis V, Lapedes Alan S, Stormo Gary D
Department of Genetics, School of Medicine, Washington University, Campus Box 8232, St. Louis, MO 63110, USA.
J Mol Biol. 2002 Nov 1;323(4):701-27. doi: 10.1016/s0022-2836(02)00917-8.
A recognition code for protein-DNA interactions would allow for the prediction of binding sites based on protein sequence, and the identification of binding proteins for specific DNA targets. Crystallographic studies of protein-DNA complexes showed that a simple, deterministic recognition code does not exist. Here, we present a probabilistic recognition code (P-code) that assigns energies to all possible base-pair-amino acid interactions for the early growth response factor (EGR) family of zinc-finger transcription factors. The specific energy values are determined by a maximum likelihood method using examples from in vitro randomisation experiments (namely, SELEX and phage display) reported in the literature. The accuracy of the model is tested in several ways, including the ability to predict in vivo binding sites of EGR proteins and other non-EGR zinc-finger proteins, and the correlation between predicted and measured binding affinities of various EGR proteins to several different DNA sites. We also show that this model improves significantly upon the prediction capabilities of previous qualitative and quantitative models. The probabilistic code we develop uses information about the interacting positions between the protein and DNA, but we show that such information is not necessary, although it reduces the number of parameters to be determined. We also employ the assumption that the total binding energy is the sum of the energies of the individual contacts, but we describe how that assumption can be relaxed at the cost of additional parameters.
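The additivity assumption described above can be illustrated with a minimal sketch. Everything concrete here is hypothetical and chosen only for illustration: the contact map `CONTACTS`, the energy values in `ENERGY`, the `DEFAULT_ENERGY` fallback, and the three-residue, three-base toy system. None of it is taken from the paper's fitted parameters. The Boltzmann-style normalisation used to turn energies into site probabilities is likewise an assumption of this sketch, not a statement of the authors' exact likelihood formulation.

```python
import math
from itertools import product

# Hypothetical contact map: (index into protein residues, index into DNA site).
CONTACTS = [(0, 2), (1, 1), (2, 0)]

# Made-up energies (arbitrary units, lower = more favourable), illustration only.
ENERGY = {
    ("R", "G"): -2.0,
    ("E", "C"): -1.0,
    ("H", "A"): -0.5,
}
DEFAULT_ENERGY = 0.0  # assumed energy for any amino acid/base pair not listed


def binding_energy(residues, site):
    """Sum the per-contact energies (the additivity assumption)."""
    return sum(
        ENERGY.get((residues[p], site[d]), DEFAULT_ENERGY)
        for p, d in CONTACTS
    )


def site_probabilities(residues, sites):
    """Boltzmann-style weights exp(-E), normalised over the candidate sites."""
    weights = {s: math.exp(-binding_energy(residues, s)) for s in sites}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


residues = "REH"  # toy stand-in for the DNA-contacting residues of a protein
all_sites = ["".join(b) for b in product("ACGT", repeat=3)]
probs = site_probabilities(residues, all_sites)
best_site = max(probs, key=probs.get)
```

Relaxing additivity, as the abstract notes, would mean adding energy terms that depend on more than one contact at a time, which increases the number of parameters to fit.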