Kono H, Sarai A
Tsukuba Life Science Center, The Institute of Physical & Chemical Research (RIKEN), Ibaraki, Japan.
Proteins. 1999 Apr 1;35(1):114-31.
Regulatory proteins play a critical role in controlling complex spatial and temporal patterns of gene expression in higher organisms by recognizing multiple DNA sequences and regulating multiple target genes. The growing body of structural data on protein-DNA complexes provides clues to the mechanism of target recognition by regulatory proteins. Analyses of the propensities of base-amino acid interactions observed in these structures show that there is no one-to-one correspondence in the interactions, although clear preferences exist. On the other hand, analysis of the spatial distribution of amino acids around bases shows that even amino acids with a strong base preference, such as Arg for G, are distributed over a wide space around the bases. Thus, amino acids in many different geometries can form similar types of interaction with bases. This redundancy and structural flexibility suggest that there are no simple rules for sequence recognition and that its prediction is not straightforward. However, the spatial distributions of amino acids around bases indicate that the structural data could be used to derive empirical interaction potentials between amino acids and bases. Such information extracted from structural databases has been used successfully to predict amino acid sequences that fold into particular protein structures. We surmised that the structures of protein-DNA complexes could be used to predict DNA target sites for regulatory proteins, because determining the DNA sequences that bind to a particular protein structure should be similar to finding the amino acid sequences that fold into a particular structure. Here we demonstrate that structural data can be used to predict DNA target sequences for regulatory proteins. Pairwise potentials that determine the interaction between bases and amino acids were derived empirically from the structural data.
These potentials were then used to examine the compatibility between DNA sequences and the protein-DNA complex structure in a combinatorial "threading" procedure. We applied this strategy to the structures of protein-DNA complexes to predict the DNA binding sites recognized by regulatory proteins. To test the applicability of the method to target-site prediction, we examined the effects of cognate and noncognate binding, cooperative binding, and DNA deformation on binding specificity, and we predicted binding sites in real promoters and compared them with experimental data. The results show that target binding sites for several regulatory proteins are predicted successfully, and our data suggest that this method can serve as a powerful tool for predicting multiple target sites and target genes for regulatory proteins.
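The two steps named in the abstract, deriving pairwise base-amino acid potentials from structural statistics and threading DNA sequences through a fixed complex, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a contact-only potential of the inverse-Boltzmann form -ln(observed/expected) and ignores the spatial distribution of amino acids around bases that the abstract describes. The counts, the contact list, and all function names are hypothetical and were invented for illustration.

```python
import itertools
import math
import statistics

BASES = "ACGT"

# Hypothetical contact counts (base, amino acid) -> occurrences.
# These numbers are made up for illustration; they are NOT data from the paper.
TOY_COUNTS = {
    ("G", "ARG"): 30, ("A", "ARG"): 5, ("C", "ARG"): 3, ("T", "ARG"): 4,
    ("A", "ASN"): 20, ("G", "ASN"): 4, ("C", "ASN"): 5, ("T", "ASN"): 6,
}

# Hypothetical fixed "structure": which amino acid contacts which DNA position.
TOY_CONTACTS = [(0, "ARG"), (1, "ASN"), (2, "ARG")]


def derive_potential(counts, pseudocount=1.0):
    """Return {(base, aa): energy} with energy = -ln(observed / expected).

    Assumed functional form: expected counts come from the product of the
    base and amino acid marginals. A pseudocount avoids log(0).
    """
    aas = sorted({aa for _, aa in counts})
    padded = {(b, a): counts.get((b, a), 0) + pseudocount
              for b in BASES for a in aas}
    total = sum(padded.values())
    base_tot = {b: sum(padded[(b, a)] for a in aas) for b in BASES}
    aa_tot = {a: sum(padded[(b, a)] for b in BASES) for a in aas}
    return {
        (b, a): -math.log(padded[(b, a)] * total / (base_tot[b] * aa_tot[a]))
        for b in BASES for a in aas
    }


def thread_energy(seq, contacts, potential):
    """Sum pairwise energies for one DNA sequence placed on the fixed contacts."""
    return sum(potential[(seq[pos], aa)] for pos, aa in contacts)


def rank_sequences(length, contacts, potential):
    """Thread every DNA sequence of the given length; return
    (sequence, energy, z_score) tuples sorted from lowest to highest energy."""
    scored = []
    for letters in itertools.product(BASES, repeat=length):
        seq = "".join(letters)
        scored.append((seq, thread_energy(seq, contacts, potential)))
    energies = [e for _, e in scored]
    mean = statistics.fmean(energies)
    sd = statistics.pstdev(energies)
    scored.sort(key=lambda item: item[1])
    return [(seq, e, (e - mean) / sd) for seq, e in scored]


if __name__ == "__main__":
    potential = derive_potential(TOY_COUNTS)
    for seq, energy, z in rank_sequences(3, TOY_CONTACTS, potential)[:5]:
        print(f"{seq}  E={energy:+.3f}  Z={z:+.2f}")
```

Exhaustive enumeration is only practical for short sites, since the number of sequences grows as 4^length. The Z-score against all threaded sequences is one common way to express how strongly a structure discriminates a candidate site from the background, and it is used here as an assumption rather than as the paper's stated measure.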