Wojcik J, Mornon J P, Chomilier J
Systèmes Moléculaires et Biologie Structurale Laboratoire de Minéralogie-Cristallographie (LMCP), Universités Paris VI et Paris VII, Cedex 05, Paris, CNRS UMR7590, France.
J Mol Biol. 1999 Jun 25;289(5):1469-90. doi: 10.1006/jmbi.1999.2826.
A bank of 13,563 loops from three to eight amino acid residues long, representing motifs between two consecutive regular secondary structures, has been derived from protein structures presenting less than 95% sequence identity. Statistical analyses of occurrences of conformations and residues revealed length-dependent over-representations of particular amino acids (glycine, proline, asparagine, serine, and aspartate) and conformations (αL, ε, and βP regions of the Ramachandran plot). A position-dependent distribution of these occurrences was observed for N- and C-terminal residues, and it is correlated with the nature of the flanking regions. Loops of the same length were clustered into statistically meaningful families on the basis of their backbone structures when placed in a common reference frame, independent of the flanks. These clusters present significantly different distributions of sequence, conformations, and endpoint residue Cα distances. On the basis of the sequence-structure correlation of this clustering, an automatic loop modeling algorithm was developed. Based on the knowledge of its sequence and of its flank backbone structures, each query loop is assigned to a family, and target loop supports are selected in this family. The support backbones of these target loops are then adjusted on the flanking structures by partial exploration of the conformational space. Loop closure is performed by energy minimization for each support, and the final model is chosen among connected supports based upon energy criteria. The quality of the prediction is evaluated by the root-mean-square deviation (rmsd) between the final model and the native loops when the whole bank is re-attributed on itself with a jackknife test. This average rmsd ranges from 1.1 Å for three-residue loops to 3.8 Å for eight-residue loops. A few poorly predicted loops are inescapable, considering the high level of diversity in loops and the lack of environment data.
To overcome such modeling problems, a statistical reliability score was assigned for each prediction. This score is correlated to the quality of the prediction, in terms of rmsd, and thus improves the selection accuracy of the model. The algorithm efficiency was compared to CASP3 target loop predictions. Moreover, when tested on a test loop bank, this algorithm was shown to be robust when the loops are not precisely delimited, therefore proving to be a useful tool in practice for protein modeling.
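The rmsd quality measure mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the model and native loops are given as equal-length lists of (x, y, z) atom coordinates in Å that are already expressed in a common reference frame; the abstract does not state which atoms are compared or whether a superposition step is applied, so the function name `rmsd` and the toy coordinates below are purely illustrative.

```python
import math


def rmsd(model, native):
    """Root-mean-square deviation between two matched coordinate lists.

    Assumption: both lists hold (x, y, z) tuples for the same atoms in the
    same order, already placed in a common frame (no superposition is done).
    """
    if len(model) != len(native) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    # Sum of squared distances between corresponding atoms.
    squared = sum(
        (a - b) ** 2
        for p, q in zip(model, native)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(model))


# Toy example (hypothetical coordinates): every atom is shifted by 1 Å along z.
native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]
print(rmsd(model, native))  # 1.0
```

In a jackknife-style evaluation such as the one described, a value of this kind would be computed for each loop predicted with that loop excluded from the bank, and then averaged per loop length.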