Kihara Daisuke, Skolnick Jeffrey
Center of Excellence in Bioinformatics, University at Buffalo, 901 Washington St, Suite 300, Buffalo, NY 14203, USA.
J Mol Biol. 2003 Dec 5;334(4):793-802. doi: 10.1016/j.jmb.2003.10.027.
Structure comparisons of all representative proteins have been done. Employing the relative root mean square deviation (RMSD) from native enables the assessment of the statistical significance of structure alignments of different lengths in terms of a Z-score. Two conclusions emerge: first, proteins with their native fold can be distinguished by their Z-score. Second and somewhat surprising, all small proteins up to 100 residues in length have significant structure alignments to other proteins in a different secondary structure and fold class; i.e. 24.0% of them have 60% coverage by a template protein with an RMSD below 3.5 Å and 6.0% have 70% coverage. If the restriction that we align only proteins having different secondary structure types is removed, then in a representative benchmark set of proteins of 200 residues or smaller, 93% can be aligned to a single template structure (with average sequence identity of 9.8%), with an RMSD less than 4 Å, and 79% average coverage. In this sense, the current Protein Data Bank (PDB) is almost a covering set of small protein structures. The length of the aligned region (relative to the whole protein length) does not differ among the top hit proteins, indicating that protein structure space is highly dense. For larger proteins, non-related proteins can cover a significant portion of the structure. Moreover, these top hit proteins are aligned to different parts of the target protein, so that almost the entire molecule can be covered when combined. The number of proteins required to cover a target protein is very small, e.g. the top ten hit proteins can give 90% coverage below an RMSD of 3.5 Å for proteins up to 320 residues long. These results give a new view of the nature of protein structure space, and its implications for protein structure prediction are discussed.
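The notion of combined coverage used in the abstract (several top hit templates aligned to different parts of a target) can be illustrated with a minimal sketch. This is not the authors' code: the function name `combined_coverage`, the representation of alignments as inclusive 0-based residue ranges, and the example numbers are all assumptions made for illustration. It also assumes each listed segment already satisfies the RMSD cutoff of interest.

```python
def combined_coverage(target_length, aligned_segments):
    """Fraction of target residues covered by at least one template.

    aligned_segments is a hypothetical representation: one list per
    template, each holding (start, end) residue ranges on the target,
    0-based and inclusive.
    """
    covered = set()
    for segments in aligned_segments:
        for start, end in segments:
            # Collect every target residue index this segment spans.
            covered.update(range(start, end + 1))
    return len(covered) / target_length


# Illustrative values only: two templates covering overlapping
# parts of a 100-residue target.
template_a = [(0, 59)]   # covers residues 0..59
template_b = [(50, 89)]  # covers residues 50..89
coverage = combined_coverage(100, [template_a, template_b])
print(f"combined coverage: {coverage:.0%}")
```

Taking the union of residue indices, rather than summing segment lengths, avoids counting residues twice where templates overlap.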