Gerstein M, Levitt M
Molecular Biophysics & Biochemistry Department, Yale University, New Haven, Connecticut 06520-8114, USA.
Protein Sci. 1998 Feb;7(2):445-56. doi: 10.1002/pro.5560070226.
We apply a simple method for aligning protein sequences on the basis of 3D structure, on a large scale, to the proteins in the scop classification of fold families. This allows us to assess, understand, and improve our automatic method against an objective, manually derived standard, a type of comprehensive evaluation that has not yet been possible for other structural alignment algorithms. Our basic approach directly matches the backbones of two structures, using repeated cycles of dynamic programming and least-squares fitting to determine an alignment that minimizes coordinate difference. Because of its simplicity, our method can be readily modified to take into account additional features of protein structure, such as the orientation of side chains or the location-dependent cost of opening a gap. Our basic method, augmented by such modifications, can find reasonable alignments for all but 1.5% of the known structural similarities in scop, i.e., all but 32 of the 2,107 superfamily pairs. We discuss the specific protein structural features that make these 32 pairs so difficult to align and show how our procedure effectively partitions the relationships in scop into different categories, depending on what aspects of protein structure are involved (e.g., depending on whether or not consideration of side-chain orientation is necessary for proper alignment). We also show how our pairwise alignment procedure can be extended to generate a multiple alignment for a group of related structures. We have compared these alignments in detail with corresponding manual ones culled from the literature. We find good agreement (to within 95% for the core regions), and detailed comparison highlights how particular protein structural features (such as certain strands) are problematic to align, giving somewhat ambiguous results. With these improvements and systematic tests, our procedure should be useful for the development of scop and the future classification of protein folds.
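The "repeated cycles of dynamic programming and least-squares fitting" described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes NumPy, C-alpha coordinates supplied as N x 3 arrays, a Kabsch (SVD-based) least-squares fit, a linear gap penalty, and an initial pairing of residues by index. The function names (`kabsch_fit`, `dp_align`, `structural_align`), the distance-to-score conversion `M / (1 + (d/d0)^2)`, and the parameter values are illustrative assumptions and are not taken from the abstract.

```python
import numpy as np


def kabsch_fit(P, Q):
    """Least-squares rigid fit: return R, t minimizing ||(P @ R.T + t) - Q||."""
    Pc, Qc = P.mean(axis=0), Q.mean(axis=0)
    H = (P - Pc).T @ (Q - Qc)                # 3x3 covariance of centered coordinates
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = Qc - R @ Pc
    return R, t


def dp_align(score, gap):
    """Global dynamic programming (Needleman-Wunsch) with a linear gap penalty.

    Returns the list of matched index pairs (i, j), increasing in both indices.
    """
    n, m = score.shape
    F = np.zeros((n + 1, m + 1))
    F[1:, 0] = -gap * np.arange(1, n + 1)
    F[0, 1:] = -gap * np.arange(1, m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i, j] = max(F[i - 1, j - 1] + score[i - 1, j - 1],
                          F[i - 1, j] - gap,
                          F[i, j - 1] - gap)
    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if F[i, j] == F[i - 1, j - 1] + score[i - 1, j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif F[i, j] == F[i - 1, j] - gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def structural_align(A, B, n_cycles=20, M=20.0, d0=2.24, gap=10.0):
    """Alternate least-squares fitting and DP until the pairing stops changing.

    A, B: (N, 3) and (K, 3) coordinate arrays with at least 3 points each.
    M, d0, gap are assumed, illustrative parameters.
    """
    k = min(len(A), len(B))
    pairs = [(i, i) for i in range(k)]       # assumed starting guess: pair by index
    for _ in range(n_cycles):
        ia, ib = (list(x) for x in zip(*pairs))
        R, t = kabsch_fit(A[ia], B[ib])      # fit A onto B using the current pairs
        A_fit = A @ R.T + t
        # All inter-structure distances after superposition, converted to scores.
        d = np.linalg.norm(A_fit[:, None, :] - B[None, :, :], axis=2)
        new_pairs = dp_align(M / (1.0 + (d / d0) ** 2), gap)
        if len(new_pairs) < 3 or new_pairs == pairs:
            break                            # too few pairs to fit, or converged
        pairs = new_pairs
    ia, ib = (list(x) for x in zip(*pairs))
    R, t = kabsch_fit(A[ia], B[ib])          # final fit on the returned pairing
    return pairs, R, t
```

Calling `structural_align(A, B)` returns the matched residue index pairs together with the rotation `R` and translation `t` that superimpose `A` onto `B`. The modifications mentioned in the abstract, such as a location-dependent gap-opening cost or a side-chain orientation term, would in a sketch like this enter through the gap penalty and the score matrix passed to `dp_align`; how the authors actually implement them is not specified in the abstract.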