Young M M, Skillman A G, Kuntz I D
Department of Pharmaceutical Chemistry, University of California, San Francisco 94143-0446, USA.
Proteins. 1999 Feb 15;34(3):317-32.
We have developed an automatic protein fingerprinting method that evaluates structural similarity between proteins from the composition, spatial arrangement, lengths, and topology of their secondary structure elements. The method rapidly identifies proteins sharing structural homology, as we demonstrate with five test cases: the globins, the mammalian trypsin-like serine proteases, the immunoglobulins, the cupredoxins, and the actin-like ATPase domain-containing proteins. Principal components analysis of the similarity distance matrix calculated from an all-by-all comparison of 1,031 unique chains in the Protein Data Bank produces a distribution of structures within a high-dimensional structural space. Six axes account for fifty percent of the variance in this distribution, and two of them encode structural variability within two large families, the immunoglobulins and the trypsin-like serine proteases. Many aspects of the spatial distribution remain stable when the database is reduced to 140 proteins with minimal family overlap, although the axes correlated with specific structural families are no longer observed. The arrangement of protein structures in this space shows a clear hierarchy of organization. At the highest level, protein structures populate regions corresponding to the all-alpha, all-beta, and alpha/beta superfamilies. Large protein families are arranged along family-specific axes, forming local, densely populated regions within the space. The lowest level of organization is intrafamilial: homologous structures are ordered by variations in peripheral secondary structure elements or by conformational shifts in the tertiary structure.
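The step from an all-by-all distance matrix to a set of axes with a stated share of the variance can be illustrated with a small sketch. The abstract does not say how the principal components analysis was applied to the distance matrix, so this example assumes classical multidimensional scaling (principal coordinates analysis): the squared distances are double-centered and the leading eigenpairs are extracted by power iteration. The four 2D points stand in for fingerprint comparisons and are invented toy data, and all function names (`double_center`, `top_eigenpairs`, `axes_for_variance`) are hypothetical.

```python
import math


def double_center(dist):
    """Turn a symmetric distance matrix into a Gram matrix, B = -1/2 * J * D^2 * J."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]


def top_eigenpairs(mat, k, iters=500):
    """Leading k eigenpairs of a symmetric matrix via power iteration with deflation."""
    n = len(mat)
    work = [row[:] for row in mat]
    pairs = []
    for _ in range(k):
        # Deterministic, non-constant start vector, normalized to unit length.
        vec = [1.0 + 0.1 * i for i in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _ in range(iters):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break  # nothing left to extract
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue for the converged unit vector.
        val = sum(vec[i] * sum(work[i][j] * vec[j] for j in range(n))
                  for i in range(n))
        pairs.append((val, vec))
        # Deflate so the next pass finds the next-largest eigenvalue.
        for i in range(n):
            for j in range(n):
                work[i][j] -= val * vec[i] * vec[j]
    return pairs


def axes_for_variance(values, total, fraction):
    """Smallest number of leading axes whose eigenvalues reach `fraction` of `total`."""
    cumulative = 0.0
    for count, val in enumerate(values, start=1):
        cumulative += val
        if cumulative >= fraction * total:
            return count
    return None  # the supplied axes never reach the requested share


# Toy data (an assumption): two tight "families" far apart in a 2D plane.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
dist = [[math.dist(p, q) for q in points] for p in points]

gram = double_center(dist)
total_variance = sum(gram[i][i] for i in range(len(gram)))  # trace = sum of eigenvalues
eigs = top_eigenpairs(gram, k=2)
values = [val for val, _ in eigs]

# Coordinates of each point on each retained axis.
coords = [[vec[i] * math.sqrt(max(val, 0.0)) for val, vec in eigs]
          for i in range(len(points))]

for axis, val in enumerate(values, start=1):
    print(f"axis {axis}: {val / total_variance:.1%} of variance")
print("axes needed for 50% of variance:",
      axes_for_variance(values, total_variance, 0.5))
```

In this toy case the first axis separates the two clusters and carries almost all of the variance, which loosely mirrors how a large family can dominate an axis. For a matrix of about a thousand chains, a library eigensolver would be a more practical choice than power iteration.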