Aloy Patrick, Oliva Baldomero, Querol Enrique, Aviles Francesc X, Russell Robert B
EMBL, Biocomputing, Meyerhofstrasse 1, D-69117 Heidelberg, Germany.
Protein Sci. 2002 May;11(5):1101-16. doi: 10.1110/ps.3950102.
Structural biology now moves fast enough that a protein's three-dimensional structure is often known before its function, which makes methods for assigning homology by structure comparison increasingly important. Previous research has suggested that sequence similarity after structure-based alignment is one of the best discriminators of homology and, often, of functional similarity. Here, we exploit this observation, together with a merger of protein structure and sequence databases, to predict distant homologous relationships. We use the Structural Classification of Proteins (SCOP) database to link sequence alignments from the SMART and Pfam databases, and thus provide new alignments that could not be constructed easily in the absence of known three-dimensional structures. We then extend the method of Murzin (1993b) to assign statistical significance to sequence identities found after structural alignment and thus suggest the best link between diverse sequence families. We find that several distantly related protein sequence families can be linked with confidence, showing that the approach offers a means of inferring homologous relationships, and thus possible functions, for proteins of known structure but unknown function. The analysis also finds several new potential superfamilies, where inspection of the associated alignments and superimpositions reveals conservation of unusual structural features or co-location of conserved amino acids and bound substrates. We discuss implications for Structural Genomics initiatives and for improvements to sequence comparison methods.
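As a rough illustration of what "assigning statistical significance to sequence identities found after structural alignment" can look like in practice, the sketch below counts identical residues over the structurally aligned (non-gap) columns of two sequences and estimates how often an equal or higher count arises by chance. It is not the Murzin (1993b) statistic or the extension described here: a generic shuffling (permutation) test is swapped in as a stand-in. The function names `aligned_identity` and `shuffle_p_value`, the gap character, and the toy sequences are all assumptions made for the example.

```python
import random


def aligned_identity(seq_a, seq_b, gap="-"):
    """Return (identical, compared) over columns where neither sequence has a gap.

    Both inputs are rows of the same structure-based alignment, so they
    must have equal length. Hypothetical helper, not from the paper.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    identical = sum(1 for a, b in pairs if a == b)
    return identical, len(pairs)


def shuffle_p_value(seq_a, seq_b, trials=1000, seed=0, gap="-"):
    """Estimate the chance of seeing at least the observed number of identities.

    Assumption: a simple permutation test. The aligned residues of seq_b are
    shuffled among the compared columns, which preserves its composition
    but destroys any positional correspondence with seq_a.
    """
    observed, _ = aligned_identity(seq_a, seq_b, gap)
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    col_a = [a for a, _ in pairs]
    col_b = [b for _, b in pairs]
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    hits = 0
    for _ in range(trials):
        rng.shuffle(col_b)
        if sum(1 for a, b in zip(col_a, col_b) if a == b) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)


if __name__ == "__main__":
    # Toy alignment rows, invented for illustration only.
    row_a = "ACD-EFGHIK"
    row_b = "ACE-EFGHLK"
    identical, compared = aligned_identity(row_a, row_b)
    print(identical, compared, shuffle_p_value(row_a, row_b))
```

A shuffling test of this kind only accounts for residue composition. A method suited to real structural alignments would also need to consider alignment length and the number of family pairs compared, which this sketch does not attempt.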