Joint Center for Structural Genomics, Center for Research in Biological Systems, University of California, San Diego, California 92093-0446, USA.
Proteins. 2011 Aug;79(8):2389-402. doi: 10.1002/prot.23049. Epub 2011 May 31.
The protein universe can be organized into families that group proteins sharing common ancestry. Such families display variable levels of structural and functional divergence, from homogeneous families, where all members have the same function and very similar structure, to very divergent families, where large variations in function and structure are observed. For practical purposes of structure and function prediction, it would be beneficial to identify sub-groups of proteins with highly similar structures (iso-structural) and/or functions (iso-functional) within divergent protein families. We compared three algorithms in their ability to cluster large protein families and discuss whether any of these methods could reliably identify such iso-structural or iso-functional groups. We show that clustering using profile-sequence and profile-profile comparison methods closely reproduces clusters based on similarities between 3D structures or clusters of proteins with similar biological functions. In contrast, the still commonly used sequence-based methods with fixed thresholds result in vast overestimates of structural and functional diversity in protein families. As a result, these methods also overestimate the number of protein structures that have to be determined to fully characterize the structural space of such families. The fact that one can build reliable models based on apparently distantly related templates is crucial for extracting the maximal amount of information from new sequencing projects.
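The dependence of cluster counts on a fixed similarity threshold can be illustrated with a minimal sketch. This is not the authors' pipeline, and the abstract does not name the three algorithms compared; the function `cluster_at_threshold`, the single-linkage rule, and the `TOY_IDENTITY` matrix are all hypothetical, chosen only to show how a stricter cutoff splits one family into more groups.

```python
from typing import Dict, List


def cluster_at_threshold(sim: List[List[float]], threshold: float) -> List[List[int]]:
    """Single-linkage clustering: merge i and j whenever sim[i][j] >= threshold."""
    n = len(sim)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(x: int) -> int:
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical pairwise sequence identities (fractions) for five proteins
# assumed to belong to one family; these values are invented for illustration.
TOY_IDENTITY = [
    [1.00, 0.45, 0.22, 0.18, 0.15],
    [0.45, 1.00, 0.25, 0.20, 0.17],
    [0.22, 0.25, 1.00, 0.35, 0.19],
    [0.18, 0.20, 0.35, 1.00, 0.21],
    [0.15, 0.17, 0.19, 0.21, 1.00],
]

if __name__ == "__main__":
    for t in (0.40, 0.30, 0.20):
        clusters = cluster_at_threshold(TOY_IDENTITY, t)
        print(f"threshold={t:.2f}: {len(clusters)} cluster(s) {clusters}")
```

With these invented values, the 0.40 cutoff yields four groups, 0.30 yields three, and 0.20 yields one. Which grouping best matches structural or functional similarity cannot be read off the sequence identities alone; that is the question the comparison in the abstract addresses.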