Assessing the reliability of sequence similarities detected through hydrophobic cluster analysis.

Author information

Silva Pedro J

Affiliation

REQUIMTE, Fac. de Ciências da Saúde, Univ. Fernando Pessoa, Rua Carlos da Maia, 296, 4200-150 Porto-Portugal.

Publication information

Proteins. 2008 Mar;70(4):1588-94. doi: 10.1002/prot.21803.

DOI:10.1002/prot.21803

PMID:17918727

Abstract

Hydrophobic cluster analysis (HCA) has long been used as a tool to detect distant homologies between protein sequences, and to classify them into different folds. However, it relies on expert human intervention, and is sensitive to subjective interpretations of pattern similarities. In this study, we describe a novel algorithm to assess the similarity of hydrophobic amino acid distributions between two sequences. Our algorithm correctly identifies as misattributions several HCA-based proposals of structural similarity between unrelated proteins present in the literature. We have also used this method to identify the proper fold of a large variety of sequences, and to automatically select the most appropriate structure for homology modeling of several proteins with low sequence identity to any other member of the Protein Data Bank. Automatic modeling of the target proteins based on these templates yielded structures with TM-scores (vs. experimental structures) above 0.60, even without further refinement. Besides enabling a reliable identification of the correct fold of an unknown sequence and the choice of suitable templates, our algorithm also shows that whereas most structural classes of proteins are very homogeneous in hydrophobic cluster composition, a tenth of the described families are compatible with a large variety of hydrophobic patterns. We have built a browsable database of every major representative hydrophobic cluster pattern present in each structural class of proteins, freely available at http://www2.ufp.pt/pedros/HCA_db/index.htm.
