Yona G, Linial N, Tishby N, Linial M
Institute of Computer Science, Hebrew University, Jerusalem, Israel. golany,nati,
Proc Int Conf Intell Syst Mol Biol. 1998;6:212-21.
We investigate the space of all protein sequences. We combine the standard measures of similarity (SW, FASTA, BLAST) to associate with each sequence an exhaustive list of neighboring sequences. These lists induce a weighted directed graph whose vertices are the sequences. The weight of an edge connecting two sequences represents their degree of similarity. This graph encodes many of the fundamental properties of the sequence space. We look for clusters of related proteins in this graph. These clusters correspond to strongly connected sets of vertices. Two main ideas underlie our work: i) Interesting homologies among proteins can be deduced by transitivity. ii) Transitivity should be applied restrictively in order to prevent unrelated proteins from clustering together. Our analysis starts from a very conservative classification, based on highly significant similarities, that has many classes. Subsequently, classes are merged to include less significant similarities. Merging is performed via a novel two-phase algorithm. First, the algorithm identifies groups of possibly related clusters (based on transitivity and strong connectivity) using local considerations, and merges them. Then, a global test is applied to identify nuclei of strong relationships within these groups of clusters, and the classification is refined accordingly. This process takes place at varying thresholds of statistical significance: at each step the algorithm is applied to the classes of the previous classification to obtain the next one, at a more permissive threshold. Consequently, a hierarchical organization of all proteins is obtained. The resulting classification splits the space of all protein sequences into well-defined groups of proteins. The results show that the automatically induced sets of proteins are closely correlated with natural biological families and superfamilies.
The hierarchical organization reveals finer subfamilies that make up known families of proteins, as well as many interesting relations between protein families. The proposed hierarchical organization may be considered the first map of the space of all protein sequences. An interactive web site including the results of our analysis has been constructed, and is now accessible through http://www.protomap.cs.huji.ac.il.
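The idea of clustering by strong connectivity at progressively more permissive thresholds can be illustrated with a small Python sketch. This is not the two-phase algorithm described above: it omits the local merging step and the global test, and simply recomputes strongly connected components (via Kosaraju's algorithm) at each threshold. The sequence names, the scores, the thresholds, and the convention that a lower score means a more significant similarity (as with an e-value) are all assumptions made for illustration.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm; returns a list of frozensets of nodes."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # First pass: record nodes in order of DFS completion (iterative DFS).
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Second pass: sweep the reversed graph in reverse completion order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        comp, todo = [start], [start]
        while todo:
            node = todo.pop()
            for nxt in rev[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    comp.append(nxt)
                    todo.append(nxt)
        components.append(frozenset(comp))
    return components


def hierarchical_clusters(nodes, scored_edges, thresholds):
    """Cluster at each threshold, strictest first.

    scored_edges maps (u, v) to a score; lower means more significant
    (an assumed, e-value-like convention).
    """
    levels = []
    for t in sorted(thresholds):
        kept = [edge for edge, score in scored_edges.items() if score <= t]
        levels.append((t, strongly_connected_components(nodes, kept)))
    return levels


# Hypothetical toy data: five sequences and made-up significance scores.
nodes = ["A", "B", "C", "D", "E"]
scored_edges = {
    ("A", "B"): 1e-50, ("B", "A"): 1e-45,
    ("B", "C"): 1e-10, ("C", "A"): 1e-8,
    ("C", "D"): 1e-3,  # one-directional link between the two groups
    ("D", "E"): 1e-30, ("E", "D"): 1e-30,
}
levels = hierarchical_clusters(nodes, scored_edges, [1e-20, 1e-5, 1e-2])
for threshold, clusters in levels:
    print(threshold, sorted(sorted(c) for c in clusters))
```

In this toy example, C joins the cluster of A and B once the threshold admits the cycle A, B, C, whereas the one-directional edge from C to D never merges the two groups. Requiring strong connectivity in this way is one simple means of applying transitivity restrictively.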