Linial M, Linial N, Tishby N, Yona G
Department of Biological Chemistry, Institute of Life Sciences, Hebrew University, Jerusalem, Israel.
J Mol Biol. 1997 May 2;268(2):539-56. doi: 10.1006/jmbi.1997.0948.
A global classification of all currently known protein sequences is performed. Every protein sequence is partitioned into segments of 50 amino acid residues, and a dynamic programming distance is calculated between each pair of segments. This space of segments is first embedded into Euclidean space. The algorithm that we apply embeds every finite metric space into Euclidean space so that (1) the dimension of the host space is small and (2) the metric distortion is small. A novel self-organized, cross-validated clustering algorithm is then applied to the embedded space with Euclidean distances. We monitor the validity of our clustering by randomly splitting the data into two parts and running a hierarchical clustering algorithm independently on each part. At every level of the hierarchy we cross-validate the clusters in one part against the clusters in the other. The resulting hierarchical tree of clusters offers a new representation of protein sequences and families, which compares favorably with the most up-to-date classifications based on functional and structural data about proteins. Some of the known families fall into well-separated clusters. Motifs and domains such as the zinc finger, EF hand, homeobox, EGF-like and others are identified automatically and correctly, and relations between protein families are revealed by examining the splits along the tree. From this representation, functional biological kinship of protein families can be deduced, as demonstrated for the transporter family. Finally, we introduce a new concise representation for complete proteins that is very useful in presenting multiple alignments and in searching for close relatives in the database. The self-organization method presented is very general and applies to any data with a consistent and computable measure of similarity between data items.
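The first three steps described above (partitioning into segments, a dynamic programming distance, and an embedding into Euclidean space) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract gives neither the scoring scheme of the distance nor the embedding construction, so the sketch substitutes plain Levenshtein distance and a Fréchet-style embedding (distance to random reference subsets). The function names, the `dim` and `seed` parameters, the choice to drop a trailing short segment, and the toy sequences are all assumptions made for illustration.

```python
import random


def partition(seq, length=50):
    """Split a sequence into consecutive, non-overlapping segments of
    `length` residues. Assumption: a trailing piece shorter than
    `length` is dropped."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, length)]


def edit_distance(a, b):
    """Levenshtein distance computed by dynamic programming. This is a
    stand-in: the abstract does not specify the actual scoring scheme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def frechet_embedding(items, dist, dim=8, seed=0):
    """Map each item to a `dim`-dimensional vector whose k-th coordinate
    is the distance from the item to the nearest member of a random
    reference subset. Assumption: subset sizes are drawn uniformly."""
    rng = random.Random(seed)
    subsets = [rng.sample(items, rng.randint(1, len(items)))
               for _ in range(dim)]
    return [[min(dist(x, a) for a in subset) for subset in subsets]
            for x in items]


if __name__ == "__main__":
    # Toy sequences, invented for illustration only; a short segment
    # length keeps the example small.
    toy = ["MKTAYIAKQRQISFVKSHFSRQ", "MKTAYIAKQRQISFVKAHFSRQ"]
    segs = [s for protein in toy for s in partition(protein, length=5)]
    vectors = frechet_embedding(segs, edit_distance, dim=4)
    print(len(segs), len(vectors[0]))
```

Because each coordinate is a distance to a fixed subset, the triangle inequality guarantees that no single coordinate differs between two segments by more than their original distance. This gives a simple sanity check on the embedding, although it says nothing about how tightly the distortion is bounded for a given `dim`.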