Kawaji Hideya, Takenaka Yoichi, Matsuda Hideo
Department of Bioinformatic Engineering, Graduate School of Information Science and Technology, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka 560-8531, Japan.
Bioinformatics. 2004 Jan 22;20(2):243-52. doi: 10.1093/bioinformatics/btg397.
Clustering of protein sequences is widely used for the functional characterization of proteins. However, clustering distantly related proteins, whose sequences share only regional similarity, remains difficult. An algorithm for clustering such distantly related proteins is therefore needed.
We have developed a time- and space-efficient clustering algorithm. It represents the data as a graph whose vertices denote proteins and whose edges denote sequence similarities above a certain cutoff score. It repeatedly partitions the graph by removing edges with small weights, which correspond to low sequence similarities. To find appropriate partitions, we introduce a score combining the normalized cut and locally minimal cut capacities. We applied our method to all 40,703 human proteins in SWISS-PROT and TrEMBL. The resulting clusters show 76% recall (20,529 proteins) of the 26,917 proteins classified by InterPro. The method also finds relationships not found by other clustering methods.
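The graph representation and the normalized cut mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it computes only the standard normalized cut of a two-way partition, Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), and omits the locally minimal cut component of the combined score, which the abstract does not specify. The function names, protein labels, similarity scores, and cutoff value are all hypothetical.

```python
from collections import defaultdict


def build_graph(similarities, cutoff):
    """Build an undirected weighted graph, keeping only pairs scoring above the cutoff."""
    graph = defaultdict(dict)
    for a, b, score in similarities:
        if score > cutoff:
            graph[a][b] = score
            graph[b][a] = score
    return graph


def normalized_cut(graph, part_a):
    """Standard normalized cut of the bipartition (part_a, all other vertices)."""
    part_a = set(part_a)
    cut = 0.0      # total weight of edges crossing the partition
    assoc_a = 0.0  # total weight of edges incident to vertices in part_a
    assoc_b = 0.0  # total weight of edges incident to the remaining vertices
    for u, neighbours in graph.items():
        for v, weight in neighbours.items():
            if u in part_a:
                assoc_a += weight
                if v not in part_a:
                    cut += weight
            else:
                assoc_b += weight
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")  # one side has no edges; treat as an invalid split
    return cut / assoc_a + cut / assoc_b


# Hypothetical similarity scores: two tight groups joined by one weak edge.
similarities = [
    ("P1", "P2", 200), ("P2", "P3", 200), ("P1", "P3", 200),
    ("P4", "P5", 200),
    ("P3", "P4", 60),   # weak link between the two groups
    ("P1", "P5", 30),   # not above the cutoff, so no edge is created
]
graph = build_graph(similarities, cutoff=50)

good = normalized_cut(graph, {"P1", "P2", "P3"})  # removes only the weak edge
bad = normalized_cut(graph, {"P1", "P2"})         # splits a tight group
print(f"good split: {good:.3f}, bad split: {bad:.3f}")
```

In this toy graph, the partition that removes only the low-weight edge has the smaller normalized cut, which matches the idea of separating proteins along low sequence similarities.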
The complete results of our algorithm for all human proteins in SWISS-PROT and TrEMBL, along with other supplementary information, are available at http://motif.ics.es.osaka-u.ac.jp/Ncut-KL/