Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, TW20 0EX, Egham, UK.
BMC Bioinformatics. 2010 Mar 9;11:120. doi: 10.1186/1471-2105-11-120.
An important problem in genomics is the automatic inference of groups of homologous proteins from pairwise sequence similarities. Several approaches proposed for this task are "local" in the sense that they assign a protein to a cluster based only on the distances between that protein and the other proteins in the set. It was recently shown that global methods such as spectral clustering perform better on a wide variety of datasets. However, currently available implementations of spectral clustering mostly consist of a few loosely coupled Matlab scripts that assume a fair amount of familiarity with Matlab programming, and they are therefore inaccessible to large parts of the research community.
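To make the idea of a "global" method concrete, the following is a minimal sketch of a generic two-way spectral partition. It is a textbook construction, not the SCPS algorithm, whose details are not given in this abstract. The function name `spectral_bipartition`, the power-iteration approach and the toy similarity values are all assumptions made for illustration. The split is derived from the second eigenvector of the degree-normalized similarity matrix, so every pairwise similarity contributes to every assignment.

```python
import math

def spectral_bipartition(sim, iters=500):
    """Split items into two groups using the second eigenvector of the
    normalized similarity matrix D^-1/2 W D^-1/2 (hypothetical helper).

    Assumes `sim` is symmetric, non-negative, and every row sum is positive.
    """
    n = len(sim)
    deg = [sum(row) for row in sim]
    # Normalized matrix, shifted to (M + I) / 2 so all eigenvalues lie in [0, 1].
    m = [[0.5 * (sim[i][j] / math.sqrt(deg[i] * deg[j]) + (1.0 if i == j else 0.0))
          for j in range(n)] for i in range(n)]
    # The leading eigenvector is known in closed form: sqrt(deg), normalized.
    norm = math.sqrt(sum(deg))
    top = [math.sqrt(d) / norm for d in deg]
    # Deterministic start vector; power iteration with the leading vector removed.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]
        v = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    # Map back through D^-1/2 and split by sign.
    return [0 if v[i] / math.sqrt(deg[i]) < 0 else 1 for i in range(n)]

# Toy similarities (made up): two groups of three proteins, weak cross-group links.
hi, lo = 0.9, 0.05
sim = [[1.0 if i == j else (hi if i // 3 == j // 3 else lo) for j in range(6)]
       for i in range(6)]
labels = spectral_bipartition(sim)
```

Which group receives label 0 and which receives label 1 is arbitrary; only the grouping is meaningful. Multi-way clustering typically uses several eigenvectors followed by a step such as k-means, which this sketch omits.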
SCPS (Spectral Clustering of Protein Sequences) is an efficient and user-friendly implementation of a spectral method for inferring protein families. The method uses only pairwise sequence similarities, and is therefore practical when only sequence information is available. SCPS was tested on difficult sets of proteins whose relationships were extracted from the SCOP database, and its results were extensively compared with those obtained using other popular protein clustering algorithms such as TribeMCL, hierarchical clustering and connected component analysis. We show that SCPS is able to identify many of the family/superfamily relationships correctly and that the quality of the obtained clusters, as indicated by their F-scores, is consistently better than that of all the other methods we compared it with. We also demonstrate the scalability of SCPS by clustering the entire SCOP database (14,183 sequences) and the complete genome of the yeast Saccharomyces cerevisiae (6,690 sequences).
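As an illustration of how an F-score can summarize cluster quality against reference families such as those from SCOP, here is a minimal sketch. The exact F-score variant used in the study is not stated in this abstract, so the weighting scheme below (best-matching cluster per family, weighted by family size) is an assumption, as are the function names and the toy data.

```python
def f_score(cluster, family):
    """Harmonic mean of precision and recall of one cluster against one family."""
    overlap = len(cluster & family)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)  # fraction of the cluster that belongs to the family
    recall = overlap / len(family)      # fraction of the family recovered by the cluster
    return 2 * precision * recall / (precision + recall)

def clustering_f_score(clusters, families):
    """Size-weighted average, over families, of the best F-score any cluster achieves.

    This particular aggregation is an assumption made for illustration.
    """
    total = sum(len(f) for f in families)
    return sum(len(f) * max(f_score(c, f) for c in clusters) for f in families) / total

# Toy example with made-up protein identifiers.
families = [{"a", "b"}, {"c", "d", "e"}]
clusters = [{"a", "b", "c"}, {"d", "e"}]
score = clustering_f_score(clusters, families)
perfect = clustering_f_score(families, families)
```

A clustering that reproduces the reference families exactly scores 1.0 under this definition, and misplaced proteins lower the score through either precision or recall.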
Besides the spectral method, SCPS implements connected component analysis and hierarchical clustering, integrates TribeMCL, and provides several tools for assessing cluster quality. It can extract human-readable protein descriptions from NCBI using GI numbers, interfaces with external tools such as BLAST and Cytoscape, and can produce publication-quality graphical representations of the clusters obtained. It thus constitutes a comprehensive and effective tool for practical research in computational biology. Source code and precompiled executables for Windows, Linux and Mac OS X are freely available at http://www.paccanarolab.org/software/scps.
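Connected component analysis, one of the baseline methods mentioned above, can be sketched as follows. This is a generic union-find formulation and not SCPS source code; the function name, the edge format `(i, j, similarity)` and the threshold values are assumptions chosen for the example.

```python
def connected_components(n, edges, threshold):
    """Group items 0..n-1 so that any two items joined by a chain of
    edges with similarity >= threshold end up in the same cluster."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, similarity in edges:
        if similarity >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    groups = {}
    for item in range(n):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())

# Made-up similarities between five proteins.
edges = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.1), (3, 4, 0.7)]
strict = connected_components(5, edges, threshold=0.5)   # [[0, 1, 2], [3, 4]]
loose = connected_components(5, edges, threshold=0.05)   # [[0, 1, 2, 3, 4]]
```

With the looser threshold, the single weak link between proteins 2 and 3 merges the two groups, which illustrates why the result of this method depends heavily on the chosen cutoff.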