SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale.

Affiliation

Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, TW20 0EX, Egham, UK.

Publication information

BMC Bioinformatics. 2010 Mar 9;11:120. doi: 10.1186/1471-2105-11-120.

DOI:10.1186/1471-2105-11-120

PMID:20214776

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2841596/

Abstract

BACKGROUND

An important problem in genomics is the automatic inference of groups of homologous proteins from pairwise sequence similarities. Several approaches have been proposed for this task which are "local" in the sense that they assign a protein to a cluster based only on the distances between that protein and the other proteins in the set. It was shown recently that global methods such as spectral clustering have better performance on a wide variety of datasets. However, currently available implementations of spectral clustering methods mostly consist of a few loosely coupled Matlab scripts that assume a fair amount of familiarity with Matlab programming and hence they are inaccessible for large parts of the research community.

RESULTS

SCPS (Spectral Clustering of Protein Sequences) is an efficient and user-friendly implementation of a spectral method for inferring protein families. The method uses only pairwise sequence similarities, and is therefore practical when only sequence information is available. SCPS was tested on difficult sets of proteins whose relationships were extracted from the SCOP database, and its results were extensively compared with those obtained using other popular protein clustering algorithms such as TribeMCL, hierarchical clustering and connected component analysis. We show that SCPS is able to identify many of the family/superfamily relationships correctly and that the quality of the obtained clusters as indicated by their F-scores is consistently better than all the other methods we compared it with. We also demonstrate the scalability of SCPS by clustering the entire SCOP database (14,183 sequences) and the complete genome of the yeast Saccharomyces cerevisiae (6,690 sequences).
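To make the idea of a "global" spectral method concrete, the following is a minimal sketch of a two-way spectral split computed from nothing but a symmetric pairwise similarity matrix. It is an illustration under stated assumptions, not the SCPS implementation: the function name `spectral_bipartition`, the toy similarity values, and the use of power iteration are all hypothetical choices, and the published method may differ in normalization, in the number of eigenvectors used, and in how the number of clusters is chosen.

```python
import math


def spectral_bipartition(sim, iters=200):
    """Split items into two groups using the sign pattern of the second
    eigenvector of the normalized similarity matrix D^-1/2 S D^-1/2.

    `sim` is assumed to be a symmetric, non-negative list-of-lists matrix.
    """
    n = len(sim)
    deg = [sum(row) for row in sim]
    if any(d <= 0 for d in deg):
        raise ValueError("every item needs a positive total similarity")
    inv_sqrt = [1.0 / math.sqrt(d) for d in deg]

    # The leading eigenvector (eigenvalue 1) is sqrt(degree), scaled to unit length.
    norm1 = math.sqrt(sum(deg))
    u1 = [math.sqrt(d) / norm1 for d in deg]

    def apply_normalized(x):
        # y = D^-1/2 S D^-1/2 x
        return [
            inv_sqrt[i] * sum(sim[i][j] * inv_sqrt[j] * x[j] for j in range(n))
            for i in range(n)
        ]

    x = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        mx = apply_normalized(x)
        # Average with x (a shift by the identity) so all eigenvalues are non-negative.
        x = [0.5 * (a + b) for a, b in zip(mx, x)]
        # Deflate: remove the component along the leading eigenvector.
        proj = sum(a * b for a, b in zip(x, u1))
        x = [a - proj * b for a, b in zip(x, u1)]
        length = math.sqrt(sum(a * a for a in x))
        if length == 0.0:
            raise ValueError("start vector lies entirely along the leading eigenvector")
        x = [a / length for a in x]

    # The sign of each entry decides the group; which group gets 0 or 1 is arbitrary.
    return [0 if v >= 0 else 1 for v in x]


# Toy data (hypothetical): six sequences from two interleaved families.
family = [0, 1, 0, 1, 0, 1]
sim = [
    [1.0 if i == j else (0.9 if family[i] == family[j] else 0.05) for j in range(6)]
    for i in range(6)
]
labels = spectral_bipartition(sim)
print(labels)
```

The split depends on the whole matrix at once, because the eigenvector is a property of all pairwise similarities together. That is the sense in which such a method is global rather than local.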

CONCLUSIONS

Besides the spectral method, SCPS implements connected component analysis and hierarchical clustering, integrates TribeMCL, provides several cluster quality tools, extracts human-readable protein descriptions from NCBI using GI numbers, interfaces with external tools such as BLAST and Cytoscape, and produces publication-quality graphical representations of the clusters obtained. It thus constitutes a comprehensive and effective tool for practical research in computational biology. Source code and precompiled executables for Windows, Linux and Mac OS X are freely available at http://www.paccanarolab.org/software/scps.
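Connected component analysis, one of the baseline methods named above, can be sketched in a few lines: keep only the sequence pairs whose similarity reaches a cutoff, then report each connected group as a cluster. This is a generic illustration, not code taken from SCPS. The function name `connected_components`, the `(i, j, score)` edge format and the threshold value are assumptions made for the example.

```python
def connected_components(n, edges, threshold):
    """Cluster items 0..n-1 by linking every pair whose score >= threshold.

    `edges` is an iterable of (i, j, score) tuples (a hypothetical input format).
    Returns a sorted list of clusters, each a list of item indices.
    """
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, score in edges:
        if score >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy similarities (hypothetical): a weak 2-3 link separates two groups.
edges = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.1), (3, 4, 0.7)]
print(connected_components(6, edges, threshold=0.5))
```

In this sketch a single link above the cutoff is enough to merge two groups, so the result is sensitive to the chosen threshold.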
