Merkeev Igor V, Mironov Andrey A
State Scientific Center GosNIIGenetica, 1st Dorozhny pr., 1, Moscow, 113545, Russia.
BMC Evol Biol. 2006 Jun 22;6:51. doi: 10.1186/1471-2148-6-51.
The need to compare protein profiles frequently arises in various areas of protein research: comparison of protein families, domain searches, and resolution of orthology and paralogy. Existing fast algorithms can compare only a protein sequence with a protein sequence or a profile with a sequence. Algorithms that compare profiles with each other use dynamic programming and complex scoring functions.
We developed a new algorithm called PHOG-BLAST for fast similarity search of profiles. This algorithm uses profile discretization to convert a profile to a string over a finite alphabet and uses hashing for fast search. To determine the optimal alphabet, we analyzed columns in reliable multiple alignments and obtained column clusters in the 20-dimensional profile space by applying a special clustering procedure. We show that the clustering procedure works best if its parameters are chosen so that 20 profile clusters are obtained, which can be interpreted as ancestral amino acid residues. With these clusters, fewer than 2% of columns in multiple alignments fall outside the clusters. We tested the performance of PHOG-BLAST vs. PSI-BLAST on three well-known databases of multiple alignments: COG, PFAM and BALIBASE. On the COG database both algorithms showed the same performance; on PFAM and BALIBASE, PHOG-BLAST was much superior to PSI-BLAST. PHOG-BLAST required 10-20 times less computer memory and computation time than PSI-BLAST.
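The two ideas named above, profile discretization and hashing, can be illustrated with a minimal sketch. It is not the PHOG-BLAST implementation: the abstract does not state the distance measure, the word length, or how out-of-cluster columns are encoded, so the Euclidean distance, the `max_dist` cutoff, the word length `k`, the `X` placeholder letter, and all function names below are assumptions made for illustration. The toy data use 3-dimensional columns and three clusters for brevity, whereas the paper works in a 20-dimensional space with 20 clusters.

```python
from collections import defaultdict
from math import dist


def discretize(profile, centroids, alphabet, max_dist, unknown="X"):
    """Map each profile column to the letter of its nearest cluster centroid.

    Columns farther than max_dist from every centroid get the `unknown`
    letter (an assumed way of handling out-of-cluster columns).
    """
    letters = []
    for column in profile:
        # Index of the centroid closest to this column (Euclidean distance).
        best = min(range(len(centroids)), key=lambda i: dist(column, centroids[i]))
        if dist(column, centroids[best]) <= max_dist:
            letters.append(alphabet[best])
        else:
            letters.append(unknown)
    return "".join(letters)


def build_index(strings, k):
    """Hash every length-k word of every discretized profile to its positions."""
    index = defaultdict(list)
    for name, text in strings.items():
        for pos in range(len(text) - k + 1):
            index[text[pos:pos + k]].append((name, pos))
    return index


def seed_hits(query, index, k):
    """Return (target name, query position, target position) for shared words."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, tpos in index.get(query[qpos:qpos + k], ()):
            hits.append((name, qpos, tpos))
    return hits


# Toy data (hypothetical): three clusters centred on "pure" columns.
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
alphabet = "ABC"

family_a = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9], [0.8, 0.1, 0.1]]
query_profile = [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.9, 0.05, 0.05]]

target = discretize(family_a, centroids, alphabet, max_dist=0.5)
query = discretize(query_profile, centroids, alphabet, max_dist=0.5)

index = build_index({"family_a": target}, k=3)
print(target, query, seed_hits(query, index, k=3))
```

Once profiles are reduced to strings, word lookups replace column-by-column profile scoring during the search, which is the kind of shortcut that makes a hashing-based search fast. In a fuller pipeline the seed hits would presumably be extended and scored, a step this sketch leaves out.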
Since PHOG-BLAST can compare multiple alignments of protein families, it can be used in different areas of comparative proteomics and protein evolution. For example, PHOG-BLAST helped to build the PHOG database of phylogenetic orthologous groups. An essential step in building this database was comparing protein complements of different species and orthologous groups of different taxa on a personal computer in reasonable time. When applied to detecting weak similarity between protein families, PHOG-BLAST is less precise than rigorous profile-profile comparison methods, but it runs much faster and can be used as a tool for pre-selecting hits.