Clustering orthologous proteins across phylogenetically distant species.

Kim Sunshin, Kang Jaewoo, Chung Yong Je, Li Jinyan, Ryu Keun Ho

School of Electrical and Computer Engineering, Chungbuk National University, Cheongju, Korea.

Proteins. 2008 May 15;71(3):1113-22. doi: 10.1002/prot.21792.

The quality of orthologous protein clusters (OPCs) depends largely on the reciprocal BLAST (Basic Local Alignment Search Tool) hits among genomes. BLAST is fast and efficient, but it rarely yields optimal results among phylogenetically distant species, because genomes separated by a large evolutionary distance typically have low similarity in their protein sequences. To reduce false positives in the OPCs, thresholds are often applied to the BLAST scores. However, thresholding also eliminates many true positives, because orthologs from distant species are likely to have low BLAST scores. To address this problem, we introduce a new hybrid method that combines a recursive clustering algorithm with the Markov CLuster (MCL) algorithm and uses no BLAST thresholding. In the first step, we use InParanoid to produce n(n-1)/2 ortholog tables from n genomes. After combining all the tables into one, our clustering algorithm recursively clusters the ortholog pairs in the table. Our method then applies the MCL algorithm to compute the clusters and refines them by adjusting the inflation factor. We tested the method on six genomes and evaluated the results against KEGG Orthology (KO) OPCs, which are generated from manually curated pathways. To quantify accuracy, we introduced a new, intuitive similarity measure, based on our Least-move algorithm, that computes the consistency between two OPCs. We compared the resulting OPCs with the KO OPCs using this measure, and we also evaluated the performance of our method using InParanoid as the baseline. The experimental results show that, at an inflation factor of 1.3, our method produced 54% more orthologs than InParanoid at a small cost in accuracy (1.7% lower), and at an inflation factor of 1.4, it produced 15% more orthologs than InParanoid with higher accuracy (1.4% higher).
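
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the recursive clustering step, so `merge_pairs` substitutes a plain union-find grouping of ortholog pairs, and `mcl` is a textbook Markov clustering loop (expansion, inflation, column normalization) over a small dense adjacency matrix. All function names, the iteration count, and the `1e-6` cutoff are assumptions made for illustration.

```python
from itertools import combinations


def pairwise_runs(genomes):
    """All unordered genome pairs: n(n-1)/2 pairwise comparisons."""
    return list(combinations(genomes, 2))


def merge_pairs(pairs):
    """Group proteins linked by any chain of ortholog pairs (union-find).

    Assumption: a stand-in for the paper's recursive clustering step.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return sorted(sorted(c) for c in clusters.values())


def normalize_columns(m):
    """Scale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def mcl(adj, inflation=1.4, iterations=50):
    """Textbook MCL on a dense adjacency matrix (list of lists)."""
    n = len(adj)
    # Add self-loops, then make the matrix column-stochastic.
    m = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: element-wise power, then renormalize the columns.
        m = normalize_columns([[v ** inflation for v in row] for row in m])
    # Each row that still carries mass defines a cluster of its columns.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

For six genomes, `pairwise_runs` yields 15 pairs, matching n(n-1)/2. In MCL generally, a larger inflation value tends to split the graph into finer clusters, which is the parameter the abstract reports tuning between 1.3 and 1.4.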

Similar articles

Clustering orthologous proteins across phylogenetically distant species.

Proteins. 2008 May 15;71(3):1113-22. doi: 10.1002/prot.21792.

Automatic clustering of orthologs and inparalogs shared by multiple proteomes.

Bioinformatics. 2006 Jul 15;22(14):e9-15. doi: 10.1093/bioinformatics/btl213.

ReMark: an automatic program for clustering orthologs flexibly combining a Recursive and a Markov clustering algorithms.

Bioinformatics. 2011 Jun 15;27(12):1731-3. doi: 10.1093/bioinformatics/btr259. Epub 2011 May 5.

A phylogenomic analysis of the Ascomycota.

Fungal Genet Biol. 2006 Oct;43(10):715-25. doi: 10.1016/j.fgb.2006.05.001. Epub 2006 Jun 15.

Hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes.

Nucleic Acids Res. 2006 Jan 25;34(2):647-58. doi: 10.1093/nar/gkj448. Print 2006.

Inparanoid: a comprehensive database of eukaryotic orthologs.

Nucleic Acids Res. 2005 Jan 1;33(Database issue):D476-80. doi: 10.1093/nar/gki107.

Cluster-C, an algorithm for the large-scale clustering of protein sequences based on the extraction of maximal cliques.

Comput Biol Chem. 2004 Jul;28(3):211-8. doi: 10.1016/j.compbiolchem.2004.03.002.

Automatic clustering of orthologs and in-paralogs from pairwise species comparisons.

J Mol Biol. 2001 Dec 14;314(5):1041-52. doi: 10.1006/jmbi.2000.5197.

A database of phylogenetically atypical genes in archaeal and bacterial genomes, identified using the DarkHorse algorithm.

BMC Bioinformatics. 2008 Oct 7;9:419. doi: 10.1186/1471-2105-9-419.

A hybrid clustering approach to recognition of protein families in 114 microbial genomes.

BMC Bioinformatics. 2004 Apr 29;5:45. doi: 10.1186/1471-2105-5-45.

Cited by

Structural evolution drives diversification of the large LRR-RLK gene family.

New Phytol. 2020 Jun;226(5):1492-1505. doi: 10.1111/nph.16455. Epub 2020 Feb 29.

A Cross-Species Study of PI3K Protein-Protein Interactions Reveals the Direct Interaction of P85 and SHP2.

Sci Rep. 2016 Feb 3;6:20471. doi: 10.1038/srep20471.
