Altenhoff Adrian M, Dessimoz Christophe
Institute of Computational Science, ETH Zurich, and Swiss Institute of Bioinformatics, Zürich, Switzerland.
PLoS Comput Biol. 2009 Jan;5(1):e1000262. doi: 10.1371/journal.pcbi.1000262. Epub 2009 Jan 16.
Accurate genome-wide identification of orthologs is a central problem in comparative genomics, a fact reflected by the numerous orthology identification projects developed in recent years. However, only a few reports have compared their accuracy, and indeed, several recent efforts have not yet been systematically evaluated. Furthermore, orthology is typically assessed only in terms of function conservation, even though Fitch's original definition is phylogeny-based. We collected and mapped the results of nine leading orthology projects and methods (COG, KOG, Inparanoid, OrthoMCL, Ensembl Compara, Homologene, RoundUp, EggNOG, and OMA) and two standard methods (bidirectional best-hit and reciprocal smallest distance). We systematically compared their predictions with respect to both phylogeny and function, using six different tests. This required the mapping of millions of sequences, the handling of hundreds of millions of predicted pairs of orthologs, and the computation of tens of thousands of trees. In phylogenetic analysis or in functional analysis where high specificity is required, we find that OMA and Homologene perform best. At a lower level of functional specificity but higher coverage, OrthoMCL outperforms Ensembl Compara, and to a lesser extent Inparanoid. Lastly, the large coverage of the recent EggNOG can be of interest for building broad functional groupings, but the method is not specific enough for phylogenetic or detailed functional analyses. In terms of general methodology, we observe that the more sophisticated tree reconstruction/reconciliation approach of Ensembl Compara was at times outperformed by pairwise comparison approaches, even in phylogenetic tests. Furthermore, we show that standard bidirectional best-hit often outperforms projects with more complex algorithms. First, the present study provides guidance for the broad community of orthology data users as to which database best suits their needs. Second, it introduces new methodology to verify orthology. Third, it sets performance standards for current and future approaches.
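For readers unfamiliar with the bidirectional best-hit baseline named in the abstract, the idea can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names, the toy gene identifiers, and the score values are all hypothetical, and it assumes that pairwise alignment scores between two genomes A and B have already been computed in both directions (higher score meaning a better hit).

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target gene.

    `scores` maps (query, target) pairs to an alignment score.
    On ties, the first pair encountered is kept (an arbitrary choice).
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(a_to_b, b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_to_b)
    best_ba = best_hits(b_to_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical toy scores: genome A searched against genome B, and the reverse.
a_to_b = {("a1", "b1"): 250, ("a1", "b2"): 90, ("a2", "b2"): 180, ("a3", "b2"): 200}
b_to_a = {("b1", "a1"): 245, ("b2", "a3"): 205, ("b2", "a2"): 175}

pairs = bidirectional_best_hits(a_to_b, b_to_a)
print(pairs)
```

In this toy input, a2's best hit is b2, but b2's best hit is a3, so a2 is left without a predicted ortholog. This shows why a pairwise reciprocal criterion tends to favor specificity over coverage.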