Department of Chemistry, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.
PLoS One. 2007 Apr 18;2(4):e383. doi: 10.1371/journal.pone.0000383.
Orthology detection is critically important for accurate functional annotation, and has been widely used to facilitate studies in comparative and evolutionary genomics. Although various methods are now available, there has been no comprehensive analysis of their performance, owing to the lack of a genome-scale 'gold standard' orthology dataset. Even in the absence of such datasets, comparing results from alternative methodologies yields useful information, as agreement enhances confidence and disagreement indicates possible errors. Latent Class Analysis (LCA) is a statistical technique that can exploit this information to infer sensitivities and specificities, and is applied here to evaluate the performance of various orthology detection methods on a eukaryotic dataset. Overall, we observe a trade-off between sensitivity and specificity in orthology detection, with BLAST-based methods characterized by high sensitivity, and tree-based methods by high specificity. Two algorithms exhibit the best overall balance, with both sensitivity and specificity above 80%: INPARANOID identifies orthologs across two species, while OrthoMCL clusters orthologs from multiple species. Among methods that permit clustering of ortholog groups spanning multiple genomes, the (automated) OrthoMCL algorithm exhibits better within-group consistency with respect to protein function and domain architecture than both the (manually curated) KOG database and the homolog clustering algorithm TribeMCL. Using LCA, we are also able to comprehensively assess similarities and statistical dependence between various strategies, and to evaluate the effects of parameter settings on performance. In summary, we present a comprehensive evaluation of orthology detection on a divergent set of eukaryotic genomes, thus providing insight and guidance for method selection, tuning and development for different applications.
Many biological questions have been addressed by multiple tests yielding binary (yes/no) outcomes but no clear definition of truth, making LCA an attractive approach for computational biology.
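To illustrate how LCA can recover sensitivities and specificities when no gold standard exists, the following is a minimal sketch of a two-class latent class model fitted by expectation-maximization. It is not the authors' implementation. It assumes that each method makes a binary call (ortholog or not) for every candidate protein pair, and that the methods err independently of one another given the true status. That independence assumption is a simplification, since the abstract notes that statistical dependence between strategies is itself assessed. The function names `simulate_calls` and `fit_lca`, and all parameter values, are hypothetical.

```python
import numpy as np


def simulate_calls(n_pairs, prevalence, sens, spec, seed=0):
    """Simulate binary calls from several methods on pairs with hidden truth.

    Hypothetical helper for illustration only; the data are synthetic.
    """
    rng = np.random.default_rng(seed)
    truth = rng.random(n_pairs) < prevalence          # hidden ortholog status
    sens = np.asarray(sens, dtype=float)
    spec = np.asarray(spec, dtype=float)
    # Probability of a positive call: sensitivity if truly orthologous,
    # otherwise the false-positive rate (1 - specificity).
    p_call = np.where(truth[:, None], sens, 1.0 - spec)
    return (rng.random(p_call.shape) < p_call).astype(int)


def fit_lca(calls, n_iter=500, tol=1e-9):
    """Fit a two-class latent class model by EM.

    calls: (n_pairs, n_methods) array of 0/1 calls.
    Returns (prevalence, sensitivities, specificities, posteriors).
    Assumes conditional independence of methods given the latent class.
    """
    calls = np.asarray(calls, dtype=float)
    n_methods = calls.shape[1]
    # Starting values assume every method is better than chance, which
    # steers EM toward the labelling where class 1 means "ortholog".
    prevalence = 0.5
    sens = np.full(n_methods, 0.8)
    spec = np.full(n_methods, 0.8)
    eps = 1e-6
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a true ortholog.
        like_pos = prevalence * np.prod(
            sens ** calls * (1 - sens) ** (1 - calls), axis=1)
        like_neg = (1 - prevalence) * np.prod(
            (1 - spec) ** calls * spec ** (1 - calls), axis=1)
        total = like_pos + like_neg
        post = like_pos / total
        # M-step: posterior-weighted rates of positive and negative calls.
        prevalence = post.mean()
        sens = (post[:, None] * calls).sum(axis=0) / post.sum()
        spec = ((1 - post)[:, None] * (1 - calls)).sum(axis=0) / (1 - post).sum()
        sens = np.clip(sens, eps, 1 - eps)
        spec = np.clip(spec, eps, 1 - eps)
        # Stop once the log-likelihood no longer improves.
        ll = np.log(total).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return prevalence, sens, spec, post


if __name__ == "__main__":
    true_sens = [0.90, 0.85, 0.70, 0.95]
    true_spec = [0.80, 0.90, 0.95, 0.85]
    calls = simulate_calls(20000, 0.4, true_sens, true_spec, seed=1)
    prevalence, sens, spec, _ = fit_lca(calls)
    print("estimated prevalence:", round(float(prevalence), 3))
    print("estimated sensitivity:", np.round(sens, 3))
    print("estimated specificity:", np.round(spec, 3))
```

With three or more methods the two-class model is identifiable up to swapping the class labels, which is why the starting values matter. On real data, correlated errors between methods that share a strategy (for example, several BLAST-based methods) would violate the independence assumption, and a model with explicit dependence terms would be needed.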