Rane Rahul V, Oakeshott John G, Nguyen Thu, Hoffmann Ary A, Lee Siu F
Bio21 Institute, School of Biosciences, The University of Melbourne, Melbourne, Victoria, Australia.
CSIRO, Canberra, Australian Capital Territory, Australia.
BMC Genomics. 2017 Aug 31;18(1):673. doi: 10.1186/s12864-017-4079-6.
Distinguishing orthologous and paralogous relationships between genes across multiple species is essential for comparative genomic analyses. Various computational approaches have been developed to resolve these evolutionary relationships, but strong trade-offs between precision and recall of orthologue prediction remain an ongoing challenge.
Here we present Orthonome, an orthologue prediction pipeline designed to reduce the trade-off between orthologue capture rates (recall) and accuracy of multi-species orthologue prediction. The pipeline compares sequence domains and then forms sequence-similar clusters before using phylogenetic comparisons to identify inparalogues. It then corrects sequence similarity metrics for fragment and gene length bias using a novel scoring metric that captures relationships between full-length as well as fragmented genes. The remaining genes are then brought together for the identification of orthologues within a phylogenetic framework. The orthologue predictions are further calibrated, along with inparalogues and gene births, using synteny to identify novel orthologous relationships. We use 12 high-quality Drosophila genomes to show that, compared to other orthologue prediction pipelines, Orthonome provides orthogroups with minimal error but high recall. Furthermore, Orthonome is resilient to suboptimal assembly/annotation quality: with the inclusion of draft genomes from eight additional Drosophila species, it still provides >6500 1:1 orthologues across all twenty species while retaining a better combination of accuracy and recall than other pipelines. Orthonome is implemented as a searchable database and query tool along with multiple-sequence alignment browsers for all sets of orthologues. The underlying documentation and database are accessible at http://www.orthonome.com.
We demonstrate that Orthonome provides a superior combination of orthologue capture rates and accuracy on complete and draft drosophilid genomes when tested alongside previously published pipelines. The study also highlights a greater degree of evolutionary conservation across drosophilid species than previously thought.
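The trade-off between accuracy (precision) and capture rate (recall) discussed above can be made concrete with a minimal sketch. It assumes orthologue predictions are represented as unordered gene pairs and compared against a trusted reference set; the function name, gene identifiers, and data below are hypothetical illustrations and are not taken from Orthonome or its benchmarks.

```python
from typing import FrozenSet, Set, Tuple

# An orthologue call is modelled as an unordered pair of gene identifiers.
Pair = FrozenSet[str]


def pair(gene_a: str, gene_b: str) -> Pair:
    """Build an order-independent orthologue pair."""
    return frozenset((gene_a, gene_b))


def precision_recall(predicted: Set[Pair], reference: Set[Pair]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted pairs against a reference set.

    precision = correct predictions / all predictions
    recall    = correct predictions / all reference pairs
    """
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical reference orthologue pairs between two species.
reference = {
    pair("dmel_g1", "dsim_g1"),
    pair("dmel_g2", "dsim_g2"),
    pair("dmel_g3", "dsim_g3"),
    pair("dmel_g4", "dsim_g4"),
}

# Hypothetical predictions: two correct, one wrong, two reference pairs missed.
predicted = {
    pair("dsim_g1", "dmel_g1"),  # gene order does not matter
    pair("dmel_g2", "dsim_g2"),
    pair("dmel_g3", "dsim_g9"),
}

precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case a stricter predictor could drop the wrong pair and raise precision, while a more permissive one could recover the missed pairs at the risk of more false calls; that tension is what the abstract refers to as the precision-recall trade-off.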