Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, Zurich, 8057, Switzerland.
SIB Swiss Institute of Bioinformatics, Quartier Sorge, Batiment Genopode, Lausanne, 1015, Switzerland.
BMC Bioinformatics. 2019 May 6;20(1):228. doi: 10.1186/s12859-019-2828-z.
An orthologous group (OG) comprises a set of orthologous and paralogous genes that share a last common ancestor (LCA). OGs are defined with respect to a chosen taxonomic level, which pins the position of the LCA in time to a specified speciation event. A hierarchy of OGs expands on this notion, connecting more general OGs, distant in time, to more recent, fine-grained OGs, thereby spanning multiple levels of the tree of life. Large-scale inference of OG hierarchies in which each taxonomic level is computed independently can suffer from inconsistencies between successive levels, such as disagreement about the position in time of a duplication event. These inconsistencies can arise from confounding genetic signal or from algorithmic limitations. Importantly, they limit the potential use of OGs for functional annotation and third-party applications.
Here we present a new methodology to ensure hierarchical consistency of OGs across taxonomic levels. To resolve an inconsistency, we subsample the protein space of the OG members and perform gene tree-species tree reconciliation for each sample. Unlike previous approaches, subsampling the protein space lets us avoid the notoriously difficult task of accurately building and reconciling very large phylogenies. We implement the method as a high-throughput pipeline and apply it to the eggNOG database. We use independent protein domain definitions to validate its performance.
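The subsample-and-reconcile idea can be illustrated with a minimal Python sketch. It is not the published pipeline: the function names, the `"split"`/`"merge"` outcomes, the majority vote used to combine samples, and the protein identifiers are all assumptions made for illustration, and `toy_reconcile` is only a placeholder for a real gene tree-species tree reconciliation step (it builds no trees).

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence


def subsample(members_by_og: Dict[str, Sequence[str]], per_og: int,
              rng: random.Random) -> Dict[str, List[str]]:
    """Draw up to `per_og` proteins from each OG, without replacement."""
    return {og: rng.sample(list(members), min(per_og, len(members)))
            for og, members in members_by_og.items()}


def resolve_inconsistency(members_by_og: Dict[str, Sequence[str]],
                          reconcile: Callable[[Dict[str, List[str]]], str],
                          n_samples: int = 5, per_og: int = 3,
                          seed: int = 0) -> str:
    """Reconcile several small samples and combine the outcomes.

    Assumption: outcomes are combined by simple majority vote; the
    original work may aggregate reconciliations differently.
    """
    rng = random.Random(seed)
    votes = Counter(reconcile(subsample(members_by_og, per_og, rng))
                    for _ in range(n_samples))
    decision, _count = votes.most_common(1)[0]
    return decision


def toy_reconcile(sample: Dict[str, List[str]]) -> str:
    """Hypothetical placeholder for tree building plus reconciliation.

    Identifiers are assumed to look like '<species>.<protein>'. If one
    species contributes proteins to more than one OG, the sketch treats
    that as evidence of a duplication and votes to keep the OGs split;
    otherwise it votes to merge them.
    """
    seen = set()
    for proteins in sample.values():
        species = {p.split(".", 1)[0] for p in proteins}
        if seen & species:
            return "split"
        seen |= species
    return "merge"


# Made-up identifiers: both OGs contain proteins from the same species.
paralogous = {
    "OG_A": ["9606.a1", "10090.a2", "7955.a3"],
    "OG_B": ["9606.b1", "10090.b2", "7955.b3"],
}
# Made-up identifiers: the two OGs cover disjoint species.
disjoint = {
    "OG_A": ["9606.a1", "10090.a2"],
    "OG_B": ["7955.b1", "7227.b2"],
}
print(resolve_inconsistency(paralogous, toy_reconcile))
print(resolve_inconsistency(disjoint, toy_reconcile))
```

Passing the reconciliation step in as a callable keeps the sampling and voting logic independent of any particular tree-building or reconciliation tool, so each sample stays small regardless of how large the full OG is.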
The consistency pipeline presented here shows that, despite earlier limitations, tree reconciliation can be a useful instrument for the construction of OG hierarchies. The key lies in sampling smaller trees and aggregating their reconciliations for robustness. Results show performance comparable to or better than that of previous pipelines. The code is available on GitHub at: https://github.com/meringlab/og_consistency_pipeline