Katriel Guy, Mahanaymi Udi, Brezner Shelly, Kezel Noor, Koutschan Christoph, Zeilberger Doron, Steel Mike, Snir Sagi
Department of Mathematics, Braude College of Engineering, Karmiel, Israel.
Department of Evolutionary and Environmental Biology, University of Haifa, Haifa, Israel.
Syst Biol. 2023 Dec 30;72(6):1403-1417. doi: 10.1093/sysbio/syad060.
The genomic era has opened up vast opportunities in molecular systematics, one of which is deciphering the evolutionary history in fine detail. Under this mass of data, analyzing the point mutations of standard markers is often too crude and slow for fine-scale phylogenetics. Nevertheless, genome dynamics (GD) events provide alternative, often richer information. The synteny index (SI) between a pair of genomes combines gene order and gene content information, allowing the comparison of genomes of unequal gene content, together with order considerations of their common genes. Recently, genome dynamics has been modeled as a continuous-time Markov process, and gene distance in the genome as a birth-death-immigration process. Nevertheless, due to complexities arising in this setting, no precise and provably consistent estimators could be derived, resulting in heuristic solutions. Here, we extend this modeling approach by using techniques from birth-death theory to derive explicit expressions of the system's probabilistic dynamics in the form of rational functions of the model parameters. This, in turn, allows us to infer analytically accurate distances between organisms based on their SI. Subsequently, we establish additivity of this estimated evolutionary distance (a desirable property yielding phylogenetic consistency). Applying the new measure in simulation studies shows that it provides accurate results in realistic settings and even under model extensions such as gene gain/loss or over a tree structure. In the real-data realm, we applied the new formulation to a unique data structure that we constructed, the ordered orthology DB, which is based on a new version of the EggNOG database, to construct a tree with more than 4.5K taxa. To the best of our knowledge, this is the largest gene-order-based tree constructed, and it overcomes shortcomings found in previous approaches. Constructing a GD-based tree allows us to confirm and contrast findings based on other phylogenetic approaches, as we show.
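To illustrate how an SI can combine gene order and gene content, the following is a minimal sketch and not the authors' implementation. It assumes one common formulation from the SI literature: for each gene shared by two genomes, take its k nearest neighbors on each side in both genomes, count the neighbors they have in common, divide by 2k, and average over all shared genes. The function names, the circular-genome representation, and the parameter `k` are all hypothetical choices made for this example; the abstract itself does not specify them.

```python
def neighborhood(genome, idx, k):
    """Set of genes within k positions of genome[idx] on a circular genome.

    Assumption: gene IDs are unique and len(genome) > 2 * k.
    """
    n = len(genome)
    return {genome[(idx + d) % n] for d in range(-k, k + 1) if d != 0}


def synteny_index(genome_a, genome_b, k):
    """Average, over shared genes, of the fraction of shared k-neighbors.

    Illustrative definition only (assumed, not taken from the abstract).
    Genes present in just one genome are skipped as focal genes, but they
    still occupy positions and so reduce neighborhood overlap.
    """
    pos_a = {gene: i for i, gene in enumerate(genome_a)}
    pos_b = {gene: i for i, gene in enumerate(genome_b)}
    common = pos_a.keys() & pos_b.keys()
    if not common:
        return 0.0
    total = 0.0
    for gene in common:
        near_a = neighborhood(genome_a, pos_a[gene], k)
        near_b = neighborhood(genome_b, pos_b[gene], k)
        total += len(near_a & near_b) / (2 * k)
    return total / len(common)


if __name__ == "__main__":
    ancestor = list(range(6))
    swapped = [0, 2, 1, 3, 4, 5]  # genes 1 and 2 exchanged
    print(synteny_index(ancestor, ancestor, k=1))
    print(synteny_index(ancestor, swapped, k=1))
```

Under this assumed definition, identical gene orders score 1 and genomes with no shared genes score 0. Turning such a score into an evolutionary distance requires the model-based expressions derived in the paper, which this sketch does not attempt to reproduce.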