Arasti Shayesteh, Tabaghi Puoya, Tabatabaee Yasamin, Mirarab Siavash
bioRxiv. 2025 Mar 2:2023.11.13.566962. doi: 10.1101/2023.11.13.566962.
The abundant discordance between evolutionary relationships across the genome has rekindled interest in methods for comparing and averaging trees on a shared leaf set. However, most attempts at comparing and matching trees have focused on tree topology. Comparing branch lengths has been more elusive due to several challenges. Species tree branch lengths can be measured in various units, often different from gene trees. Moreover, rates of evolution change across the genome, the species tree, and specific branches of gene trees. These factors compound the stochasticity of coalescence times and estimation noise, making branch lengths highly heterogeneous across the genome. For many downstream applications in phylogenomic analyses, branch lengths are as important as the topology, and yet, existing tools to compare and combine weighted trees are limited. In this paper, we address the question of matching one tree to another, accounting for their branch lengths. We define a series of computational problems called Topology-Constrained Metric Matching (TCMM) that seek to transform the branch lengths of a query tree based on a reference tree. We show that TCMM problems can be solved in quadratic time and memory using a linear algebraic formulation coupled with dynamic programming preprocessing. While many applications can be imagined for this framework, we explore two applications in this paper: embedding leaves of gene trees in Euclidean space to find outliers potentially indicative of errors and summarizing gene tree branch lengths onto the species tree. In these applications, our method, when paired with existing methods, increases their accuracy at limited computational expense.
The software is available at https://github.com/shayesteh99/TCMM.git . Data is available on Github https://github.com/shayesteh99/TCMM-Data.git .
全基因组进化关系之间存在大量不一致,这重新引发了人们对在共享叶集上比较和平均树的方法的兴趣。然而,大多数比较和匹配树的尝试都集中在树的拓扑结构上。由于存在几个挑战,比较分支长度一直更加难以捉摸。物种树的分支长度可以用各种单位来衡量,这些单位通常与基因树的不同。此外,进化速率在全基因组、物种树以及基因树的特定分支中都会发生变化。这些因素加剧了合并时间的随机性和估计噪声,使得基因组中分支长度高度异质。在系统发育基因组分析的许多下游应用中,分支长度与拓扑结构同样重要,然而,现有的比较和组合加权树的工具却很有限。在本文中,我们解决了考虑分支长度将一棵树与另一棵树进行匹配的问题。我们定义了一系列称为拓扑约束度量匹配(TCMM)的计算问题,旨在根据参考树变换查询树的分支长度。我们表明,使用线性代数公式结合动态规划预处理,TCMM问题可以在二次时间和内存内解决。虽然可以想象这个框架有许多应用,但我们在本文中探索了两个应用:将基因树的叶嵌入欧几里得空间以找到可能指示错误的异常值,以及将基因树分支长度汇总到物种树上。在这些应用中,我们的方法与现有方法结合使用时,能以有限的计算成本提高其准确性。