Yoshida Ruriko, Barnhill David, Miura Keiji, Howe Daniel
IEEE/ACM Trans Comput Biol Bioinform. 2024 Nov-Dec;21(6):1855-1863. doi: 10.1109/TCBB.2024.3420815. Epub 2024 Dec 10.
Much evidence from biological theory and empirical data indicates that gene trees, that is, phylogenetic trees reconstructed from different genes (loci), need not share exactly the same tree topology. Such incongruence between gene trees might be caused by "unusual" evolutionary events, such as meiotic sexual recombination in eukaryotes or horizontal transfer of genetic material in prokaryotes. However, most gene trees are constrained by the topology of the underlying species tree, that is, the phylogenetic tree depicting the evolutionary history of the set of species under consideration. In order to discover "outlying" gene trees that do not follow the "main distribution(s)" of trees, we apply the "tropical metric," a well-defined metric over the space of phylogenetic trees under the max-plus algebra from tropical geometry, to non-parametric estimation of the distribution of gene trees over the tree space. The kernel density estimator (KDE) is one of the most popular non-parametric estimators of a distribution from a given sample, and we propose an analogue of the classical KDE in the setting of tropical geometry, using the tropical metric, which measures the length of an intrinsic geodesic between trees over the tree space. We estimate the probability of an observed tree from the empirical frequencies of nearby trees, with the level of influence determined by the tropical metric. Then, with simulated data generated from the multispecies coalescent model, we show that non-parametric estimation of the gene tree distribution using the tropical metric performs better, in terms of both computational time and accuracy, than the estimator of Weyenberg et al., which uses the Billera-Holmes-Vogtmann (BHV) metric. We then apply the method to Apicomplexa data.
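The idea of scoring each gene tree by how many other trees lie close to it under the tropical metric can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each tree is encoded as a vector of pairwise leaf-to-leaf distances, uses the standard tropical distance max_i(u_i - v_i) - min_i(u_i - v_i), and makes three illustrative choices that the abstract does not specify: a Gaussian-type kernel, a fixed bandwidth, and leave-one-out scoring. The names `tropical_distance` and `kde_scores` are hypothetical, and the scores are left unnormalized.

```python
import math


def tropical_distance(u, v):
    """Tropical metric between two equal-length vectors:
    max_i(u_i - v_i) - min_i(u_i - v_i)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    diffs = [a - b for a, b in zip(u, v)]
    return max(diffs) - min(diffs)


def kde_scores(trees, bandwidth=1.0):
    """Unnormalized leave-one-out kernel density score for each tree.

    Assumptions (not from the abstract): Gaussian-type kernel,
    one shared bandwidth, and no normalizing constant.
    """
    n = len(trees)
    if n < 2:
        raise ValueError("need at least two trees")
    scores = []
    for i, t in enumerate(trees):
        total = 0.0
        for j, s in enumerate(trees):
            if i == j:
                continue  # leave the tree itself out of its own score
            d = tropical_distance(t, s)
            total += math.exp(-((d / bandwidth) ** 2))
        scores.append(total / (n - 1))
    return scores


# Toy data: four leaves, so six pairwise distances per tree,
# ordered as (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).
trees = [
    [2.0, 4.0, 4.0, 4.0, 4.0, 2.0],  # topology ((1,2),(3,4))
    [2.0, 4.0, 4.0, 4.0, 4.0, 2.2],  # same topology, slightly perturbed
    [2.1, 4.0, 4.0, 4.0, 4.0, 2.0],  # same topology, slightly perturbed
    [4.0, 2.0, 4.0, 4.0, 2.0, 4.0],  # topology ((1,3),(2,4))
]
scores = kde_scores(trees, bandwidth=1.0)

# The tree with the lowest score is the outlier candidate.
outlier = min(range(len(trees)), key=scores.__getitem__)
```

In this toy sample the last tree has a different topology from the other three, so it receives the lowest score. In practice the bandwidth and the cutoff for calling a tree "outlying" would have to be chosen from the data.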