Terraces in species tree inference from gene trees.

Author information

Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh.

Publication information

BMC Ecol Evol. 2024 Nov 4;24(1):135. doi: 10.1186/s12862-024-02309-z.

DOI:10.1186/s12862-024-02309-z

PMID:39497030

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11533290/

Abstract

A terrace in a phylogenetic tree space is a region where all trees contain the same set of subtrees because of certain patterns of missing data among the sampled taxa, so that every tree in the region receives an identical optimality score for a given data set. Terraces were first investigated in the context of phylogenetic tree estimation from sequence alignments using maximum likelihood (ML) and maximum parsimony (MP). The concept was later extended to species tree inference from a collection of gene trees, where a set of equally optimal species trees was referred to as a "pseudo" species tree terrace; that notion does not consider the topological proximity of the trees in terms of the induced subtrees resulting from certain patterns of missing data. In this study, we mathematically characterize species tree terraces and investigate the properties and conditions that lead multiple species trees to induce/display an identical set of locus-specific subtrees owing to missing data. We report that species tree terraces are agnostic to gene tree heterogeneity. Therefore, we introduce and characterize a special type of gene tree topology-aware terrace, which we call a "peak terrace". Moreover, we investigate various challenges and opportunities related to species tree terraces through extensive empirical studies using simulated and real biological data. We demonstrate the prevalence of species tree terraces and the resulting ambiguity created for tree search algorithms. Remarkably, our findings indicate that the identification of terraces could potentially lead to advances that enhance the accuracy of summary methods and provide reasonably accurate branch support.
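The core idea, that distinct species trees can induce the same locus-specific subtrees when loci have missing taxa, can be illustrated with a small sketch. This is not the paper's implementation: it assumes rooted trees encoded as nested Python tuples, and the taxa A to E, the two locus taxon sets, and all function names are hypothetical choices made only for illustration.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, taxa):
    """Induced subtree on `taxa`: drop other leaves, suppress unary nodes."""
    if isinstance(tree, str):
        return tree if tree in taxa else None
    kept = [r for r in (restrict(child, taxa) for child in tree) if r is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # a node with a single surviving child is collapsed
    return tuple(kept)


def clusters(tree):
    """Canonical topology: the leaf set below every internal node."""
    if tree is None or isinstance(tree, str):
        return frozenset()
    result = {frozenset(leaves(tree))}
    for child in tree:
        result |= clusters(child)
    return frozenset(result)


def induced_profile(tree, taxon_sets):
    """Topologies of the subtrees a species tree induces, one per locus."""
    return [clusters(restrict(tree, taxa)) for taxa in taxon_sets]


# Hypothetical missing-data pattern: locus 1 lacks E, locus 2 lacks C and D.
locus_taxa = [{"A", "B", "C", "D"}, {"A", "B", "E"}]

# Hypothetical candidate species trees on taxa A..E.
tree_1 = (((("A", "B"), "C"), "D"), "E")
tree_2 = (((("A", "B"), "E"), "C"), "D")
tree_3 = (((("A", "C"), "B"), "D"), "E")

# tree_1 and tree_2 differ, yet induce the same subtree for every locus.
same_terrace = induced_profile(tree_1, locus_taxa) == induced_profile(tree_2, locus_taxa)
distinct_trees = clusters(tree_1) != clusters(tree_2)
off_terrace = induced_profile(tree_3, locus_taxa) != induced_profile(tree_1, locus_taxa)
```

Under this toy pattern, `tree_1` and `tree_2` are different topologies that no locus can tell apart, because each locus sees only the taxa it sampled, whereas `tree_3` induces a different subtree for locus 1. Any score computed solely from the induced subtrees would therefore be tied for `tree_1` and `tree_2`.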
