Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1205, Bangladesh.
Applied Statistics and Data Science (ASDS), Department of Statistics, Jahangirnagar University, Dhaka 1342, Bangladesh.
Syst Biol. 2021 Oct 13;70(6):1213-1231. doi: 10.1093/sysbio/syab026.
Species tree estimation from multilocus data sets is extremely challenging, especially in the presence of gene tree heterogeneity across the genome due to incomplete lineage sorting (ILS). Summary methods have been developed that first estimate gene trees and then combine them into a species tree by optimizing a chosen score. In this study, we extend and adapt the concept of phylogenetic terraces to species tree estimation by "summarizing" a set of gene trees, where multiple species trees with distinct topologies may have exactly the same optimality score (e.g., quartet score or extra lineage score). In particular, we investigated the presence and impact of equally optimal trees in species tree estimation from multilocus data using summary methods that take ILS into account. We analyzed two of the most popular ILS-aware optimization criteria: maximize quartet consistency (MQC) and minimize deep coalescence (MDC). Methods based on MQC are provably statistically consistent, whereas MDC is not a consistent criterion for species tree estimation. We present a comprehensive comparative study of these two optimality criteria. Our experiments, on a collection of data sets simulated under ILS, indicate that MDC may yield quartet consistency scores that are competitive with or identical to those of MQC, yet could be significantly worse than MQC in terms of tree accuracy, which demonstrates the presence and impact of equally optimal species trees. This is the first known study that provides the conditions under which data sets have equally optimal trees in the context of phylogenomic inference using summary methods. [Gene tree; incomplete lineage sorting; phylogenomic analysis; species tree; summary method.]
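To make the idea of equally optimal species trees concrete, the following is a minimal toy sketch of a quartet consistency score. It is not the authors' implementation. The nested-tuple tree encoding and the helper names (`leaves`, `clusters`, `quartet_topologies`, `quartet_score`) are assumptions made for illustration only. The score counts, over all gene trees, the four-taxon subsets whose induced topology agrees with the candidate species tree.

```python
from itertools import combinations

# Hypothetical encoding: a leaf is a string, an internal node is a tuple of children.

def leaves(tree):
    """Return the set of leaf labels under this node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clusters(tree):
    """Return the leaf set of every internal node (one per edge above it)."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)]
    for child in tree:
        found.extend(clusters(child))
    return found

def quartet_topologies(tree):
    """Map each 4-taxon subset to its induced split, e.g. {a,b} | {c,d}."""
    all_clusters = clusters(tree)
    topologies = {}
    for quartet in combinations(sorted(leaves(tree)), 4):
        taxa = frozenset(quartet)
        for cluster in all_clusters:
            side = cluster & taxa
            if len(side) == 2:  # this edge separates the quartet 2 versus 2
                topologies[taxa] = frozenset([side, taxa - side])
                break
    return topologies

def quartet_score(species_tree, gene_trees):
    """Count gene-tree quartets whose topology matches the species tree."""
    species_topologies = quartet_topologies(species_tree)
    score = 0
    for gene_tree in gene_trees:
        for taxa, split in quartet_topologies(gene_tree).items():
            if species_topologies.get(taxa) == split:
                score += 1
    return score

# Two conflicting gene trees on four taxa (made-up toy input).
gene_trees = [(("a", "b"), ("c", "d")), (("a", "c"), ("b", "d"))]

candidate_1 = (("a", "b"), ("c", "d"))
candidate_2 = (("a", "c"), ("b", "d"))
candidate_3 = (("a", "d"), ("b", "c"))

score_1 = quartet_score(candidate_1, gene_trees)
score_2 = quartet_score(candidate_2, gene_trees)
score_3 = quartet_score(candidate_3, gene_trees)
```

On this toy input, `candidate_1` and `candidate_2` have different topologies but the same quartet score, so an MQC-style search cannot prefer one over the other. This is the kind of tie among distinct species trees that the abstract describes.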