Bioinformatics and Systems Biology, UC San Diego, La Jolla, CA, USA.
Department of Electrical and Computer Engineering, UC San Diego, La Jolla, CA, USA.
Mol Biol Evol. 2022 Dec 5;39(12). doi: 10.1093/molbev/msac215.
Phylogenomic analyses routinely estimate species trees using methods that account for gene tree discordance. However, the most scalable species tree inference methods, which summarize independently inferred gene trees to obtain a species tree, are sensitive to hard-to-avoid errors introduced in the gene tree estimation step. This dilemma has fueled much debate on the merits of concatenation versus summary methods and has created practical obstacles to using summary methods more widely and to the exclusion of concatenation. The most successful attempt at making summary methods resilient to noisy gene trees has been contracting low-support branches in the gene trees. Unfortunately, this approach requires arbitrary thresholds and poses new challenges. Here, we introduce threshold-free weighting schemes for the quartet-based species tree inference metric used in the popular method ASTRAL. By reducing the impact of quartets with low support or long terminal branches (or both), weighting provides stronger theoretical guarantees and better empirical performance than unweighted ASTRAL. Our simulations show that weighting improves accuracy across many conditions and reduces the gap with concatenation in conditions with low gene tree discordance and high noise. On empirical data, weighting improves congruence with concatenation and increases support. Together, our results show that weighting, enabled by a new optimization algorithm we introduce, improves the utility of summary methods and can reduce the incongruence often observed across analytical pipelines.
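The idea of down-weighting quartets with low support or long terminal branches can be illustrated with a toy calculation. The sketch below is not the method's actual formula, which this abstract does not state: the weight function `quartet_weight` (support multiplied by an exponential decay in total terminal branch length), the record layout, and the numbers are all assumptions chosen for illustration. It tallies, for a single set of four taxa, how often each quartet topology appears across gene trees, first with every gene tree counting equally and then with each occurrence scaled by its weight.

```python
import math
from collections import defaultdict


def quartet_weight(support, terminal_length):
    """Hypothetical weight: close to 1 for well-supported quartets with
    short terminal branches, close to 0 otherwise (an assumed form)."""
    return support * math.exp(-terminal_length)


def weighted_quartet_scores(gene_quartets):
    """Sum the weights of each quartet topology over all gene trees."""
    scores = defaultdict(float)
    for topology, support, terminal_length in gene_quartets:
        scores[topology] += quartet_weight(support, terminal_length)
    return dict(scores)


# Made-up records, one per gene tree: (topology, support of the internal
# branch in [0, 1], summed length of the four terminal branches).
gene_quartets = [
    ("ab|cd", 0.95, 0.2),
    ("ab|cd", 0.90, 0.3),
    ("ac|bd", 0.30, 1.5),
    ("ac|bd", 0.25, 2.0),
    ("ac|bd", 0.20, 1.8),
]

# Unweighted tally: every gene tree contributes one vote.
unweighted = defaultdict(int)
for topology, _, _ in gene_quartets:
    unweighted[topology] += 1

weighted = weighted_quartet_scores(gene_quartets)

best_unweighted = max(unweighted, key=unweighted.get)
best_weighted = max(weighted, key=weighted.get)
print("unweighted winner:", best_unweighted)
print("weighted winner:", best_weighted)
```

In this made-up input, three poorly supported, long-branched gene trees outvote two well-supported ones when every quartet counts equally, and the preference flips once the weights are applied.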