Mendes Fábio K, Hahn Matthew W
Department of Biology, Indiana University, Bloomington, IN 47405, USA.
Department of Computer Science, Indiana University, Bloomington, IN 47405, USA.
Syst Biol. 2018 Jan 1;67(1):158-169. doi: 10.1093/sysbio/syx063.
Genome-scale sequencing has been of great benefit in recovering species trees but has not provided final answers. Despite the rapid accumulation of molecular sequences, resolving short and deep branches of the tree of life has remained a challenge and has prompted the development of new strategies that can make the best use of available data. One such strategy, the concatenation of gene alignments, can be successful when coupled with many tree estimation methods, but has also been shown to fail when there are high levels of incomplete lineage sorting. Here, we focus on the failure of likelihood-based methods to retrieve a rooted, asymmetric four-taxon species tree from concatenated data when the species tree is in or near the anomaly zone, a region of parameter space where the most common gene tree does not match the species tree because of incomplete lineage sorting. First, we use coalescent theory to prove that most informative sites will support the species tree in the anomaly zone, and that as a consequence maximum-parsimony succeeds in recovering the species tree from concatenated data. We further show that maximum-likelihood tree estimation from concatenated data fails both inside and outside the anomaly zone, and that this failure cannot be easily predicted from the topology of the most common gene tree. We demonstrate that likelihood-based methods often fail in a region partially overlapping the anomaly zone, likely because of the lower relative cost of substitutions on discordant gene tree branches that are absent from the species tree. Our results confirm and extend previous reports on the performance of these methods applied to concatenated data from a rooted, asymmetric four-taxon species tree, and highlight avenues for future work to improve the performance of methods aimed at recovering species trees.
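To make the anomaly zone concrete, here is a minimal Python sketch of the boundary usually given for a rooted, asymmetric four-taxon species tree (((A,B),C),D). The formula is not stated in the abstract; it is the commonly cited bound attributed to Degnan and Rosenberg (2006) and is used here as an assumption. The names `anomaly_zone_bound` and `in_anomaly_zone` are hypothetical. The sketch assumes `x` is the deeper internal branch (ancestral to A, B, and C) and `y` is the shallower one (ancestral to A and B), both in coalescent units.

```python
import math


def anomaly_zone_bound(x: float) -> float:
    """Return a(x), the assumed upper limit on y for the anomaly zone.

    x is the length of the deeper internal branch of (((A,B),C),D),
    in coalescent units. Formula assumed from the literature:
    a(x) = log(2/3 + (3*exp(2x) - 2) / (18*(exp(3x) - exp(2x)))).
    """
    if x <= 0:
        raise ValueError("x must be positive")
    e2x = math.exp(2 * x)
    e3x = math.exp(3 * x)
    return math.log(2 / 3 + (3 * e2x - 2) / (18 * (e3x - e2x)))


def in_anomaly_zone(x: float, y: float) -> bool:
    """True if branch lengths (x, y) fall inside the assumed anomaly zone."""
    return y < anomaly_zone_bound(x)


if __name__ == "__main__":
    for x, y in [(0.1, 0.1), (0.2, 0.2), (1.0, 0.01)]:
        print(f"x={x}, y={y}: in anomaly zone = {in_anomaly_zone(x, y)}")
```

Under this bound, two short internal branches of 0.1 coalescent units each fall inside the zone, while a long deeper branch excludes it however short `y` is. As the abstract stresses, membership in the zone describes the most common gene tree only, and does not by itself predict where maximum likelihood on concatenated data fails.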