Department of Integrative Biology, University of California, Berkeley, CA 94720, USA; Department of Biology, Pennsylvania State University, University Park, PA 16802, USA; and Department of Mathematics and Statistics, University of New Mexico, 1 University of New Mexico, Albuquerque, NM 87131, USA.
Syst Biol. 2014 Jan 1;63(1):66-82. doi: 10.1093/sysbio/syt059. Epub 2013 Aug 29.
To infer species trees from gene trees estimated from phylogenomic data sets, tractable methods are needed that can handle dozens to hundreds of loci. We examine several computationally efficient approaches (MP-EST, STAR, STEAC, STELLS, and STEM) for inferring species trees from gene trees estimated using maximum likelihood (ML) and Bayesian approaches. Among the methods examined, we found that topology-based methods often performed better using ML gene trees and methods employing coalescent times typically performed better using Bayesian gene trees, with MP-EST, STAR, STEAC, and STELLS outperforming STEM under most conditions. We examine why the STEM tree (also called GLASS or Maximum Tree) is less accurate on estimated gene trees by comparing estimated and true coalescence times, performing species tree inference using simulations, and analyzing a great ape data set while keeping track of false positive and false negative rates for inferred clades. We find that although true coalescence times are more ancient than speciation times under the multispecies coalescent model, estimated coalescence times are often more recent than speciation times. This underestimation can lead to increased bias and lack of resolution with increased sampling (of either alleles or loci) when gene trees are estimated with ML. The problem appears to be less severe using Bayesian gene-tree estimates.
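The sampling effect described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' simulation design. It assumes that a GLASS/Maximum Tree style estimate of the divergence between two species is the minimum pairwise coalescence time observed across loci, that true coalescence times equal the speciation time plus an exponential waiting time (in coalescent units), and that estimation error is multiplicative and lognormal. The function name `min_coalescence_times` and the parameters `tau` and `noise_sd` are hypothetical and chosen for illustration.

```python
import random


def min_coalescence_times(n_loci, tau=1.0, noise_sd=0.3, seed=0):
    """Return (min true time, min estimated time) across n_loci loci.

    tau      -- assumed speciation time for one pair of species
    noise_sd -- assumed log-scale standard deviation of the estimation error
    """
    rng = random.Random(seed)
    min_true = float("inf")
    min_est = float("inf")
    for _ in range(n_loci):
        # Under the multispecies coalescent, two lineages from different
        # species can only coalesce before (older than) the speciation time.
        true_t = tau + rng.expovariate(1.0)
        # Assumed error model: the estimated time is the true time multiplied
        # by lognormal noise, so an estimate can fall below tau.
        est_t = true_t * rng.lognormvariate(0.0, noise_sd)
        min_true = min(min_true, true_t)
        min_est = min(min_est, est_t)
    return min_true, min_est


# With a fixed seed, the first n loci are identical for every n, so each
# minimum can only stay the same or decrease as loci are added.
for n in (1, 10, 100, 1000):
    true_min, est_min = min_coalescence_times(n)
    print(f"loci={n:5d}  min true={true_min:.3f}  min estimated={est_min:.3f}")
```

In this sketch the minimum of the true times never falls below `tau` and approaches it as loci are added, while the minimum of the noisy estimates keeps dropping, which mirrors the abstract's point that more sampling can increase bias when coalescence times are underestimated.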