Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
Syst Biol. 2018 Mar 1;67(2):285-303. doi: 10.1093/sysbio/syx077.
With the increasing availability of whole genome data, many species trees are being constructed from hundreds to thousands of loci. Although concatenation analysis using maximum likelihood is a standard approach for estimating species trees, it does not account for gene tree heterogeneity, which can occur due to many biological processes, such as incomplete lineage sorting. Coalescent species tree estimation methods, many of which are statistically consistent in the presence of incomplete lineage sorting, include Bayesian methods that coestimate the gene trees and the species tree, summary methods that compute the species tree by combining estimated gene trees, and site-based methods that infer the species tree from site patterns in the alignments of different loci. Due to concerns that poor-quality loci will reduce the accuracy of estimated species trees, many recent phylogenomic studies have removed or filtered genes on the basis of phylogenetic signal and/or missing data prior to inferring species trees; however, little is known about the performance of species tree estimation methods when gene filtering is performed. We examine how incomplete lineage sorting, phylogenetic signal of individual loci, and missing data affect the absolute and the relative accuracy of species tree estimation methods and show how these properties affect methods' responses to gene filtering strategies. In particular, summary methods (ASTRAL-II, ASTRID, and MP-EST), a site-based coalescent method (SVDquartets within PAUP*), and an unpartitioned concatenation analysis using maximum likelihood (RAxML) were evaluated on a heterogeneous collection of simulated multilocus data sets, and the following trends were observed.
Filtering genes based on gene tree estimation error improved the accuracy of the summary methods when levels of incomplete lineage sorting were low to moderate but did not benefit the summary methods under higher levels of incomplete lineage sorting, unless gene tree estimation error was also extremely high (a model condition with few replicates). Neither SVDquartets nor concatenation analysis using RAxML benefited from filtering genes on the basis of gene tree estimation error. Finally, filtering genes based on missing data was either neutral (i.e., did not impact accuracy) or else reduced the accuracy of all five methods. By providing insight into the consequences of gene filtering, we offer recommendations for estimating species trees in the presence of incomplete lineage sorting and reconcile seemingly conflicting observations made in prior studies regarding the impact of gene filtering.