Chaudhary Ruchi, Boussau Bastien, Burleigh J Gordon, Fernández-Baca David
Department of Computer Science, Iowa State University, Ames, IA 50011, USA; Department of Biology, University of Florida, Gainesville, FL 32611, USA; and Université de Lyon, Université Lyon 1, CNRS, UMR 5558, Laboratoire de Biométrie et Biologie Evolutive, Villeurbanne F-69622, France.
Syst Biol. 2015 Mar;64(2):325-39. doi: 10.1093/sysbio/syu128. Epub 2014 Dec 23.
With the availability of genomic sequence data, there is increasing interest in using genes with a possible history of duplication and loss for species tree inference. Here we assess the performance of both nonprobabilistic and probabilistic species tree inference approaches using gene duplication and loss and coalescence simulations. We evaluated the performance of gene tree parsimony (GTP) based on duplication (Only-dup), duplication and loss (Dup-loss), and deep coalescence (Deep-c) costs, the NJst distance method, the MulRF supertree method, and PHYLDOG, which jointly estimates gene trees and species tree using a hierarchical probabilistic model. We examined the effects of gene tree and species sampling, gene tree error, and duplication and loss rates on the accuracy of phylogenetic estimates. In the 10-taxon duplication and loss simulation experiments, MulRF is more accurate than the other methods when the duplication and loss rates are low, and Dup-loss is generally the most accurate when the duplication and loss rates are high. PHYLDOG performs well in 10-taxon duplication and loss simulations, but its run time is prohibitively long on larger data sets. In the larger duplication and loss simulation experiments, MulRF outperforms all other methods in experiments with at most 100 taxa; however, in the larger simulation, Dup-loss generally performs best. In all duplication and loss simulation experiments with more than 10 taxa, all methods perform better with more gene trees and fewer missing sequences, and they are all affected by gene tree error. Our results also highlight high levels of error in estimates of duplications and losses from GTP methods and demonstrate the usefulness of methods based on generic tree distances for large analyses.