Bayzid Md Shamsuzzoha, Warnow Tandy
Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA.
J Comput Biol. 2012 Jun;19(6):591-605. doi: 10.1089/cmb.2012.0037.
The estimation of species trees typically involves the estimation of trees and alignments on many different genes, so that the species tree can be based on many different parts of the genome. This kind of phylogenomic approach to species tree estimation has the potential to produce more accurate species tree estimates, especially when gene trees can differ from the species tree due to processes such as incomplete lineage sorting (ILS), gene duplication and loss, and horizontal gene transfer. Because ILS (also called "deep coalescence") is a frequent problem in systematics, many methods have been developed to estimate species trees from gene trees or alignments that specifically take ILS into consideration. In this paper we consider the problem of estimating species trees from gene trees and alignments for the general case where the gene trees and alignments can be incomplete, which means that not all the genes contain sequences for all the species. We formalize optimization problems for this context and prove theoretical results for these problems. We also present the results of a simulation study evaluating existing methods for estimating species trees from incomplete gene trees. Our simulation study shows that *BEAST, a statistical method for estimating species trees from gene sequence alignments, produces by far the most accurate species trees. However, *BEAST can only be run on small datasets. The second most accurate method, MRP (a standard supertree method), can analyze very large datasets and produces very good trees, making MRP a potentially acceptable alternative to *BEAST for large datasets.
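As a rough illustration of how a supertree method such as MRP can accept incomplete gene trees, the sketch below builds the standard matrix-representation encoding: each non-root clade of each gene tree becomes one binary character, scored 1 for species inside the clade, 0 for species elsewhere in that gene tree, and ? for species the gene does not sample. The nested-tuple tree format and the function names clades and mrp_matrix are assumptions made for this example, not taken from the paper, and the gene trees are assumed to be rooted. The parsimony search that MRP then runs on this matrix is not shown.

```python
from typing import Dict, FrozenSet, List, Tuple, Union

# Hypothetical tree format for this sketch: a leaf is a species name,
# an internal node is a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def clades(tree: Tree, out: List[FrozenSet[str]]) -> FrozenSet[str]:
    """Return the leaf set of `tree`; append every internal node's leaf set to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(clades(child, out) for child in tree))
    out.append(leaves)
    return leaves


def mrp_matrix(gene_trees: List[Tree]) -> Dict[str, str]:
    """Encode rooted gene trees as an MRP character matrix (one string per species)."""
    per_tree = []
    for tree in gene_trees:
        found: List[FrozenSet[str]] = []
        tree_taxa = clades(tree, found)
        per_tree.append((tree_taxa, found))

    # The species set is the union of the leaf sets of all gene trees.
    all_taxa = sorted(frozenset().union(*(taxa for taxa, _ in per_tree)))
    rows: Dict[str, List[str]] = {taxon: [] for taxon in all_taxa}

    for tree_taxa, found in per_tree:
        for clade in found:
            if clade == tree_taxa:
                continue  # the root clade carries no grouping information
            for taxon in all_taxa:
                if taxon not in tree_taxa:
                    rows[taxon].append("?")  # species missing from this gene
                elif taxon in clade:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(chars) for taxon, chars in rows.items()}
```

For example, mrp_matrix([(("A", "B"), ("C", "D")), (("A", "B"), "E")]) gives the row "101" for A and B, "01?" for C and D, and "??0" for E: the second gene lacks C and D, and the first lacks E, so those cells are coded as missing data rather than as absence from a clade.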