Liu Liang, Anderson Christian, Pearl Dennis, Edwards Scott V
Department of Statistics, University of Georgia, Athens, GA, USA.
Advantage Testing of Boston, Newton Centre, MA, USA.
Methods Mol Biol. 2019;1910:211-239. doi: 10.1007/978-1-4939-9074-0_7.
The multispecies coalescent (MSC) model provides a compelling framework for building phylogenetic trees from multilocus DNA sequence data. The pure MSC is best thought of as a special case of so-called "multispecies network coalescent" models, in which gene flow is allowed among branches of the tree, whereas MSC methods assume there is no gene flow between diverging species. Early implementations of the MSC, such as "parsimony" or "democratic vote" approaches to combining information from multiple gene trees, as well as concatenation, in which DNA sequences from multiple gene trees are combined into a single "supergene," were quickly shown to be inconsistent in some regions of tree space, insofar as they converged on the incorrect species tree as more gene trees and sequence data were accumulated. The anomaly zone, a region of tree space in which the most frequent gene tree differs from the species tree, is one such region where many so-called "coalescent" methods are inconsistent. Second-generation implementations of the MSC employed Bayesian or likelihood models; these are consistent in all regions of gene tree space, but Bayesian methods in particular are incapable of handling the large phylogenomic data sets currently available. Two-step methods, such as MP-EST and ASTRAL, in which gene trees are first estimated and then combined to estimate an overarching species tree, are currently popular in part because they can handle large phylogenomic data sets. These methods are consistent in the anomaly zone but can sometimes provide inappropriate measures of tree support or apportion error and signal in the data inappropriately. MP-EST in particular employs a likelihood model that can be conveniently manipulated to perform statistical tests of competing species trees, incorporating the likelihood of the collected gene trees on each species tree in a likelihood ratio test. Such tests provide a useful alternative to the multilocus bootstrap, which only indirectly tests the appropriateness of competing species trees. We illustrate these tests and implementations of the MSC with examples and suggest that MSC methods are a useful class of models that effectively use information from multiple loci to build phylogenetic trees.
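The idea of comparing competing species trees by the likelihood of the collected gene trees can be illustrated with a toy three-taxon case. The sketch below is not the MP-EST implementation and does not reproduce the tests in the chapter; it only uses the standard MSC result that, for a rooted three-taxon species tree with internal branch length t (in coalescent units), the gene tree topology matching the species tree has probability 1 - (2/3)e^(-t) and each of the two discordant topologies has probability (1/3)e^(-t). All function names and the example counts are hypothetical, chosen for illustration.

```python
import math

# The three possible rooted topologies for taxa A, B, C.
TOPOLOGIES = ("AB|C", "AC|B", "BC|A")


def triplet_probs(species_topology, t):
    """Gene tree topology probabilities under the MSC for a three-taxon
    species tree with internal branch length t (coalescent units)."""
    discordant = math.exp(-t) / 3.0
    return {
        topo: (1.0 - 2.0 * discordant) if topo == species_topology else discordant
        for topo in TOPOLOGIES
    }


def fit_branch_length(counts, species_topology):
    """Maximum likelihood estimate of t for a fixed species tree topology,
    given counts of observed gene tree topologies (assumed error-free)."""
    total = sum(counts.values())
    f = counts[species_topology] / total  # observed concordant frequency
    if f <= 1.0 / 3.0:
        return 0.0        # concordant probability cannot fall below 1/3
    if f >= 1.0:
        return math.inf   # no discordance observed
    return -math.log(1.5 * (1.0 - f))


def log_likelihood(counts, species_topology, t):
    """Multinomial log-likelihood of the gene tree counts (constant term omitted)."""
    probs = triplet_probs(species_topology, t)
    return sum(n * math.log(probs[topo]) for topo, n in counts.items() if n > 0)


def likelihood_ratio_statistic(counts, topo_a, topo_b):
    """Twice the difference in maximized log-likelihood between two species trees."""
    ll_a = log_likelihood(counts, topo_a, fit_branch_length(counts, topo_a))
    ll_b = log_likelihood(counts, topo_b, fit_branch_length(counts, topo_b))
    return 2.0 * (ll_a - ll_b)


if __name__ == "__main__":
    # Hypothetical counts of estimated gene tree topologies across 100 loci.
    counts = {"AB|C": 60, "AC|B": 25, "BC|A": 15}
    for topo in TOPOLOGIES:
        t_hat = fit_branch_length(counts, topo)
        print(topo, round(t_hat, 4), round(log_likelihood(counts, topo, t_hat), 4))
    print("LR statistic:", round(likelihood_ratio_statistic(counts, "AB|C", "AC|B"), 4))
```

Because the competing species trees are not nested hypotheses, the statistic should not be assumed to follow a chi-square distribution; a reference distribution would have to be obtained some other way, for example by simulation or resampling, which this sketch leaves out.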