Roch Sebastien, Steel Mike
Department of Mathematics, University of Wisconsin-Madison, Madison, WI, USA.
MS Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand.
Theor Popul Biol. 2015 Mar;100C:56-62. doi: 10.1016/j.tpb.2014.12.005. Epub 2014 Dec 26.
The reconstruction of a species tree from genomic data faces a double hurdle. First, the (gene) tree describing the evolution of each gene may differ from the species tree, for instance, due to incomplete lineage sorting. Second, the aligned genetic sequences at the leaves of each gene tree provide merely an imperfect estimate of the topology of the gene tree. In this note, we demonstrate formally that a basic statistical problem arises if one tries to avoid accounting for these two processes and analyses the genetic data directly via a concatenation approach. More precisely, we show that, under the multispecies coalescent with a standard site substitution model, maximum likelihood estimation on sequence data concatenated across genes, performed under the incorrect assumption that all sites have evolved independently and identically on a fixed tree, is a statistically inconsistent estimator of the species tree. Our results provide a formal justification of simulation results described by Kubatko and Degnan (2007) and others, and complement recent theoretical results by DeGiorgio and Degnan (2010) and Chifman and Kubatko (2014).
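The first hurdle mentioned in the abstract, gene trees that disagree with the species tree because of incomplete lineage sorting, can be illustrated with the standard three-taxon case of the multispecies coalescent. The sketch below is not taken from the paper and does not reproduce its inconsistency proof; it only evaluates the well-known topology probabilities for a rooted species tree ((A,B),C). The function name `triplet_gene_tree_probs` and the parameter `t` (internal branch length in coalescent units) are assumptions introduced for illustration.

```python
import math


def triplet_gene_tree_probs(t):
    """Gene tree topology probabilities for the species tree ((A,B),C).

    t is the length of the internal branch in coalescent units
    (a hypothetical input, not a value from the paper).
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # Lineages A and B fail to coalesce on the internal branch with
    # probability exp(-t); the three topologies are then equally likely.
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0, 2.0):
        probs = triplet_gene_tree_probs(t)
        print(t, {k: round(v, 4) for k, v in probs.items()})
```

With a short internal branch, the two discordant topologies together account for most gene trees, which is the regime in which ignoring gene tree heterogeneity is most likely to matter.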