Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent.

Author information

Roch Sebastien, Steel Mike

Affiliations

Department of Mathematics, University of Wisconsin-Madison, Madison, WI, USA.

MS Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand.

Publication information

Theor Popul Biol. 2015 Mar;100C:56-62. doi: 10.1016/j.tpb.2014.12.005. Epub 2014 Dec 26.

DOI:10.1016/j.tpb.2014.12.005

PMID:25545843

Abstract

The reconstruction of a species tree from genomic data faces a double hurdle. First, the (gene) tree describing the evolution of each gene may differ from the species tree, for instance, due to incomplete lineage sorting. Second, the aligned genetic sequences at the leaves of each gene tree provide merely an imperfect estimate of the topology of the gene tree. In this note, we demonstrate formally that a basic statistical problem arises if one tries to avoid accounting for these two processes and analyses the genetic data directly via a concatenation approach. More precisely, we show that, under the multispecies coalescent with a standard site substitution model, maximum likelihood estimation performed on sequence data concatenated across genes, under the incorrect assumption that all sites have evolved independently and identically on a fixed tree, is a statistically inconsistent estimator of the species tree. Our results provide a formal justification of simulation results described by Kubatko and Degnan (2007) and others, and complement recent theoretical results by DeGiorgio and Degnan (2010) and Chifman and Kubatko (2014).
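
As an illustration of the first hurdle only, the sketch below computes the well-known gene-tree topology probabilities under the multispecies coalescent for a rooted three-taxon species tree ((A,B),C). It is not the construction used in the paper. The function name `gene_tree_probs` and the taxon labels are hypothetical, and the internal branch length `t` is assumed to be measured in coalescent units.

```python
import math


def gene_tree_probs(t):
    """Gene-tree topology probabilities for the species tree ((A,B),C).

    t is the internal branch length in coalescent units (an assumption of
    this sketch). With probability exp(-t) the lineages from A and B fail
    to coalesce on the internal branch; the three lineages then coalesce
    in the root population, where each pairing is equally likely.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    discordant = math.exp(-t) / 3.0      # each of the two mismatching topologies
    concordant = 1.0 - 2.0 * discordant  # topology matching the species tree
    return {
        "((A,B),C)": concordant,
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0, 2.0):
        probs = gene_tree_probs(t)
        print(t, {topology: round(p, 4) for topology, p in probs.items()})
```

For short internal branches a substantial fraction of gene trees disagree with the species tree, which is the heterogeneity that concatenation ignores when it fits a single fixed tree to all sites.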
