Stefankovic Daniel, Vigoda Eric
Department of Computer Science, University of Rochester, Rochester, New York 14627, USA.
Syst Biol. 2007 Feb;56(1):113-24. doi: 10.1080/10635150701245388.
Different genes often have different phylogenetic histories. Even within regions that share the same phylogenetic history, mutation rates often vary. We investigate the prospects of phylogenetic reconstruction when all the characters are generated from the same tree topology but the branch lengths vary (with possibly different tree shapes). Extending the work of Kolaczkowski and Thornton (2004, Nature 431: 980-984) and Chang (1996, Math. Biosci. 134: 189-216), we show examples where maximum likelihood (under a homogeneous model) is an inconsistent estimator of the tree. We then explore the prospects of phylogenetic inference under a heterogeneous model. In some models, there are examples where phylogenetic inference under any method is impossible, despite the fact that there is a common tree topology. In particular, there are nonidentifiable mixture distributions, i.e., multiple topologies generate identical mixture distributions. We address which evolutionary models have nonidentifiable mixture distributions and prove that the following duality theorem holds for most DNA substitution models. The model has either: (i) nonidentifiability: two different tree topologies can produce identical mixture distributions, and hence distinguishing between the two topologies is impossible; or (ii) linear tests: there exist linear tests which identify the common tree topology for character data generated by a mixture distribution. The theorem holds for models whose transition matrices can be parameterized by open sets, which includes most of the popular models, such as Tamura-Nei and Kimura's 2-parameter model. The duality theorem relies on our notion of linear tests, which are related to Lake's linear invariants.
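To make the notion of a mixture distribution on a common topology concrete, the following is a minimal sketch, not the authors' construction. It assumes the two-state symmetric (CFN) model on the quartet tree ((1,2),(3,4)), which the abstract does not name; it was chosen only because it is small enough to enumerate. The function names, the edge probabilities, the mixture weights, and the particular linear functional are all hypothetical. The sketch shows that a mixture is a weighted sum of site-pattern distributions, so any linear function of pattern frequencies evaluates on the mixture as the same weighted average of its values on the components. This is the property that lets a linear test keep its sign under mixing.

```python
from itertools import product

def quartet_pattern_dist(flip):
    """Site-pattern distribution under the CFN model (an assumption of this
    sketch) on the quartet tree ((1,2),(3,4)).
    flip = (p1, p2, p3, p4, p_int): per-edge probabilities of a state change."""
    p1, p2, p3, p4, p_int = flip

    def edge(p, a, b):
        # Probability of ending in state b given start state a across an edge.
        return p if a != b else 1.0 - p

    dist = {}
    for leaves in product((0, 1), repeat=4):
        total = 0.0
        # Sum over the states (u, v) of the two internal nodes; uniform root at u.
        for u, v in product((0, 1), repeat=2):
            prob = 0.5 * edge(p_int, u, v)
            prob *= edge(p1, u, leaves[0]) * edge(p2, u, leaves[1])
            prob *= edge(p3, v, leaves[2]) * edge(p4, v, leaves[3])
            total += prob
        dist[leaves] = total
    return dist

def mixture(weights, components):
    """Weighted sum of pattern distributions that share one topology."""
    return {pat: sum(w * c[pat] for w, c in zip(weights, components))
            for pat in components[0]}

def agree_12_minus_13(pattern):
    """Hypothetical linear functional, for illustration only:
    +1 if leaves 1 and 2 agree, -1 if leaves 1 and 3 agree."""
    return float(pattern[0] == pattern[1]) - float(pattern[0] == pattern[2])

def apply_linear(coeff, dist):
    """Evaluate a linear function of the pattern probabilities."""
    return sum(coeff(pat) * prob for pat, prob in dist.items())

# Same topology, two different sets of branch lengths (illustrative values).
d_a = quartet_pattern_dist((0.05, 0.30, 0.05, 0.30, 0.10))
d_b = quartet_pattern_dist((0.30, 0.05, 0.30, 0.05, 0.02))
mix = mixture((0.3, 0.7), (d_a, d_b))

value_on_mix = apply_linear(agree_12_minus_13, mix)
weighted_avg = (0.3 * apply_linear(agree_12_minus_13, d_a)
                + 0.7 * apply_linear(agree_12_minus_13, d_b))
print(value_on_mix, weighted_avg)
```

The functional here is not claimed to be a valid linear test for any model. It only demonstrates linearity over mixture components, which is why a linear function that separates topologies on single-tree distributions continues to separate them on mixtures.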