Phylogeny of mixture models: robustness of maximum likelihood and non-identifiable distributions.

Authors

Stefankovic Daniel, Vigoda Eric

Affiliation

Department of Computer Science, University of Rochester, Rochester, New York 14627, USA.

Publication information

J Comput Biol. 2007 Mar;14(2):156-89. doi: 10.1089/cmb.2006.0126.

DOI:10.1089/cmb.2006.0126

PMID:17456014

Abstract

We address phylogenetic reconstruction when the data is generated from a mixture distribution. Such topics have gained considerable attention in the biological community with the clear evidence of heterogeneity of mutation rates. In our work we consider data coming from a mixture of trees which share a common topology but differ in their edge weights (i.e., branch lengths). We first show the pitfalls of popular methods, including maximum likelihood and Markov chain Monte Carlo algorithms. We then determine in which evolutionary models reconstructing the tree topology under a mixture distribution is (im)possible. We prove that every model whose transition matrices can be parameterized by an open set of multilinear polynomials either has non-identifiable mixture distributions, in which case reconstruction is impossible in general, or admits linear tests which identify the topology. This duality theorem relies on our notion of linear tests and uses ideas from convex programming duality. Linear tests are closely related to linear invariants, which were first introduced by Lake, and are natural from an algebraic geometry perspective.
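
To make the setting concrete, the following is a minimal sketch of a mixture of two trees that share one topology but differ in branch lengths. It assumes the two-state symmetric (CFN) model on the quartet ((a,b),(c,d)), chosen here only for illustration; the abstract does not single out this model. The function names `pattern_distribution` and `mixture`, and the flip probabilities used, are hypothetical.

```python
from itertools import product


def pattern_distribution(p):
    """Leaf-pattern distribution on the quartet ((a,b),(c,d)).

    Assumption: two-state symmetric (CFN) model with a uniform root.
    p = (pa, pb, pc, pd, pe) are the flip probabilities on the four
    leaf edges and on the internal edge.
    """
    pa, pb, pc, pd, pe = p

    def t(prob, x, y):
        # Transition probability along one edge: flip with probability prob.
        return prob if x != y else 1.0 - prob

    dist = {}
    for a, b, c, d in product((0, 1), repeat=4):
        total = 0.0
        # Sum over the states u, v of the two internal nodes.
        for u, v in product((0, 1), repeat=2):
            total += (0.5 * t(pa, u, a) * t(pb, u, b) * t(pe, u, v)
                      * t(pc, v, c) * t(pd, v, d))
        dist[(a, b, c, d)] = total
    return dist


def mixture(components, weights):
    """Convex combination of pattern distributions on the same topology."""
    return {k: sum(w * comp[k] for w, comp in zip(weights, components))
            for k in components[0]}


# Two branch-length settings on the same topology (illustrative values).
tree1 = pattern_distribution((0.05, 0.40, 0.05, 0.40, 0.10))
tree2 = pattern_distribution((0.40, 0.05, 0.40, 0.05, 0.10))
mix = mixture([tree1, tree2], [0.5, 0.5])
```

The mixture `mix` is itself a probability distribution over the 16 leaf patterns. The reconstruction question in the abstract is whether the shared topology can be recovered from such a distribution alone.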