Misra Navodit, Blelloch Guy, Ravi R, Schwartz Russell
Max Planck Institute for Molecular Genetics, Berlin, Germany.
J Comput Biol. 2011 Nov;18(11):1599-609. doi: 10.1089/cmb.2011.0164. Epub 2011 Sep 27.
Much modern work in phylogenetics depends on statistical sampling approaches to phylogeny construction to estimate probability distributions of possible trees for any given input data set. Our theoretical understanding of sampling approaches to phylogenetics remains far less developed than that for optimization approaches, however, particularly with regard to the number of sampling steps needed to produce accurate samples of tree partition functions. Despite the many advantages in principle of being able to sample trees from sophisticated probabilistic models, we have little theoretical basis for concluding that the prevailing sampling approaches do in fact yield accurate samples from those models within realistic numbers of steps. We propose a novel approach to phylogenetic sampling intended to be both efficient in practice and more amenable to theoretical analysis than the prevailing methods. The method depends on replacing the standard tree rearrangement moves with an alternative Markov model in which one solves a theoretically hard but practically tractable optimization problem on each step of sampling. The resulting method can be applied to a broad range of standard probability models, yielding practical algorithms for efficient sampling and rigorous proofs of accurate sampling for heated versions of some important special cases. We demonstrate the efficiency and versatility of the method by an analysis of uncertainty in tree inference over varying input sizes. In addition to providing a new practical method for phylogenetic sampling, the technique is likely to prove applicable to many similar problems involving sampling over combinatorial objects weighted by a likelihood model.
现代系统发育学中的许多工作都依赖于用于构建系统发育树的统计抽样方法,以估计任何给定输入数据集可能的树的概率分布。然而,我们对系统发育学抽样方法的理论理解远不如对优化方法的理解成熟,特别是在生成树分区函数的准确样本所需的抽样步骤数量方面。尽管从复杂概率模型中抽样树原则上有许多优点,但我们几乎没有理论依据得出这样的结论:现行的抽样方法实际上能在实际可行的步骤数内从这些模型中产生准确的样本。我们提出了一种新的系统发育抽样方法,旨在在实践中高效且比现行方法更易于进行理论分析。该方法依赖于用一种替代的马尔可夫模型取代标准的树重排操作,在抽样的每一步中,人们要解决一个理论上困难但实际可处理的优化问题。由此产生的方法可应用于广泛的标准概率模型,为高效抽样提供实用算法,并为一些重要特殊情况的加热版本提供准确抽样的严格证明。我们通过分析不同输入规模下树推断的不确定性来证明该方法的效率和通用性。除了为系统发育抽样提供一种新的实用方法外,该技术可能还适用于许多类似问题,这些问题涉及对由似然模型加权的组合对象进行抽样。