Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous.
Authors
Kolaczkowski Bryan, Thornton Joseph W
Affiliation
Department of Computer and Information Science, University of Oregon, Eugene, Oregon 97403, USA.
Publication
Nature. 2004 Oct 21;431(7011):980-4. doi: 10.1038/nature02917.
All inferences in comparative biology depend on accurate estimates of evolutionary relationships. Recent phylogenetic analyses have turned away from maximum parsimony towards the probabilistic techniques of maximum likelihood and Bayesian Markov chain Monte Carlo (BMCMC). These probabilistic techniques represent a parametric approach to statistical phylogenetics, because their criterion for evaluating a topology (the probability of the data, given the tree) is calculated with reference to an explicit evolutionary model from which the data are assumed to be identically distributed. Maximum parsimony can be considered nonparametric, because trees are evaluated on the basis of a general metric (the minimum number of character state changes required to generate the data on a given tree) without assuming a specific distribution. The shift to parametric methods was spurred, in large part, by studies showing that although both approaches perform well most of the time, maximum parsimony is strongly biased towards recovering an incorrect tree under certain combinations of branch lengths, whereas maximum likelihood is not. All these evaluations simulated sequences by a largely homogeneous evolutionary process in which data are identically distributed. There is ample evidence, however, that real-world gene sequences evolve heterogeneously and are not identically distributed. Here we show that maximum likelihood and BMCMC can become strongly biased and statistically inconsistent when the rates at which sequence sites evolve change non-identically over time. Maximum parsimony performs substantially better than current parametric methods over a wide range of conditions tested, including moderate heterogeneity and phylogenetic problems not normally considered difficult.
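The parsimony criterion described in the abstract, the minimum number of character state changes required on a given tree, can be illustrated with a small sketch. The code below uses Fitch's small-parsimony algorithm, a standard way to compute that count for unordered characters on a binary tree; it is not taken from the paper, and the nested-tuple tree encoding, the function names, and the four-taxon toy alignment are all assumptions made for illustration.

```python
from typing import Dict, Set, Tuple, Union

# Hypothetical tree encoding: a leaf is a taxon name, an internal node is a pair.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_site(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Return (candidate state set, minimum changes) for one alignment site."""
    if isinstance(tree, str):
        # Leaf: the observed state, no changes needed below it.
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change at this node.
        return shared, left_changes + right_changes
    # Children disagree: take the union and count one additional change.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree: Tree, alignment: Dict[str, str]) -> int:
    """Sum the per-site Fitch counts over all columns of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        site = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, site)[1]
    return total


# Toy data (invented for this example, not from the paper).
alignment = {"A": "AAGT", "B": "AAGT", "C": "CAGC", "D": "CTGC"}
tree_ab_cd: Tree = (("A", "B"), ("C", "D"))
tree_ac_bd: Tree = (("A", "C"), ("B", "D"))
tree_ad_bc: Tree = (("A", "D"), ("B", "C"))

for name, tree in [("AB|CD", tree_ab_cd), ("AC|BD", tree_ac_bd), ("AD|BC", tree_ad_bc)]:
    print(name, parsimony_score(tree, alignment))
```

On this toy alignment the AB|CD topology needs 3 changes and the other two need 5 each, so parsimony would prefer AB|CD. Note that the score makes no reference to branch lengths or a substitution model, which is the sense in which the abstract calls the method nonparametric.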