BMC Bioinformatics. 2014;15 Suppl 2(Suppl 2):S8. doi: 10.1186/1471-2105-15-S2-S8. Epub 2014 Jan 31.
Under a Markov model of evolution, recoding (or lumping) the four nucleotides into fewer groups may permit analysis under simpler conditions, but it may yield misleading results unless the evolutionary process of the recoded groups remains Markovian. If a Markov process is lumpable, then the evolutionary process of the recoded groups is Markovian.
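To make the lumpability condition concrete, the following is a minimal sketch, not the tests compared in this paper. It assumes the standard (Kemeny-Snell) criterion for strong lumpability: a process with rate matrix Q is lumpable with respect to a partition if, for every pair of distinct blocks, the total rate from a state into the other block is the same for all states in the starting block. The function names (`rate_matrix`, `is_lumpable`, `lump`), the state order A, C, G, T, and the numerical rates are illustrative assumptions. The second matrix is an arbitrary example chosen only to violate the condition.

```python
def rate_matrix(off_diag):
    """Fill in the diagonal so that every row of the rate matrix sums to zero."""
    n = len(off_diag)
    Q = [row[:] for row in off_diag]
    for i in range(n):
        Q[i][i] = -sum(Q[i][j] for j in range(n) if j != i)
    return Q


def is_lumpable(Q, partition, tol=1e-9):
    """Check the strong-lumpability condition for Q with respect to a partition.

    For each ordered pair of distinct blocks, the summed rate from a state
    into the target block must be identical for every state in the source block.
    """
    for block in partition:
        for target in partition:
            if target is block:
                continue
            sums = [sum(Q[i][j] for j in target) for i in block]
            if max(sums) - min(sums) > tol:
                return False
    return True


def lump(Q, partition):
    """Rate matrix of the recoded process (meaningful only if Q is lumpable)."""
    # Any state in a block gives the same sums, so use the first one.
    return [[sum(Q[block[0]][j] for j in target) for target in partition]
            for block in partition]


# State order (assumed): A=0, C=1, G=2, T=3.
# RY recoding: purines {A, G} and pyrimidines {C, T}.
RY = [[0, 2], [1, 3]]

# Kimura two-parameter style rates: transitions a, transversions b (illustrative values).
a, b = 2.0, 0.5
k80 = rate_matrix([
    [0, b, a, b],
    [b, 0, b, a],
    [a, b, 0, b],
    [b, a, b, 0],
])

# Arbitrary matrix in which A and G leave the purine block at different rates.
other = rate_matrix([
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 3, 0, 1],
    [1, 2, 1, 0],
])

print(is_lumpable(k80, RY))    # True
print(is_lumpable(other, RY))  # False
print(lump(k80, RY))           # [[-1.0, 1.0], [1.0, -1.0]]
```

This check applies to a known rate matrix. With sequence data the matrix must be estimated, which is why a statistical test is needed.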
We consider stationary, reversible, and homogeneous Markov processes on two taxa and compare three tests for lumpability: one using an ad hoc test statistic, which is based on an index that is evaluated using a bootstrap approximation of its distribution; one that is based on a test proposed specifically for Markov chains; and one using a likelihood-ratio test. We show that the likelihood-ratio test is more powerful than the index test, which is more powerful than that based on the Markov chain test statistic. We also show that for stationary processes on binary trees with more than two taxa, the tests can be applied to all pairs. Finally, we show that if the process is lumpable, then estimates obtained under the recoded model agree with estimates obtained under the original model, whereas, if the process is not lumpable, then these estimates can differ substantially. We apply the new likelihood-ratio test for lumpability to two primate data sets, one with a mitochondrial origin and one with a nuclear origin.
Recoding may result in biased phylogenetic estimates when the original evolutionary process is not lumpable. Accordingly, testing for lumpability should be done prior to phylogenetic analysis of recoded data.