Bioinformatics Research Center, Department of Genetics, North Carolina State University, NC, USA.
Syst Biol. 2012 Dec 1;61(6):927-40. doi: 10.1093/sysbio/sys046. Epub 2012 Apr 16.
Among models of nucleotide evolution, the Barry and Hartigan (BH) model (also known as the General Markov Model) is very flexible because it allows a separate, arbitrary substitution matrix along each edge. For a given tree, the estimates of the BH model are a set of joint probability matrices, each giving the pairwise frequencies of nucleotides at the two ends of an edge. We have previously shown that, due to an identifiability problem, these cannot be expected to consistently estimate the actual pairwise frequencies. A further consequence is that internal node frequency estimates are likely to be incorrect. Here we define a nonstationary GTR model for each edge, which we refer to as the NSGTR model. We fit the NSGTR model by minimizing the sum of squared differences between the estimates of transition probabilities under the NSGTR model and the estimates provided by a fitted BH model. The NSGTR model provides estimates that avoid the identifiability difficulties of the BH model while closely fitting it. With the best-fitting NSGTR estimates, we are able to obtain interpretable frequency vectors at internal nodes as well as edge length estimates, which the BH model does not otherwise yield. These edge lengths are interpretable as the expected number of substitutions along an edge under the model. We also show that, for a nonstationary continuous-time model, these are not the same as the edge length parameters for conventional substitution matrices that are output by nonstationary model phylogenetic estimation programs such as nhPhyML.
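The distinction between the two notions of edge length can be illustrated numerically. The sketch below is not the authors' implementation. It assumes a single edge with a GTR rate matrix Q, a start-of-edge frequency vector pi0, and an edge duration t, and it computes the expected number of substitutions as the integral over [0, t] of sum_i pi_i(s) * (-Q_ii), where pi(s) = pi0 * exp(Qs). For comparison it takes the conventional edge length to be t times the substitution rate evaluated at the stationary frequencies of Q, which is an assumption about how such parameters are normalized. All function names and parameter values are hypothetical.

```python
import numpy as np


def gtr_rate_matrix(exchange, pi):
    """Build a GTR rate matrix from 6 exchangeabilities and stationary frequencies pi."""
    a, b, c, d, e, f = exchange
    S = np.array([[0, a, b, c],
                  [a, 0, d, e],
                  [b, d, 0, f],
                  [c, e, f, 0]], dtype=float)
    Q = S * pi                             # Q[i, j] = S[i, j] * pi[j] for i != j
    np.fill_diagonal(Q, -Q.sum(axis=1))    # each row sums to zero
    return Q


def transition_matrix(Q, t):
    """P(t) = exp(Qt) via eigendecomposition (a GTR matrix is diagonalizable)."""
    w, V = np.linalg.eig(Q)
    return (V @ np.diag(np.exp(w * t)) @ np.linalg.inv(V)).real


def expected_substitutions(Q, pi0, t, steps=400):
    """Trapezoid-rule integral of sum_i pi_i(s) * (-Q_ii) for s in [0, t]."""
    s = np.linspace(0.0, t, steps + 1)
    exit_rates = -np.diag(Q)
    rates = np.array([pi0 @ transition_matrix(Q, si) @ exit_rates for si in s])
    h = t / steps
    return h * (rates.sum() - 0.5 * (rates[0] + rates[-1]))


def conventional_length(Q, pi_stationary, t):
    """Edge length if the process were at the stationary frequencies of Q throughout."""
    return t * float(pi_stationary @ -np.diag(Q))


if __name__ == "__main__":
    # Illustrative values only; none of these come from the paper.
    pi_stat = np.array([0.1, 0.2, 0.3, 0.4])   # stationary frequencies of Q
    pi0 = np.array([0.7, 0.1, 0.1, 0.1])       # frequencies at the start of the edge
    Q = gtr_rate_matrix((1, 2, 1, 1, 2, 1), pi_stat)
    t = 0.5
    print("expected substitutions:", expected_substitutions(Q, pi0, t))
    print("conventional length:   ", conventional_length(Q, pi_stat, t))
```

When pi0 equals the stationary frequencies of Q the two quantities coincide. They diverge when the edge starts away from stationarity, which is the nonstationary situation the abstract describes.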