Department of Computer Science, Eidgenössische Technische Hochschule Zurich, Zürich, Switzerland.
Mol Biol Evol. 2012 Jan;29(1):271-7. doi: 10.1093/molbev/msr198. Epub 2011 Aug 11.
Codon substitution models have traditionally been parametric Markov models, but empirical and semiempirical models have recently been proposed as well. Parametric codon models are typically based on 61×61 rate matrices that are derived from a small number of parameters. These parameters are rooted in experience and theoretical considerations and generally perform well, but their choice remains relatively arbitrary. We have previously applied principal component analysis (PCA) to data obtained from mammalian sequence alignments to empirically identify the most relevant parameters for codon substitution models, thereby confirming some commonly used parameters but also suggesting new ones. Here, we present a new semiempirical codon substitution model that is directly based on those PCA results. The substitution rate matrix is constructed from linear combinations of the first few (most important) principal components, with the coefficients being free model parameters. Thus, the model is not only based on empirical rates but also uses the parameters empirically determined to be most relevant for a codon model to adjust to the particularities of individual data sets. In comparisons against established parametric and semiempirical models, the new model consistently achieves the highest likelihood values when applied to vertebrate sequences, a group that includes the taxonomic class on which the model was trained.
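As a rough illustration of building a rate matrix from a linear combination of principal components, the following Python sketch may help. It is not the published implementation. The abstract does not state the parameterization, so the sketch assumes that the components describe variation in log exchangeabilities of codon pairs and that the matrix is made reversible through codon frequencies. The function name `build_rate_matrix` and all numeric inputs are hypothetical placeholders, not PCA results from the paper.

```python
import math
import random

N_CODONS = 61                               # sense codons, standard genetic code
N_PAIRS = N_CODONS * (N_CODONS - 1) // 2    # unordered codon pairs


def build_rate_matrix(mean_log_rates, components, coefficients, codon_freqs):
    """Assemble a reversible rate matrix Q from principal components.

    Assumption (not from the source): each component holds one loading per
    unordered codon pair, and the linear combination is taken in log space
    so that every exchangeability stays positive.
    """
    n = len(codon_freqs)

    # Mean vector plus the weighted sum of the leading components.
    log_s = list(mean_log_rates)
    for coeff, component in zip(coefficients, components):
        for k, loading in enumerate(component):
            log_s[k] += coeff * loading

    # Fill off-diagonal entries: q_ij = s_ij * pi_j (symmetric s_ij).
    q = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = math.exp(log_s[k])
            q[i][j] = s * codon_freqs[j]
            q[j][i] = s * codon_freqs[i]
            k += 1

    # Diagonal entries make every row sum to zero, as a rate matrix requires.
    for i in range(n):
        q[i][i] = -sum(q[i])
    return q


# Placeholder inputs: random numbers standing in for real PCA output.
rng = random.Random(0)
mean_log_rates = [rng.gauss(0, 1) for _ in range(N_PAIRS)]
components = [[rng.gauss(0, 1) for _ in range(N_PAIRS)] for _ in range(3)]
raw = [rng.random() + 0.1 for _ in range(N_CODONS)]
freqs = [x / sum(raw) for x in raw]

# The coefficients play the role of the free model parameters.
Q = build_rate_matrix(mean_log_rates, components, [0.5, -0.2, 0.1], freqs)
```

In a real analysis the coefficients would be estimated by maximum likelihood on each data set, alongside branch lengths, rather than fixed by hand as they are here.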