Lartillot Nicolas, Philippe Hervé
Canadian Institute for Advanced Research, Département de Biochimie, Université de Montréal, Montréal, Québec, Canada.
Mol Biol Evol. 2004 Jun;21(6):1095-109. doi: 10.1093/molbev/msh112. Epub 2004 Mar 10.
Most current models of sequence evolution assume that all sites of a protein evolve under the same substitution process, characterized by a 20 x 20 substitution matrix. Here, we propose to relax this assumption by developing a Bayesian mixture model that allows the amino-acid replacement pattern at different sites of a protein alignment to be described by distinct substitution processes. Our model, named CAT, assumes the existence of distinct processes (or classes) differing by their equilibrium frequencies over the 20 residues. Through the use of a Dirichlet process prior, the total number of classes and their respective amino-acid profiles, as well as the affiliations of each site to a given class, are all free variables of the model. In this way, the CAT model is able to adapt to the complexity actually present in the data, and it yields an estimate of the substitutional heterogeneity through the posterior mean number of classes. We show that a significant level of heterogeneity is present in the substitution patterns of proteins, and that the standard one-matrix model fails to account for this heterogeneity. By evaluating the Bayes factor, we demonstrate that the standard model is outperformed by CAT on all of the data sets which we analyzed. Altogether, these results suggest that the complexity of the pattern of substitution of real sequences is better captured by the CAT model, offering the possibility of studying its impact on phylogenetic reconstruction and its connections with structure-function determinants.
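As an illustration of the Dirichlet process prior described in the abstract, the following is a minimal sketch that draws site-to-class affiliations and per-class amino-acid profiles from such a prior, using the Chinese restaurant process representation. It covers only the prior, not the posterior sampling needed to fit the model to an alignment. The function name `sample_cat_prior`, the concentration parameter `alpha`, and the flat Dirichlet base distribution over the 20 residues are assumptions made for this example and are not taken from the abstract.

```python
import numpy as np

def sample_cat_prior(n_sites, alpha=1.0, n_states=20, seed=0):
    """Draw class affiliations and equilibrium-frequency profiles from a
    Dirichlet process prior (Chinese restaurant process representation).

    alpha and the flat Dirichlet base distribution are illustrative
    assumptions, not values stated in the abstract.
    """
    rng = np.random.default_rng(seed)
    assignments = np.empty(n_sites, dtype=int)
    counts = []    # number of sites currently affiliated with each class
    profiles = []  # one frequency vector over the residues per class

    for i in range(n_sites):
        # An existing class k is chosen with probability proportional to
        # counts[k]; a new class is opened with probability proportional
        # to alpha.
        weights = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(counts):
            # New class: draw its profile from the (assumed) flat Dirichlet.
            counts.append(0)
            profiles.append(rng.dirichlet(np.ones(n_states)))
        counts[k] += 1
        assignments[i] = k

    return assignments, np.array(profiles)

assignments, profiles = sample_cat_prior(n_sites=200, alpha=2.0)
print("number of classes drawn:", profiles.shape[0])
```

Because the number of classes is not fixed in advance, it grows with the number of sites and with `alpha`. In the fitted model the analogous quantity is the posterior mean number of classes, which the abstract uses as its estimate of substitutional heterogeneity.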