Department of Biology, Institute of Biochemistry, Carleton University, Ottawa, Canada.
School of Mathematics and Statistics, Carleton University, 209 Nesbitt Biology Building, 1125 Colonel By Drive, Ottawa, ON, K1A 0C6, Canada.
J Mol Evol. 2022 Dec;90(6):468-475. doi: 10.1007/s00239-022-10076-y. Epub 2022 Oct 7.
Models of amino acid replacement are central to modern phylogenetic inference, particularly when dealing with deep evolutionary relationships. Traditionally, a single, empirically derived matrix was used, so as to keep the degrees of freedom of the inference low and the focus on topology. With the growing size of data sets, however, an amino acid-level general-time-reversible matrix has become increasingly feasible, treating amino acid exchangeabilities and frequencies as free parameters. Moreover, models based on mixtures of multiple matrices are increasingly used to account for across-site heterogeneities in the amino acid requirements of proteins. Such models exist as finite empirically derived amino acid profile (or frequency) mixtures, free finite mixtures, and free Dirichlet process-based infinite mixtures. All of these approaches are typically combined with a gamma-distributed rates-across-sites model. Despite the availability of these different aspects of modeling the amino acid replacement process, no study has systematically quantified their relative contributions to predictive power on real data. Here, we use Bayesian cross-validation to establish a detailed comparison, while activating/deactivating each modeling aspect. For most data sets studied, we find that amino acid mixture models can outrank all single-matrix models, even when the latter include gamma-distributed rates and the former do not. We also find that free finite mixtures consistently outperform empirical finite mixtures. Finally, the Dirichlet process-based mixture model tends to outperform all other approaches.
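As an illustration of what "exchangeabilities and frequencies as free parameters" means in practice, the following minimal sketch builds a reversible rate matrix from a symmetric exchangeability table and a frequency profile, using the standard general-time-reversible form in which the rate from state i to state j is the exchangeability of the pair times the frequency of j. It then builds several such matrices from different profiles, as one might for a small profile mixture. This is not the authors' implementation: the function names, the three-component mixture, the choice to share one exchangeability table across components, and the random placeholder values are all assumptions made for the example.

```python
import random


def build_rate_matrix(exch, freqs):
    """Return a reversible rate matrix with q[i][j] = exch[i][j] * freqs[j] (i != j).

    The matrix is rescaled so that the expected substitution rate at
    stationarity equals 1, a common normalization convention.
    """
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exch[i][j] * freqs[j]
        # The diagonal is still 0 here, so this is minus the off-diagonal row sum.
        q[i][i] = -sum(q[i])
    scale = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[value / scale for value in row] for row in q]


def random_profile(n, rng):
    """Draw a hypothetical frequency profile (positive entries summing to 1)."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
n_states = 20  # the 20 amino acids

# Placeholder symmetric exchangeabilities (190 free values for 20 states).
exch = [[0.0] * n_states for _ in range(n_states)]
for i in range(n_states):
    for j in range(i + 1, n_states):
        exch[i][j] = exch[j][i] = rng.random() + 0.01

# Hypothetical three-component profile mixture: one matrix per profile,
# all sharing the same exchangeabilities (an assumption of this sketch).
profiles = [random_profile(n_states, rng) for _ in range(3)]
matrices = [build_rate_matrix(exch, profile) for profile in profiles]
```

Each resulting matrix has rows that sum to zero and satisfies detailed balance with respect to its own profile, which is what makes the process time-reversible. A single-matrix model corresponds to using just one such matrix for every site.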