Department of Biology, Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA.
Mol Biol Evol. 2018 Sep 1;35(9):2307-2317. doi: 10.1093/molbev/msy127.
The relative evolutionary rates at individual sites in proteins are informative measures of conservation or adaptation. Often used as evolutionarily aware conservation scores, relative rates reveal key functional or strongly selected residues. Estimating rates in a phylogenetic context requires specifying a protein substitution model, which is typically a phenomenological model trained on a large empirical data set. A strong emphasis has traditionally been placed on selecting the "best-fit" model, with the implicit understanding that suboptimal or otherwise ill-fitting models might bias inferences. However, the pervasiveness and degree of such bias have not been systematically examined. We investigated how model choice impacts site-wise relative rates in a large set of empirical protein alignments. We compared models designed for use on any general protein, models designed for specific domains of life, and the simple equal-rates Jukes-Cantor-style model (JC). As expected, information-theoretic measures showed overwhelming evidence that some models fit the data decidedly better than others. By contrast, estimates of site-specific evolutionary rates were impressively insensitive to the substitution model used, revealing an unexpected degree of robustness to potential model misspecification. A deeper examination of the fewer than 5% of sites for which model inferences differed in a meaningful way showed that the JC model could uniquely identify rapidly evolving sites that models with empirically derived exchangeabilities failed to detect. We conclude that relative protein rates appear robust to the applied substitution model, and any sensible model of protein evolution, regardless of its fit to the data, should produce broadly consistent evolutionary rates.
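The comparison described in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say which information-theoretic measure or which agreement statistic was used, so AIC and Spearman rank correlation are assumed here, and the function names, model labels, and rate values are all hypothetical.

```python
from statistics import mean


def relative_rates(rates):
    """Scale per-site rates so that their mean across the alignment is 1."""
    m = mean(rates)
    return [r / m for r in rates]


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower values indicate better fit."""
    return 2 * n_params - 2 * log_likelihood


# Made-up per-site rates for one alignment under two hypothetical models.
rates_model_a = [0.2, 0.5, 1.0, 2.0, 4.0]
rates_model_b = [0.1, 0.6, 0.9, 2.5, 3.0]

rel_a = relative_rates(rates_model_a)
rel_b = relative_rates(rates_model_b)
agreement = spearman(rel_a, rel_b)
```

In this toy input the two models rank the sites identically, so the rank correlation is at its maximum even though the raw rates differ. That is the kind of outcome the abstract reports: models can differ sharply in fit (for example, in AIC) while yielding broadly consistent site-wise relative rates.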