Arenas Miguel, Sánchez-Cobos Agustin, Bastolla Ugo
Department of Cell Biology and Immunology, Centro de Biología Molecular Severo Ochoa (CSIC-UAM), Universidad Autónoma de Madrid, Madrid, Spain.
Mol Biol Evol. 2015 Aug;32(8):2195-207. doi: 10.1093/molbev/msv085. Epub 2015 Apr 2.
Despite intense work, incorporating constraints on protein native structures into mathematical models of molecular evolution remains difficult, because most models and programs assume that protein sites evolve independently, whereas protein stability is maintained by interactions between sites. Here, we address this problem by developing a new mean-field substitution model that generates independent site-specific amino acid distributions with constraints on the stability of the native state against both unfolding and misfolding. The model depends on a background distribution of amino acids and one selection parameter, which we fix by maximizing the likelihood of the observed protein sequence. The analytic solution of the model shows that the main determinant of the site-specific distributions is the number of native contacts of the site and that the most variable sites are those with an intermediate number of native contacts. The mean-field models that take misfolded conformations into account yield a larger likelihood than models that consider only the native state, because their average hydrophobicity is more realistic, and on average they produce stable sequences for most proteins. We evaluated the mean-field model against empirical substitution models on 12 test data sets of different protein families. In all cases, the observed site-specific sequence profiles had a smaller Kullback-Leibler divergence from the mean-field distributions than from the empirical substitution model. Next, we obtained substitution rates by combining the mean-field frequencies with an empirical substitution model. The resulting mean-field substitution model assigns a larger likelihood than the empirical model to all studied families when we consider sequences with identity larger than 0.35, plausibly a condition that enforces conservation of the native structure across the family.
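As an illustration of the profile comparison described above, the following is a minimal sketch of the Kullback-Leibler divergence between an observed site-specific amino acid profile and a model distribution. The function names, the pseudocount, and the averaging over sites are assumptions of this sketch and are not taken from the paper or from Prot_Evol.

```python
import math


def kl_divergence(observed, model, pseudocount=1e-9):
    """Kullback-Leibler divergence D(observed || model), in nats.

    Both arguments are sequences of probabilities over the same alphabet
    (for proteins, the 20 amino acids). The pseudocount is an assumption
    of this sketch; it only serves to avoid log(0).
    """
    if len(observed) != len(model):
        raise ValueError("distributions must have the same length")
    # Smooth and renormalize so that both are proper distributions.
    p = [x + pseudocount for x in observed]
    q = [x + pseudocount for x in model]
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]
    q = [x / q_sum for x in q]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def mean_profile_divergence(observed_profile, model_profile):
    """Average per-site divergence over all sites (hypothetical summary)."""
    per_site = [
        kl_divergence(obs, mod)
        for obs, mod in zip(observed_profile, model_profile)
    ]
    return sum(per_site) / len(per_site)
```

A smaller value means the model distribution is closer to the observed profile; the divergence is zero only when the two distributions coincide.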
We found that the mean-field model performs better than other structurally constrained models of similar or higher complexity. Compared with the much more complex model recently developed by Bordner and Mittelmann, which takes into account pairwise terms in the amino acid distributions and also optimizes the exchangeability matrix, our model performed worse for data with small sequence divergence but better for data with larger sequence divergence. The mean-field model has been implemented in the computer program Prot_Evol, which is freely available at http://ub.cbm.uam.es/software/Prot_Evol.php.
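The abstract mentions obtaining substitution rates by combining site-specific frequencies with an empirical substitution model. One common way to do this, sketched here under stated assumptions, is the time-reversible construction in which the rate from amino acid a to b is the empirical exchangeability times the target frequency of b. This is not necessarily the scheme used by the authors, and the function name and the toy three-state alphabet in the usage note are hypothetical.

```python
def rate_matrix(exchangeability, frequencies):
    """Build a reversible rate matrix Q with Q[a][b] = S[a][b] * pi[b].

    exchangeability: symmetric matrix S (list of lists), e.g. taken from
        an empirical model; the diagonal entries are ignored.
    frequencies: stationary frequencies pi for one site, summing to 1.
    The result is rescaled so that the mean rate at stationarity is 1.
    """
    n = len(frequencies)
    q = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b:
                q[a][b] = exchangeability[a][b] * frequencies[b]
        # The diagonal is still 0.0 here, so this makes the row sum to zero.
        q[a][a] = -sum(q[a])
    mean_rate = -sum(frequencies[a] * q[a][a] for a in range(n))
    if mean_rate <= 0:
        raise ValueError("mean substitution rate must be positive")
    return [[x / mean_rate for x in row] for row in q]
```

For example, `rate_matrix([[0, 1, 1], [1, 0, 1], [1, 1, 0]], [0.5, 0.3, 0.2])` returns a 3x3 matrix whose rows sum to zero and that satisfies detailed balance with respect to the given frequencies. In a real application the alphabet would have 20 states and a separate matrix would be built for each site.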