Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Computational and Quantitative Biology - UMR7238, 75005 Paris, France.
Mol Biol Evol. 2018 Apr 1;35(4):1018-1027. doi: 10.1093/molbev/msy007.
Global coevolutionary models of homologous protein families, as constructed by direct coupling analysis (DCA), have recently gained popularity, in particular because of their capacity to accurately predict residue-residue contacts from sequence information alone, and thereby to facilitate tertiary and quaternary protein structure prediction. More recently, they have also been used to predict fitness effects of amino-acid substitutions in proteins, and to predict evolutionarily conserved protein-protein interactions. These models are based on two currently unjustified hypotheses: 1) correlations in the amino-acid usage of different positions result collectively from networks of direct couplings; and 2) pairwise couplings are sufficient to capture the amino-acid variability. Here, we propose a highly precise inference scheme based on Boltzmann-machine learning, which allows us to systematically address these hypotheses. We show how correlations are built up in a highly collective way by a large number of coupling paths, which are based on the protein's three-dimensional structure. We further find that pairwise coevolutionary models capture the collective residue variability across homologous proteins even for quantities that are not imposed by the inference procedure, such as three-residue correlations, the clustered structure of protein families in sequence space, or the sequence distances between homologs. These findings strongly suggest that pairwise coevolutionary models are actually sufficient to accurately capture the residue variability in homologous protein families.
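As a rough illustration of what Boltzmann-machine learning of a pairwise model involves, the Python sketch below fits site fields and pairwise couplings of a small Potts-type model by gradient ascent on the likelihood, so that the model's one- and two-site frequencies match target frequencies. It is a toy under stated assumptions, not the inference scheme of the paper: the names `one_hot_states`, `pair_mask`, `model_probs`, `moments`, and `fit_boltzmann_machine`, the learning rate, and the step count are all hypothetical; the target frequencies come from a synthetic pairwise model rather than from a sequence alignment; and model averages are computed by exact enumeration of all sequences, which is only feasible for very short sequences (realistic applications typically estimate them by Monte Carlo sampling).

```python
import itertools
import numpy as np

def one_hot_states(L, q):
    """Enumerate all q**L sequences and one-hot encode them (one row each)."""
    seqs = np.array(list(itertools.product(range(q), repeat=L)))
    X = np.zeros((len(seqs), L * q))
    X[np.arange(len(seqs))[:, None], np.arange(L) * q + seqs] = 1.0
    return X

def pair_mask(L, q):
    """1 for entries coupling two different sites, 0 within a single site."""
    return 1.0 - np.kron(np.eye(L), np.ones((q, q)))

def model_probs(h, J, X):
    """Exact distribution P(s) ~ exp(sum_i h_i(s_i) + sum_{i<j} J_ij(s_i, s_j)).

    J is stored as a symmetric (L*q, L*q) matrix, hence the factor 0.5.
    """
    log_w = X @ h + 0.5 * ((X @ J) * X).sum(axis=1)
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    return w / w.sum()

def moments(p, X):
    """One-site and two-site frequencies under the distribution p."""
    return p @ X, X.T @ (p[:, None] * X)

def fit_boltzmann_machine(f1, f2, L, q, lr=0.1, steps=20000):
    """Gradient ascent on the likelihood: move parameters along (target - model)."""
    X = one_hot_states(L, q)
    mask = pair_mask(L, q)
    h = np.zeros(L * q)
    J = np.zeros((L * q, L * q))
    for _ in range(steps):
        p1, p2 = moments(model_probs(h, J, X), X)
        h += lr * (f1 - p1)
        J += lr * mask * (f2 - p2)  # only couplings between distinct sites
    return h, J

# Toy setting (assumption): 4 sites with 3 states each, synthetic pairwise target.
L, q = 4, 3
rng = np.random.default_rng(0)
X = one_hot_states(L, q)
A = rng.normal(0.0, 0.3, (L * q, L * q))
h_true = rng.normal(0.0, 0.3, L * q)
J_true = 0.5 * (A + A.T) * pair_mask(L, q)
p_true = model_probs(h_true, J_true, X)
f1, f2 = moments(p_true, X)  # stand-in for empirical alignment frequencies

h_fit, J_fit = fit_boltzmann_machine(f1, f2, L, q)
p_fit = model_probs(h_fit, J_fit, X)
print("max |p_fit - p_true|:", np.abs(p_fit - p_true).max())
```

Because the synthetic target here is itself a pairwise model, the fitted model reproduces higher-order statistics by construction. The toy therefore illustrates the fitting procedure only and says nothing about whether pairwise models suffice for real protein families.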