Quantitative and Computational Biology Group, Max-Planck Institute for Biophysical Chemistry, Göttingen, Germany.
PLoS Comput Biol. 2018 Nov 5;14(11):e1006526. doi: 10.1371/journal.pcbi.1006526. eCollection 2018 Nov.
Compensatory mutations between protein residues in physical contact can manifest themselves as statistical couplings between the corresponding columns in a multiple sequence alignment (MSA) of the protein family. Conversely, large coupling coefficients predict residue contacts. Methods for de novo protein structure prediction based on this approach are becoming increasingly reliable. Their main limitation is the strong systematic and statistical noise in the estimation of coupling coefficients, which has so far limited their application to very large protein families. While most research has focused on improving predictions by adding external information, little progress has been made in improving the statistical procedure at the core, because our lack of understanding of the sources of noise poses a major obstacle. First, we show theoretically that the expectation value of the coupling score assuming no coupling is proportional to the product of the square roots of the column entropies, and we propose a simple entropy bias correction (EntC) that subtracts out this expectation value. Second, we show that the average product correction (APC) includes the correction of the entropy bias, partly explaining its success. Third, we have developed CCMgen, the first method for simulating protein evolution and generating realistic synthetic MSAs with pairwise statistical residue couplings. Fourth, to learn exact statistical models that reliably reproduce observed alignment statistics, we developed CCMpredPy, an implementation of the persistent contrastive divergence (PCD) method for exact inference. Fifth, we demonstrate how CCMgen and CCMpredPy can facilitate the development of contact prediction methods by analysing the systematic noise contributions from phylogeny and entropy. Using the entropy bias correction, we can disentangle both sources of noise and find that entropy contributes roughly twice as much noise as phylogeny.
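The two corrections named in the abstract can be illustrated with a minimal NumPy sketch. It is not the CCMpredPy implementation. The function names (`column_entropies`, `apc`, `entropy_correction`), the integer encoding of the MSA, the use of plain Shannon entropy over all states, and the least-squares fit of the scaling coefficient `alpha` are all assumptions made for illustration. The abstract states only that the expected score under no coupling is proportional to the product of the square roots of the column entropies.

```python
import numpy as np


def column_entropies(msa, n_states=20):
    """Shannon entropy (in nats) of each column of an integer-encoded MSA.

    msa: array of shape (n_sequences, n_columns) with values in [0, n_states).
    The encoding and the entropy definition are illustrative assumptions.
    """
    n_cols = msa.shape[1]
    entropies = np.zeros(n_cols)
    for i in range(n_cols):
        counts = np.bincount(msa[:, i], minlength=n_states).astype(float)
        freqs = counts / counts.sum()
        nonzero = freqs > 0
        entropies[i] = -np.sum(freqs[nonzero] * np.log(freqs[nonzero]))
    return entropies


def apc(scores):
    """Average product correction: c_ij - mean_i * mean_j / overall mean.

    Means are taken over off-diagonal entries only.
    """
    n = scores.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    row_mean = (scores * off_diag).sum(axis=1) / (n - 1)
    total_mean = scores[off_diag].mean()
    corrected = scores - np.outer(row_mean, row_mean) / total_mean
    np.fill_diagonal(corrected, 0.0)
    return corrected


def entropy_correction(scores, entropies):
    """Subtract alpha * sqrt(s_i) * sqrt(s_j) from each coupling score.

    alpha is fitted here by least squares over off-diagonal entries;
    this fitting choice is an assumption of the sketch.
    """
    n = scores.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    root = np.sqrt(entropies)
    bias = np.outer(root, root)
    alpha = (scores[off_diag] * bias[off_diag]).sum() / (bias[off_diag] ** 2).sum()
    corrected = scores - alpha * bias
    np.fill_diagonal(corrected, 0.0)
    return corrected, alpha
```

Given a symmetric matrix of raw coupling scores and an integer-encoded alignment, one would call `entropy_correction(scores, column_entropies(msa))` and compare the result with `apc(scores)`. A score matrix that consists purely of the entropy term is reduced to zero by the entropy correction in this sketch.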