Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America.
PLoS One. 2011;6(12):e28766. doi: 10.1371/journal.pone.0028766. Epub 2011 Dec 7.
The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing.
In this paper we ask whether we can infer evolutionary constraints from a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy.
We quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different fold classes, ranging in size from 50 to 260 residues, including a G-protein coupled receptor. These blinded inferences are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution signals provide sufficient information to determine accurate 3D protein structure to 2.7-4.8 Å Cα-RMSD error relative to the observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org).
This discovery provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the universe of protein structures, new strategies in protein and drug design, and the identification of functional genetic variants in normal and disease genomes.
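To make the coupling-inference step described in the abstract more concrete, the following is a minimal sketch of one common way to approximate maximum entropy pair couplings from alignment statistics: a mean-field approximation, in which the couplings are read off the inverse of the regularized covariance matrix of the alignment. The abstract does not state which approximation the authors use, so the mean-field choice, the function name `mean_field_couplings`, the `pseudocount` value, the integer encoding of residues, and the Frobenius-norm pair score are all assumptions made here for illustration. The toy alignment is synthetic.

```python
import numpy as np


def mean_field_couplings(msa, q, pseudocount=0.5):
    """Score column pairs of an alignment by mean-field coupling strength.

    msa: (n_seq, n_col) integer array with symbols encoded as 0..q-1
         (hypothetical encoding chosen for this sketch).
    Returns an (n_col, n_col) symmetric matrix of pair scores.
    """
    n_seq, n_col = msa.shape
    k = q - 1  # drop the last symbol so the covariance matrix is invertible

    # One-hot encode each column, then flatten to (n_seq, n_col * k).
    onehot = np.eye(q)[msa][:, :, :k]
    flat = onehot.reshape(n_seq, n_col * k)

    # Single-site and pair frequencies, the statistics that constrain the model.
    fi = flat.mean(axis=0)
    fij = flat.T @ flat / n_seq

    # Pseudocount regularization: mix with a uniform, independent background.
    fi = (1.0 - pseudocount) * fi + pseudocount / q
    fij = (1.0 - pseudocount) * fij + pseudocount / q**2
    for i in range(n_col):
        s = slice(i * k, (i + 1) * k)
        fij[s, s] = np.diag(fi[s])  # within a column, symbols are exclusive

    # Mean-field approximation: couplings are minus the inverse covariance.
    cov = fij - np.outer(fi, fi)
    couplings = -np.linalg.inv(cov)

    # Collapse each (k x k) coupling block to a single score per column pair.
    blocks = couplings.reshape(n_col, k, n_col, k)
    scores = np.sqrt((blocks**2).sum(axis=(1, 3)))
    np.fill_diagonal(scores, 0.0)
    return scores


if __name__ == "__main__":
    # Synthetic toy alignment: column 3 is forced to co-vary with column 0.
    rng = np.random.default_rng(0)
    toy = rng.integers(0, 4, size=(400, 5))
    toy[:, 3] = toy[:, 0]
    s = mean_field_couplings(toy, q=4)
    i, j = np.unravel_index(np.argmax(s), s.shape)
    print("top-scoring pair:", int(i), int(j))
```

In a real application the highest-scoring pairs would be treated as candidate residue-residue contacts. The scoring function and regularization in this sketch are illustrative and should not be read as the EVfold implementation.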