Department of Electrical Engineering and Computer Science, University of California, Berkeley, 94720, USA.
Proteins. 2010 May 1;78(6):1583-93. doi: 10.1002/prot.22677.
De novo protein structure prediction requires location of the lowest energy state of the polypeptide chain among a vast set of possible conformations. Powerful approaches include conformational space annealing, in which the search progressively focuses on the most promising regions of conformational space, and genetic algorithms, in which features of the best conformations thus far identified are recombined. We describe a new approach that combines the strengths of these two approaches. Protein conformations are projected onto a discrete feature space which includes backbone torsion angles, secondary structure, and beta pairings. For each of these there is one "native" value: the one found in the native structure. We begin with a large number of conformations generated in independent Monte Carlo structure prediction trajectories from Rosetta. Native values for each feature are predicted from the frequencies of feature value occurrences and the energy distribution in conformations containing them. A second round of structure prediction trajectories is then guided by the predicted native feature distributions. We show that native features can be predicted at much higher than background rates, and that using the predicted feature distributions improves structure prediction in a benchmark of 28 proteins. The advantages of our approach are that features from many different input structures can be combined simultaneously without producing atomic clashes or otherwise physically inviable models, and that the features being recombined have a relatively high chance of being correct.
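The abstract does not spell out how frequency and energy are combined into a predicted feature distribution, so the following is only a minimal sketch of the general idea, not the authors' predictor. It assumes each first-round conformation is reduced to an energy plus a dictionary of discrete feature values, and it weights every conformation by a Boltzmann-like factor before counting. The function name `predict_feature_distributions`, the `temperature` parameter, the feature names, and the toy energies are all hypothetical.

```python
import math
from collections import defaultdict


def predict_feature_distributions(decoys, temperature=1.0):
    """Estimate a probability distribution over values for each feature.

    decoys: list of (energy, {feature_name: value}) pairs.
    Each decoy contributes exp(-(E - E_min) / temperature) to every
    feature value it contains (an illustrative weighting, assumed here),
    so a value scores well if it is frequent, low in energy, or both.
    """
    if not decoys:
        return {}
    e_min = min(energy for energy, _ in decoys)
    totals = defaultdict(lambda: defaultdict(float))
    for energy, features in decoys:
        weight = math.exp(-(energy - e_min) / temperature)
        for name, value in features.items():
            totals[name][value] += weight
    # Normalize the weighted counts of each feature to sum to 1.
    distributions = {}
    for name, counts in totals.items():
        norm = sum(counts.values())
        distributions[name] = {v: w / norm for v, w in counts.items()}
    return distributions


# Toy input: four decoys, one secondary-structure feature and one
# torsion-bin feature at a hypothetical residue 10.
decoys = [
    (-120.0, {"ss_10": "H", "torsion_10": "A"}),
    (-118.5, {"ss_10": "H", "torsion_10": "B"}),
    (-101.0, {"ss_10": "E", "torsion_10": "B"}),
    (-100.0, {"ss_10": "E", "torsion_10": "B"}),
]
dists = predict_feature_distributions(decoys, temperature=2.0)
best = {name: max(d, key=d.get) for name, d in dists.items()}
print(best)
```

In this toy input the torsion bin "B" is the most frequent value, but "A" occurs in the lowest-energy decoy and receives the larger weight, which illustrates why energy information can change a purely frequency-based prediction. In a second round, distributions like `dists` could be used to bias sampling toward the predicted values.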