Betancourt Marcos R
Department of Physics, Indiana University Purdue University Indianapolis, Indianapolis, Indiana 46202, USA.
Proteins. 2009 Jul;76(1):72-85. doi: 10.1002/prot.22320.
Protein knowledge-based potentials are effective free energies obtained from databases of known protein structures. They are used to parameterize coarse-grained protein models in many folding simulation and structure prediction methods. Two common approaches are used in the derivation of knowledge-based potentials. One assumes that the energy parameters optimize the native structure stability. The other assumes that interaction events are related to their energies according to the Boltzmann distribution, and that they are distributed independently of other events, that is, the quasi-chemical approximation. Here, these assumptions are systematically tested by extracting contact energies from artificial databases of lattice proteins with predefined pairwise contact energies. Databases of protein sequences are designed either to satisfy the Boltzmann distribution at high or low temperatures or to simultaneously optimize the native stability and folding kinetics. It is found that the quasi-chemical approximation, with the ideal reference state, accurately reproduces the true energies for high-temperature Boltzmann-distributed sequences (weakly interacting residues), but less accurately at low temperatures, where the sequences correspond to energy minima and the residues are strongly interacting. To overcome this problem, an iterative procedure for Boltzmann-distributed sequences is introduced, which accounts for interacting residue correlations and eliminates the need for the quasi-chemical approximation. In this case, the energies are accurately reproduced at any ensemble temperature. However, when the database of sequences designed for optimal stability and kinetics is used, the energy correlation is less than optimal using either method, exhibiting random and systematic deviations from linearity.
Therefore, the assumption that native structures are maximally stable or that sequences are determined according to the Boltzmann distribution seems to be inadequate for obtaining accurate energies. The limited number of sequences in the database and the inhomogeneous concentration of amino acids from one structure to another do not seem to be major obstacles for improving the quality of the extracted pairwise energies, with the exception of repulsive interactions.
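As an illustration of the quasi-chemical approximation described in the abstract, the following minimal Python sketch converts observed contact counts into effective contact energies through the Boltzmann relation e_ij = -kT ln(N_ij,observed / N_ij,expected). It is a sketch under stated assumptions, not the procedure used in the paper: the function name `quasi_chemical_energies`, the input layout, and the random-mixing reference state (expected counts computed from the fraction of contact ends belonging to each residue type) are all hypothetical choices made here for illustration. The paper's "ideal reference state" and its iterative procedure may be defined differently.

```python
import math


def quasi_chemical_energies(counts, kT=1.0):
    """Estimate contact energies from contact counts (illustrative sketch).

    counts[i][j] is the observed number of contacts between residue
    types i and j. Only the upper triangle (i <= j) is read, so each
    unordered pair is counted once.
    """
    n = len(counts)

    # Total number of contacts in the database.
    total = sum(counts[i][j] for i in range(n) for j in range(i, n))

    # Every contact has two ends; tally how many ends belong to each type.
    ends = [0.0] * n
    for i in range(n):
        for j in range(i, n):
            ends[i] += counts[i][j]
            ends[j] += counts[i][j]
    frac = [e / (2.0 * total) for e in ends]

    energies = {}
    for i in range(n):
        for j in range(i, n):
            # Assumed reference state: contacts form at random, so unlike
            # pairs carry a factor of 2 (i-j and j-i are the same contact).
            multiplicity = 1 if i == j else 2
            expected = multiplicity * frac[i] * frac[j] * total
            if counts[i][j] > 0 and expected > 0:
                energies[(i, j)] = -kT * math.log(counts[i][j] / expected)
    return energies


# Two residue types whose like-like contacts are over-represented.
example = quasi_chemical_energies([[2, 1], [1, 2]])
```

In this example the like-like pairs (0, 0) and (1, 1) receive negative (attractive) energies and the unlike pair (0, 1) a positive (repulsive) one, because like-like contacts occur more often than random mixing predicts. Pairs with zero observed contacts are skipped, since the logarithm is undefined for them. Because each pair is treated independently, a sketch like this ignores the correlations between interacting residues that the abstract reports as limiting accuracy at low ensemble temperatures.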