Faculty of Chemistry, University of Gdańsk, ul. Wita Stwosza 63, 80-308 Gdańsk, Poland.
Laboratory of Biopolymer Structure, Intercollegiate Faculty of Biotechnology, University of Gdańsk and Medical University of Gdańsk, Kładki 24, 80-922 Gdańsk, Poland.
J Chem Inf Model. 2015 Sep 28;55(9):2050-70. doi: 10.1021/acs.jcim.5b00395. Epub 2015 Aug 20.
A new approach to the calibration of force fields is proposed, in which the force-field parameters are obtained by maximum-likelihood fitting of the calculated conformational ensembles to the experimental ensembles of the training system(s). The maximum-likelihood function is composed of the logarithms of the Boltzmann probabilities of the experimental conformations, calculated with the current energy function. Because the theoretical distribution is available only as a set of simulated conformations, the contribution of a given experimental conformation to the target function is obtained by summing the contributions of all simulated conformations, each weighted by a Gaussian in its distance from that experimental conformation. In contrast to earlier methods for force-field calibration, the approach does not suffer from the arbitrariness of dividing the decoy set into native-like and non-native structures; however, if such a division is made instead of using Gaussian weights, application of the maximum-likelihood method results in the well-known energy-gap maximization. The computational procedure consists of cycles of decoy generation and maximum-likelihood-function optimization, which are iterated until convergence is reached. The method was tested with Gaussian distributions and then applied to the physics-based coarse-grained UNRES force field for proteins. The NMR structures of the tryptophan cage, a small α-helical protein, determined at three temperatures (T = 280, 305, and 313 K) by Hałabis et al. (J. Phys. Chem. B 2012, 116, 6898-6907), were used. Multiplexed replica-exchange molecular dynamics was used to generate the decoys. The iterative procedure exhibited steady convergence.
Three variants of optimization were tried: optimization of the energy-term weights alone, using the experimental ensemble of the folded protein only at T = 280 K (run 1); optimization of the energy-term weights, using the experimental ensembles at all three temperatures (run 2); and optimization of the energy-term weights together with the coefficients of the torsional and multibody energy terms, using the experimental ensembles at all three temperatures (run 3). The force fields were subsequently tested with a set of 14 α-helical and two α + β proteins. Compared with run 2, optimization run 1 resulted in better agreement with the experimental ensemble at T = 280 K and in comparable performance on the test set, but in poorer agreement of the calculated folding temperature with the experimental folding temperature. Optimization run 3 resulted in the best fit of the calculated ensembles to the experimental ones for the tryptophan cage but in much poorer performance on the test set, suggesting that use of a small α-helical protein for extensive force-field calibration resulted in overfitting of the data for this protein at the expense of transferability. The optimized force field resulting from run 2 was found to fold 13 of the 14 tested α-helical proteins and one small α + β protein with the correct topologies; the average structures of 10 of them were predicted with accuracies of about 5 Å Cα root-mean-square deviation or better. Test simulations with an additional set of 12 α-helical proteins demonstrated that this force field performed better on α-helical proteins than the previous parametrizations of UNRES. The proposed approach is applicable to any problem of maximum-likelihood parameter estimation in which the contributions to the maximum-likelihood function cannot be evaluated at the experimental points and the dimension of the configurational space is too high to construct histograms of the experimental distributions.
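To make the general idea concrete, the following is a minimal one-dimensional sketch of a maximum-likelihood target with Gaussian-weighted decoys, in the spirit of the Gaussian-distribution test mentioned in the abstract. It is an illustration under stated assumptions, not the procedure used for UNRES: the harmonic energy, the reference sampling distribution, the kernel width `sigma`, the reweighting of decoys from a reference energy, the grid search, and all function names are hypothetical choices made here. A single cycle is shown; the abstract describes iterating decoy generation and optimization until convergence.

```python
import math
import random


def energy(x, theta, width=0.5):
    """Toy harmonic energy (assumption): its Boltzmann distribution is N(theta, width^2)."""
    return (x - theta) ** 2 / (2.0 * width ** 2)


def gaussian_weights(decoys, experimental, sigma):
    """w[k][i]: Gaussian weight in the distance between experimental point k and decoy i."""
    return [[math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2)) for x in decoys]
            for y in experimental]


def log_likelihood(theta, decoys, ref_energies, weights, beta=1.0):
    """Sum over experimental points of log(sum_i w_ki * p_i(theta)).

    p_i(theta) is the Boltzmann probability of decoy i under the trial energy,
    estimated by reweighting decoys that were sampled with a reference energy.
    """
    log_b = [-beta * (energy(x, theta) - u0) for x, u0 in zip(decoys, ref_energies)]
    shift = max(log_b)  # subtract the maximum for numerical stability
    boltz = [math.exp(v - shift) for v in log_b]
    z = sum(boltz)
    probs = [b / z for b in boltz]
    total = 0.0
    for w_k in weights:
        # Tiny floor guards against log(0) if no decoy lies near an experimental point.
        total += math.log(sum(w * p for w, p in zip(w_k, probs)) + 1e-300)
    return total


def fit(seed=0):
    rng = random.Random(seed)
    # Decoys: samples from a broad reference ensemble, N(0, 1.5^2) (assumption).
    decoys = [rng.gauss(0.0, 1.5) for _ in range(500)]
    ref_energies = [x ** 2 / (2.0 * 1.5 ** 2) for x in decoys]
    # "Experimental" ensemble: samples from N(1, 0.5^2); the true theta is 1.
    experimental = [rng.gauss(1.0, 0.5) for _ in range(50)]
    weights = gaussian_weights(decoys, experimental, sigma=0.1)
    # Simple grid search over theta in [-1, 3] stands in for a real optimizer.
    grid = [-1.0 + 0.05 * i for i in range(81)]
    return max(grid, key=lambda t: log_likelihood(t, decoys, ref_energies, weights))


theta_hat = fit()
print("estimated theta:", theta_hat)
```

Because the experimental points never coincide with the decoys, the Gaussian weights act as a smoothing kernel that lets each experimental point "see" the nearby simulated conformations; no split of the decoys into native-like and non-native classes is needed. In a full procedure one would presumably regenerate the decoys with the updated parameter and repeat the optimization.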