Bass Lewis, Elder Luke H, Folescu Dan E, Forouzesh Negin, Tolokh Igor S, Karpatne Anuj, Onufriev Alexey V
Department of Computer Engineering, Virginia Tech, Blacksburg, Virginia 24061, United States.
Department of Computer Science, Virginia Tech, Blacksburg, Virginia 24061, United States.
J Chem Theory Comput. 2024 Jan 9;20(1):396-410. doi: 10.1021/acs.jctc.3c00981. Epub 2023 Dec 27.
The accuracy of computational models of water is key to atomistic simulations of biomolecules. We propose a computationally efficient way to improve the accuracy of the prediction of hydration free energies (HFEs) of small molecules: the remaining errors of the physics-based models relative to experiment are predicted and mitigated by machine learning (ML) as a post-processing step. Specifically, the trained graph convolutional neural network attempts to identify the "blind spots" in the physics-based model predictions, where the complex physics of aqueous solvation is poorly accounted for, and partially corrects for them. The strategy is explored for five classical solvent models representing various accuracy/speed trade-offs, from the fast analytical generalized Born (GB) to the popular TIP3P explicit solvent model; experimental HFEs of small neutral molecules from the FreeSolv set are used for training and testing. For all of the models, the ML correction reduces the resulting root-mean-square error relative to experiment for HFEs of small molecules, without significant overfitting and with negligible computational overhead. For example, on the test set, the relative accuracy improvement is 47% for the fast analytical GB, making it, after the ML correction, almost as accurate as uncorrected TIP3P. For the TIP3P model, the accuracy improvement is about 39%, bringing the ML-corrected model's accuracy below the 1 kcal/mol threshold. In general, the relative benefit of the ML corrections is smaller for more accurate physics-based models, reaching a lower limit of about 20% relative accuracy gain over the physics-based treatment alone. The proposed strategy of using ML to learn the remaining error of physics-based models offers a distinct advantage over training ML alone directly on reference HFEs: it preserves the correct overall trend, even well outside of the training set.
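The correction scheme described in the abstract is a residual ("delta-learning") post-processing step: an ML model is trained on the difference between experimental HFEs and the physics-based predictions, and the predicted residual is then added back at inference. The sketch below illustrates only that workflow under stated assumptions; the paper itself uses a graph convolutional neural network trained on FreeSolv, whereas here a generic scikit-learn regressor, synthetic features, and names such as `physics_hfe` are hypothetical stand-ins introduced purely for illustration.

```python
# Minimal sketch of the residual (delta-learning) correction idea described above.
# Assumptions: a generic regressor and synthetic data replace the paper's graph
# convolutional network and the FreeSolv molecules; all names here are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins: per-molecule descriptors, "experimental" HFEs, and
# physics-based (e.g. GB or TIP3P) HFE predictions carrying a systematic error.
n_mols, n_feats = 500, 16
X = rng.normal(size=(n_mols, n_feats))                               # molecular features
hfe_exp = X @ rng.normal(size=n_feats) + rng.normal(0, 0.3, n_mols)  # "experiment"
physics_hfe = hfe_exp + 0.5 * X[:, 0] + rng.normal(0, 0.8, n_mols)   # imperfect physics model

X_tr, X_te, exp_tr, exp_te, phys_tr, phys_te = train_test_split(
    X, hfe_exp, physics_hfe, test_size=0.2, random_state=0
)

# Step 1: train the ML corrector on the *residual* of the physics-based
# prediction, not on the experimental HFE directly.
residual_tr = exp_tr - phys_tr
corrector = RandomForestRegressor(n_estimators=200, random_state=0)
corrector.fit(X_tr, residual_tr)

# Step 2: at inference, add the predicted residual to the physics-based HFE.
hfe_corrected = phys_te + corrector.predict(X_te)

rmse = lambda a, b: float(np.sqrt(np.mean((a - b) ** 2)))
print(f"RMSE, physics only : {rmse(phys_te, exp_te):.2f} (synthetic units)")
print(f"RMSE, ML-corrected : {rmse(hfe_corrected, exp_te):.2f} (synthetic units)")
```

Because the model learns only the (typically small, structured) residual, the physics-based prediction still dominates far from the training distribution, which is the trend-preserving advantage the abstract highlights over training ML directly on reference HFEs.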