Machine Learning Research, Bayer A.G., Berlin 13353, Germany.
CONCEPT Lab, Istituto Italiano di Tecnologia (IIT), Via Melen 83, B Block, Genoa 16152, Italy.
J Chem Theory Comput. 2022 Aug 9;18(8):5068-5078. doi: 10.1021/acs.jctc.2c00308. Epub 2022 Jul 15.
Existing computational methods for estimating pKa values in proteins rely on theoretical approximations and lengthy computations. In this work, we use a data set of 6 million theoretically determined pKa shifts to train deep learning models, which are shown to rival the physics-based predictors. These neural networks managed to infer the electrostatic contributions of different chemical groups and learned the importance of solvent exposure and close interactions, including hydrogen bonds. Although trained only using theoretical data, our pKAI+ model displayed the best accuracy in a test set of ∼750 experimental values. Inference times allow speedups of more than 1000× compared to physics-based methods. By combining speed, accuracy, and a reasonable understanding of the underlying physics, our models provide a game-changing solution for fast estimations of macroscopic pKa values from ensembles of microscopic values as well as for many downstream applications such as molecular docking and constant-pH molecular dynamics simulations.
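As a rough illustration of how a predicted pKa shift might be used downstream, the sketch below adds a shift to a reference pKa and converts the result into a protonated fraction at a given pH via the Henderson-Hasselbalch relation. The function names and the numbers in the usage lines are hypothetical and chosen only for illustration; they are not taken from the paper or from the pKAI code.

```python
def shifted_pka(reference_pka: float, predicted_shift: float) -> float:
    """Residue pKa in the protein: reference (solvent) pKa plus the predicted shift."""
    return reference_pka + predicted_shift


def protonated_fraction(pka: float, ph: float) -> float:
    """Henderson-Hasselbalch: fraction of sites in the protonated state at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Illustrative numbers only (assumed, not from the paper):
# a titratable site with reference pKa 4.0 and a predicted shift of +1.5.
pka = shifted_pka(4.0, 1.5)
for ph in (4.0, 5.5, 7.0):
    print(f"pH {ph:.1f}: protonated fraction = {protonated_fraction(pka, ph):.3f}")
```

At a pH equal to the shifted pKa the site is half protonated, which is a convenient sanity check for any predictor output fed into such a calculation.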