Gao Peiyuan, Yang Xiu, Tang Yu-Hang, Zheng Muqing, Andersen Amity, Murugesan Vijayakumar, Hollas Aaron, Wang Wei
Pacific Northwest National Laboratory, Richland 99352, USA.
Department of Industrial and Systems Engineering, Lehigh University, Bethlehem, PA 18015, USA.
Phys Chem Chem Phys. 2021 Nov 10;23(43):24892-24904. doi: 10.1039/d1cp04475c.
The solvation free energy of organic molecules is a critical parameter in determining emergent properties such as solubility, liquid-phase equilibrium constants, pKa, and redox potentials in an organic redox flow battery. In this work, we present a machine learning (ML) model that learns and predicts the aqueous solvation free energy of an organic molecule using Gaussian process regression with a new molecular graph kernel. To investigate how well the ML model captures the electrostatic interaction, the nonpolar interaction contribution of the solvent, and the conformational entropy of the solute in the solvation free energy, three data sets, with implicit or explicit water solvent models and with the contribution of the conformational entropy of the solute, are tested. We demonstrate that our ML model can predict the solvation free energy of molecules at chemical accuracy, with a mean absolute error of less than 1 kcal/mol, for subsets of the QM9 dataset and the FreeSolv database. To address the general data scarcity problem for a graph-based ML model, we propose a dimension reduction algorithm based on the distance between molecular graphs, which can be used to examine the diversity of a molecular data set. It provides a promising way to build a minimum training set that improves prediction for test sets where the space of molecular structures is predetermined.
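As a rough illustration of the two ingredients named in the abstract, the sketch below shows Gaussian process regression driven by a precomputed kernel matrix, and a distance between samples derived from that same kernel. It is not the authors' implementation. The function names are hypothetical, the RBF kernel on toy feature vectors is only a stand-in for the molecular graph kernel, and the kernel-induced distance d(i, j) = sqrt(K_ii + K_jj - 2 K_ij) is a standard construction assumed here, not necessarily the distance used in the paper.

```python
import numpy as np

def toy_kernel(X, Y, length=1.0):
    """Stand-in for a molecular graph kernel (assumption): RBF on feature vectors."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq / length**2)

def gpr_fit(K_train, y_train, noise=1e-8):
    """Solve (K + noise * I) alpha = y using a Cholesky factorisation."""
    n = K_train.shape[0]
    L = np.linalg.cholesky(K_train + noise * np.eye(n))
    return np.linalg.solve(L.T, np.linalg.solve(L, y_train))

def gpr_predict(K_cross, alpha):
    """Posterior mean at test points; K_cross has shape (n_test, n_train)."""
    return K_cross @ alpha

def kernel_distance(K):
    """Pairwise distance induced by a positive semi-definite kernel matrix."""
    d = np.diag(K)
    sq = d[:, None] + d[None, :] - 2.0 * K
    return np.sqrt(np.clip(sq, 0.0, None))  # clip guards against round-off

# Toy data: six "molecules" described by one made-up feature each.
X_train = np.linspace(0.0, 5.0, 6)[:, None]
y_train = np.sin(X_train[:, 0])            # placeholder target values
X_test = np.array([[0.5], [2.5]])

K_train = toy_kernel(X_train, X_train)
alpha = gpr_fit(K_train, y_train)
y_pred = gpr_predict(toy_kernel(X_test, X_train), alpha)

# The distance matrix can be used to inspect how diverse the data set is.
D = kernel_distance(K_train)
```

With a real graph kernel, only `toy_kernel` would change. The regression and the distance computation operate on the kernel matrix alone, which is why a kernel-based distance is a natural tool for judging how well a training set covers a target region of molecular space.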