Weinreich Jan, Browning Nicholas J, von Lilienfeld O Anatole
University of Vienna, Faculty of Physics, Kolingasse 14-16, AT-1090 Wien, Austria.
Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials (MARVEL), Department of Chemistry, University of Basel, Klingelbergstrasse 80, CH-4056 Basel, Switzerland.
J Chem Phys. 2021 Apr 7;154(13):134113. doi: 10.1063/5.0041548.
Free energies govern the behavior of soft and liquid matter, and improving their predictions could have a large impact on the development of drugs, electrolytes, or homogeneous catalysts. Unfortunately, it is challenging to devise an accurate description of effects governing solvation such as hydrogen bonding, van der Waals interactions, or conformational sampling. We present a Free energy Machine Learning (FML) model applicable throughout chemical compound space and based on a representation that employs Boltzmann averages to account for an approximated sampling of configurational space. Using the FreeSolv database, FML's out-of-sample prediction errors of experimental hydration free energies decay systematically with training set size, and experimental uncertainty (0.6 kcal/mol) is reached after training on 490 molecules (80% of FreeSolv). Corresponding FML model errors are on par with state-of-the-art physics-based approaches. To generate the input representation for a new query compound, FML requires approximate and short molecular dynamics runs. We showcase its usefulness through analysis of solvation free energies for 116k organic molecules (all force-field compatible molecules in the QM9 database), identifying the most and least solvated systems and rediscovering quasi-linear structure-property relationships in terms of simple descriptors such as hydrogen-bond donors, number of NH or OH groups, number of oxygen atoms in hydrocarbons, and number of heavy atoms. FML's accuracy is maximal when the molecular dynamics temperature used to generate the averaged input representations in training is the same as that used for the query compounds. The prediction error converges rapidly with respect to the sampling time used for the representation.
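The idea of a Boltzmann-averaged representation feeding a regression model can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the abstract: the per-snapshot descriptor (sorted inverse interatomic distances), the Gaussian kernel ridge regression, the function names, the hyperparameters, and the synthetic trajectories and labels are all hypothetical stand-ins, and every toy molecule has the same number of atoms. The one point it is meant to show is that snapshots from an equilibrium molecular dynamics run are already Boltzmann-distributed, so a plain arithmetic mean of a descriptor over frames approximates its Boltzmann average.

```python
import numpy as np


def frame_descriptor(coords):
    """Hypothetical descriptor of one snapshot: sorted inverse pair distances."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    upper = np.triu_indices(len(coords), k=1)  # each atom pair once
    return np.sort(1.0 / dist[upper])[::-1]


def averaged_representation(trajectory):
    """Mean descriptor over MD frames (approximate Boltzmann average)."""
    return np.mean([frame_descriptor(frame) for frame in trajectory], axis=0)


def gaussian_kernel(A, B, sigma):
    """Gaussian kernel between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def krr_fit(X, y, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression weights."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)


def krr_predict(X_train, alpha, X_query, sigma):
    """Predict labels for query representations."""
    return gaussian_kernel(X_query, X_train, sigma) @ alpha


# Synthetic stand-in for short MD runs: a fixed geometry plus thermal noise.
rng = np.random.default_rng(0)
n_mol, n_frames, n_atoms = 20, 50, 5
trajectories = []
for _ in range(n_mol):
    base = rng.normal(scale=2.0, size=(n_atoms, 3))
    noise = rng.normal(scale=0.05, size=(n_frames, n_atoms, 3))
    trajectories.append(base[None] + noise)

X = np.array([averaged_representation(t) for t in trajectories])
y = X.sum(axis=1)  # synthetic labels, NOT real hydration free energies

# Train on 15 toy molecules and predict the remaining 5.
alpha = krr_fit(X[:15], y[:15], sigma=1.0, lam=1e-4)
pred = krr_predict(X[:15], alpha, X[15:], sigma=1.0)
```

In this sketch a query compound needs only its own short trajectory: `averaged_representation` is applied to it and the result is passed to `krr_predict`. Real molecules differ in atom count and composition, which a production representation has to handle and this toy descriptor does not.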