Grambow Colin A, Li Yi-Pei, Green William H
Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.
J Phys Chem A. 2019 Jul 11;123(27):5826-5835. doi: 10.1021/acs.jpca.9b04195. Epub 2019 Jun 27.
Machine learning provides promising new methods for accurate yet rapid prediction of molecular properties, including thermochemistry, which is an integral component of many computer simulations, particularly automated reaction mechanism generation. Often, very large data sets with tens of thousands of molecules are required for training the models, but most data sets of experimental or high-accuracy quantum mechanical quality are much smaller. To overcome these limitations, we calculate new high-level data sets and derive bond additivity corrections to significantly improve enthalpies of formation. We adopt a transfer learning technique to train neural network models that achieve good performance even with a relatively small set of high-accuracy data. The training data for the entropy model are carefully selected so that important conformational effects are captured. The resulting models are generally applicable thermochemistry predictors for organic compounds with oxygen and nitrogen heteroatoms that approach experimental and coupled cluster accuracy while only requiring molecular graph inputs. Due to their versatility and the ease of adding new training data, they are poised to replace conventional estimation methods for thermochemical parameters in reaction mechanism generation. Since high-accuracy data are often sparse, similar transfer learning approaches are expected to be useful for estimating many other molecular properties.
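The abstract mentions deriving bond additivity corrections to improve enthalpies of formation. The following is a minimal sketch of that general idea, not the authors' procedure: it assumes one additive correction per bond type, fitted by ordinary least squares so that the corrected enthalpy (calculated value plus the sum of bond counts times corrections) matches a reference value. The function names, the sign convention, the three bond types, and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_bond_corrections(bond_counts, h_calc, h_ref):
    """Least-squares fit of one additive correction per bond type.

    bond_counts: (n_molecules, n_bond_types) counts of each bond type
    h_calc:      (n_molecules,) enthalpies from the cheaper method
    h_ref:       (n_molecules,) reference enthalpies
    Assumed convention: h_ref ~= h_calc + bond_counts @ corrections.
    """
    counts = np.asarray(bond_counts, dtype=float)
    residual = np.asarray(h_ref, dtype=float) - np.asarray(h_calc, dtype=float)
    corrections, *_ = np.linalg.lstsq(counts, residual, rcond=None)
    return corrections


def apply_bond_corrections(bond_counts, h_calc, corrections):
    """Add the fitted per-bond corrections to the calculated enthalpies."""
    counts = np.asarray(bond_counts, dtype=float)
    return np.asarray(h_calc, dtype=float) + counts @ corrections


# Toy data, entirely made up: 5 molecules, 3 hypothetical bond types.
bond_counts = np.array([
    [4, 0, 0],
    [6, 1, 0],
    [3, 0, 1],
    [5, 1, 1],
    [8, 2, 0],
])
true_corrections = np.array([-0.1, 0.3, 0.5])            # arbitrary values
h_ref = np.array([-10.0, -20.0, -30.0, -40.0, -50.0])    # arbitrary values
# Build "calculated" values whose error is exactly bond-additive.
h_calc = h_ref - bond_counts @ true_corrections

corrections = fit_bond_corrections(bond_counts, h_calc, h_ref)
h_corrected = apply_bond_corrections(bond_counts, h_calc, corrections)
```

Because the toy errors are constructed to be exactly bond-additive, the fit recovers the corrections used to generate them. With real data the residual would not vanish, and the choice of bond types and any additional correction terms would need to follow the scheme described in the paper itself.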