Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA.
J Chem Phys. 2019 Jan 28;150(4):044107. doi: 10.1063/1.5078640.
Data-driven prediction of molecular properties presents unique challenges to the design of machine learning methods concerning data structure/dimensionality, symmetry adaption, and confidence management. In this paper, we present a kernel-based pipeline that can learn and predict the atomization energy of molecules with high accuracy. The framework employs Gaussian process regression to perform predictions based on the similarity between molecules, which is computed using the marginalized graph kernel. To apply the marginalized graph kernel, a spatial adjacency rule is first employed to convert molecules into graphs whose vertices and edges are labeled by elements and interatomic distances, respectively. We then derive formulas for the efficient evaluation of the kernel. Specific functional components for the marginalized graph kernel are proposed, while the effects of the associated hyperparameters on accuracy and predictive confidence are examined. We show that the graph kernel is particularly suitable for predicting extensive properties because its convolutional structure coincides with that of the covariance formula between sums of random variables. Using an active learning procedure, we demonstrate that the proposed method can achieve a mean absolute error of 0.62 ± 0.01 kcal/mol using as few as 2000 training samples on the QM7 dataset.
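The graph-construction step described above can be illustrated with a minimal sketch. The abstract does not state the exact spatial adjacency rule, so the fixed distance cutoff used here is an assumption, and the function name `molecule_to_graph`, the `cutoff` parameter, and the approximate water geometry are hypothetical choices made only for illustration. Vertices are labeled by element and edges by interatomic distance, as in the abstract.

```python
import math
from itertools import combinations


def molecule_to_graph(elements, coords, cutoff=2.0):
    """Convert a molecule into a labeled graph (illustrative sketch).

    Vertices are labeled by element symbols. An edge joins atoms i and j
    when their Euclidean distance is at most `cutoff`, and the edge is
    labeled by that distance. The fixed cutoff is an ASSUMPTION standing
    in for the paper's spatial adjacency rule.
    """
    if len(elements) != len(coords):
        raise ValueError("elements and coords must have the same length")

    edges = {}
    for i, j in combinations(range(len(elements)), 2):
        distance = math.dist(coords[i], coords[j])
        if distance <= cutoff:
            edges[(i, j)] = distance  # edge label: interatomic distance
    return list(elements), edges


# Approximate water geometry in angstroms (illustrative values only).
elements = ["O", "H", "H"]
coords = [
    (0.000, 0.000, 0.117),
    (0.000, 0.757, -0.470),
    (0.000, -0.757, -0.470),
]

vertices, edges = molecule_to_graph(elements, coords, cutoff=1.2)
for (i, j), distance in sorted(edges.items()):
    print(f"{vertices[i]}{i} - {vertices[j]}{j}: {distance:.3f}")
```

With the cutoff chosen here, the two O-H pairs become edges while the H-H pair does not. A graph kernel would then compare two such graphs through their vertex and edge labels.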