Department of Chemistry and Department of Computer Science, University of Toronto, Toronto, Ontario M5S, Canada.
Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland.
J Chem Theory Comput. 2019 Nov 12;15(11):6046-6060. doi: 10.1021/acs.jctc.9b00627. Epub 2019 Oct 11.
We employ Gaussian process (GP) regression to adjust for systematic errors in D3-type dispersion corrections. We refer to the associated, statistically improved model as D3-GP. It is trained on differences between interaction energies obtained from PBE-D3(BJ)/ma-def2-QZVPP and DLPNO-CCSD(T)/CBS calculations. We generated a data set containing interaction energies for 1248 molecular dimers, which resemble the dispersion-dominated systems contained in the S66 data set. Our systems represent not only equilibrium structures but also dimers with various relative orientations and conformations at both shorter and longer distances. A reparametrization of the D3(BJ) model based on 66 of these dimers suggests that two of its three empirical parameters, a1 and s8, are zero, whereas a2 = 5.6841 bohr. For the remaining 1182 dimers, we find that this new set of parameters is superior to all previously published D3(BJ) parameter sets. To train our D3-GP model, we engineered two different vectorial representations of (supra-)molecular systems, both derived from the matrix of atom-pairwise D3(BJ) interaction terms: (a) a distance-resolved interaction energy histogram, histD3(BJ), and (b) the eigenvalues of the interaction matrix ordered by decreasing absolute value, eigD3(BJ). Hence, the GP learns a mapping from D3(BJ) information only, which renders D3-GP-type dispersion corrections comparable to those obtained with the original D3 approach. They improve systematically if the underlying training set is selected carefully. Here, we harness the prediction variance obtained from GP regression to select optimal training sets in an automated fashion. The larger the variance, the more information the corresponding data point may add to the training set.
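The two representations described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the symmetric matrix of atom-pairwise D3(BJ) terms and the matching interatomic distance matrix are already available, and the function names, the zero-padding to a fixed length, and the bin edges are hypothetical choices made only for this example.

```python
import numpy as np

def eig_representation(pair_energy, size):
    """Eigenvalues of the symmetric pairwise matrix, sorted by decreasing
    absolute value and zero-padded (or truncated) to a fixed length `size`.
    The padding is an assumption so that systems of different size map to
    vectors of equal length."""
    eigvals = np.linalg.eigvalsh(pair_energy)
    order = np.argsort(-np.abs(eigvals))
    out = np.zeros(size)
    n = min(size, eigvals.size)
    out[:n] = eigvals[order][:n]
    return out

def hist_representation(pair_energy, distances, bin_edges):
    """Sum the pairwise terms into distance bins; each pair i<j counts once."""
    upper = np.triu_indices_from(pair_energy, k=1)
    hist, _ = np.histogram(distances[upper], bins=bin_edges,
                           weights=pair_energy[upper])
    return hist

# Toy three-atom system with made-up numbers (not real D3(BJ) values).
pair_energy = np.array([[0.0, -0.3, -0.2],
                        [-0.3, 0.0, -0.1],
                        [-0.2, -0.1, 0.0]])
distances = np.array([[0.0, 1.0, 2.5],
                      [1.0, 0.0, 3.5],
                      [2.5, 3.5, 0.0]])

eig_vec = eig_representation(pair_energy, size=5)
hist_vec = hist_representation(pair_energy, distances, bin_edges=[0.0, 2.0, 4.0])
```

Both vectors depend only on the pairwise terms and distances, which is consistent with the statement that the GP learns from D3(BJ) information only.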
For a given set of molecular systems, variance-based sampling can approximately determine the smallest subset being subjected to reference calculations such that all dispersion corrections for the remaining systems fall below a predefined accuracy threshold. To render the entire D3-GP workflow as efficient as possible, we present an improvement over our variance-based, sequential active-learning scheme [2018, 14, 5238]. Our refined learning algorithm selects multiple (instead of single) systems that can be subjected to reference calculations simultaneously. We refer to the underlying selection strategy as batchwise variance-based sampling (BVS). BVS-guided active learning is an essential component of our D3-GP workflow, which is implemented in a black-box fashion. Once provided with reference data for new molecular systems, the underlying GP model automatically learns to adapt to these and similar systems. This approach leads overall to a self-improving model (D3-GP) that predicts system-focused and GP-refined D3-type dispersion corrections for any given system of reference data.
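To illustrate the idea of selecting several systems per round from the GP prediction variance, the following sketch uses a greedy scheme: pick the candidate with the largest variance, condition the GP on its input, and repeat. This is one common way to build a batch and is an assumption here; the abstract does not specify how BVS assembles its batches. The greedy scheme works because the GP posterior variance depends only on the inputs and not on the (still unknown) reference values. The RBF kernel, its length scale, the noise level, and all function names are hypothetical choices.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Unit-amplitude squared-exponential kernel between rows of a and b."""
    sq = (np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def predictive_variance(x_train, x_cand, noise=1e-8):
    """GP posterior variance at the candidates (independent of the targets)."""
    if len(x_train) == 0:
        return np.ones(len(x_cand))  # prior variance of the unit-amplitude kernel
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_tc = rbf_kernel(x_train, x_cand)
    solved = np.linalg.solve(k_tt, k_tc)
    return 1.0 - np.sum(k_tc * solved, axis=0)

def select_batch(x_train, x_pool, batch_size, var_threshold):
    """Greedily pick up to batch_size pool indices with the largest variance,
    stopping early once every remaining variance is below var_threshold."""
    chosen = []
    taken = np.zeros(len(x_pool), dtype=bool)
    x_cur = np.array(x_train, dtype=float)
    for _ in range(batch_size):
        var = predictive_variance(x_cur, x_pool)
        var[taken] = -np.inf            # never pick the same system twice
        best = int(np.argmax(var))
        if var[best] < var_threshold:
            break
        chosen.append(best)
        taken[best] = True
        x_cur = np.vstack([x_cur, x_pool[best:best + 1]])
    return chosen

# Toy one-dimensional feature vectors standing in for molecular representations.
x_train = np.array([[0.0]])
x_pool = np.array([[0.0], [0.1], [3.0], [3.1], [6.0]])
batch = select_batch(x_train, x_pool, batch_size=3, var_threshold=0.05)
```

In this toy pool the two nearly identical candidates at 3.0 and 3.1 are not both selected: once one of them is in the batch, the variance of the other drops below the threshold, so the batch stops at two systems even though three were allowed. All systems in the returned batch could then be sent to reference calculations at the same time.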