Laboratory of Computational Science and Modeling, IMX, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland.
Machine Learning & Optimization Laboratory, IC, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland.
J Chem Theory Comput. 2019 Feb 12;15(2):906-915. doi: 10.1021/acs.jctc.8b00959. Epub 2019 Jan 18.
We present a scheme to obtain an inexpensive and reliable estimate of the uncertainty associated with the predictions of a machine-learning model of atomic and molecular properties. The scheme is based on resampling: multiple models are generated by subsampling the same training data. The accuracy of the uncertainty prediction can be benchmarked by maximum likelihood estimation, which can also be used to correct for correlations between resampled models and, through a cross-validation procedure, to improve the performance of the uncertainty estimation. For sparse Gaussian process regression models, this resampled estimator can be evaluated at negligible cost. We demonstrate the reliability of these estimates for the prediction of molecular and materials energetics and for the estimation of nuclear chemical shieldings in molecular crystals. Extension to the uncertainty in energy differences, forces, or other correlated predictions is straightforward. The method can be easily applied to other machine-learning schemes; it will help make data-driven predictions more reliable and facilitate training-set optimization and active-learning strategies.
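The resampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in closed-form ridge regression for the Gaussian process regression models, and it uses the simplest Gaussian maximum-likelihood rescaling, alpha^2 = mean((y - y_mean)^2 / sigma^2) on a held-out validation set, which may differ from the exact calibration formula of the paper. All function names (`fit_ridge`, `train_committee`, `ml_scale`), the synthetic data, and the parameter values are assumptions made for illustration.

```python
import numpy as np


def fit_ridge(X, y, reg=1e-3):
    """Closed-form ridge regression weights (stand-in for a GPR model)."""
    A = X.T @ X + reg * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)


def train_committee(X, y, n_models=16, frac=0.5, seed=0):
    """Fit n_models on random subsamples (drawn without replacement).

    Returns a weight matrix of shape (n_models, n_features).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_sub = max(1, int(frac * n))
    weights = []
    for _ in range(n_models):
        idx = rng.choice(n, size=n_sub, replace=False)
        weights.append(fit_ridge(X[idx], y[idx]))
    return np.array(weights)


def ml_scale(y_val, preds_val):
    """Scaling factor alpha that maximizes a Gaussian likelihood on validation data.

    preds_val has shape (n_models, n_val). This is the plain Gaussian
    result, assumed here for illustration.
    """
    mean = preds_val.mean(axis=0)
    var = preds_val.var(axis=0, ddof=1)
    return np.sqrt(np.mean((y_val - mean) ** 2 / var))


# Synthetic data: a noisy linear target (hypothetical, for demonstration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=300)
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:250], y[200:250]
X_test = X[250:]

W = train_committee(X_train, y_train)

# Calibrate the raw committee spread on the validation set.
alpha = ml_scale(y_val, W @ X_val.T)

# Committee mean is the prediction; the rescaled spread is the uncertainty.
preds_test = W @ X_test.T            # shape (n_models, n_test)
y_pred = preds_test.mean(axis=0)
sigma = alpha * preds_test.std(axis=0, ddof=1)
```

Because each committee member here is linear in its weights, evaluating all members costs one extra matrix product. This loosely mirrors the abstract's remark that the resampled estimator is cheap for sparse Gaussian process regression, though the sketch does not implement a sparse GPR.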