Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.
Institute of Materials Chemistry, TU Wien, 1060 Vienna, Austria.
J Chem Inf Model. 2023 Jul 10;63(13):4012-4029. doi: 10.1021/acs.jcim.3c00373. Epub 2023 Jun 20.
Characterizing uncertainty in machine learning models has recently gained interest in the context of machine learning reliability, robustness, safety, and active learning. Here, we separate the total uncertainty into contributions from noise in the data (aleatoric) and shortcomings of the model (epistemic), further dividing epistemic uncertainty into model bias and variance contributions. We systematically address the influence of noise, model bias, and model variance in the context of chemical property predictions, where the diverse nature of target properties and the vast chemical space give rise to many distinct sources of prediction error. We demonstrate that different sources of error can each be significant in different contexts and must be individually addressed during model development. Through controlled experiments on data sets of molecular properties, we show important trends in model performance associated with the level of noise in the data set, size of the data set, model architecture, molecule representation, ensemble size, and data set splitting. In particular, we show that 1) noise in the test set can limit a model's observed performance when the actual performance is much better, 2) using size-extensive model aggregation structures is crucial for extensive property prediction, and 3) ensembling is a reliable tool for uncertainty quantification and improvement specifically for the contribution of model variance. We develop general guidelines on how to improve an underperforming model depending on which uncertainty context it falls into.
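The following is a minimal, self-contained sketch (not the authors' code) of the decomposition described above, using synthetic data rather than molecular properties. All names (true_values, ensemble_preds, etc.) are illustrative. It mirrors two of the abstract's points: label noise puts a floor on the observed error, and ensembling reduces specifically the model-variance contribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_models = 10_000, 8

# Noiseless targets plus aleatoric label noise.
true_values = rng.normal(size=n_samples)
noise_sigma = 0.10
observed = true_values + rng.normal(scale=noise_sigma, size=n_samples)

# Each ensemble member = truth + shared systematic offset (model bias)
#                        + member-specific perturbation (model variance).
bias_offset, member_sigma = 0.05, 0.20
ensemble_preds = (true_values[None, :] + bias_offset
                  + rng.normal(scale=member_sigma, size=(n_models, n_samples)))

ensemble_mean = ensemble_preds.mean(axis=0)

# Epistemic contributions estimated from the ensemble (bias is recoverable
# here only because the synthetic ground truth is known).
member_variance = ensemble_preds.var(axis=0, ddof=1).mean()
bias_sq = (ensemble_mean - true_values).mean() ** 2
noise_var = noise_sigma ** 2  # aleatoric contribution, irreducible by the model

single_member_mse = ((ensemble_preds[0] - observed) ** 2).mean()
ensemble_mse = ((ensemble_mean - observed) ** 2).mean()

print(f"bias^2 + variance + noise ~ {bias_sq + member_variance + noise_var:.4f}")
print(f"single-member MSE         = {single_member_mse:.4f}")
print(f"ensemble-mean MSE         = {ensemble_mse:.4f} "
      f"(variance term shrinks roughly by 1/{n_models})")
print(f"noise floor (aleatoric)   = {noise_var:.4f}")
```

A single member's squared error against the noisy labels roughly matches bias² + member variance + noise, while averaging the ensemble cuts the variance term by about the ensemble size, leaving the bias and noise terms untouched.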