Fan Ya Ju, Allen Jonathan E, McLoughlin Kevin S, Shi Da, Bennion Brian J, Zhang Xiaohua, Lightstone Felice C
Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, 7000 East Ave., Livermore, CA, USA.
Biological Science and Security Center, Lawrence Livermore National Laboratory, Livermore, CA, USA.
Artif Intell Chem. 2023 Jun;1(1). doi: 10.1016/j.aichem.2023.100004. Epub 2023 Jun 3.
Neural network (NN) models have the potential to speed up the drug discovery process and reduce its failure rates. For NN models to succeed, they require uncertainty quantification (UQ), because drug discovery explores chemical space beyond the training data distribution. Standard NN models do not provide uncertainty information. Some UQ methods require changing the NN architecture or training procedure, which limits the choice of NN models. Moreover, predictive uncertainty can come from different sources. It is important to model the different types of predictive uncertainty separately, because the model can take different actions depending on the source of uncertainty. In this paper, we examine UQ methods that estimate different sources of predictive uncertainty for NN models that predict protein-ligand binding. We use our prior knowledge of chemical compounds to design the experiments. Using a visualization method, we create non-overlapping and chemically diverse partitions from a collection of chemical compounds. These partitions are used as training and test set splits to explore NN model uncertainty. We demonstrate how the uncertainties estimated by the selected methods describe different sources of uncertainty under different partitions and featurization schemes, and how they relate to prediction error.
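As an illustration of what modeling sources of predictive uncertainty separately can look like, the sketch below splits the predictive variance of an ensemble into two parts using the law of total variance. The abstract does not say which UQ methods were examined, so the ensemble setup, the labels "aleatoric" (data noise) and "epistemic" (model disagreement), and the function name `decompose_uncertainty` are all assumptions made for this example rather than the authors' method.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(means, variances):
    """Split the predictive variance for one compound into two sources.

    Assumed (hypothetical) setup: an ensemble of NN models, each returning
    a predicted binding value and a predicted noise variance.

    means[i]     : binding value predicted by ensemble member i
    variances[i] : noise variance predicted by ensemble member i
    """
    if not means or len(means) != len(variances):
        raise ValueError("means and variances must be non-empty and of equal length")

    # Average noise the members attribute to the data itself.
    aleatoric = fmean(variances)
    # Spread of the members' predictions, i.e. how much the models disagree.
    epistemic = pvariance(means)

    return {
        "prediction": fmean(means),
        "aleatoric": aleatoric,
        "epistemic": epistemic,
        "total": aleatoric + epistemic,  # law of total variance
    }


if __name__ == "__main__":
    # Made-up numbers for a single compound, for illustration only.
    result = decompose_uncertainty([6.0, 6.2, 5.8], [0.10, 0.12, 0.08])
    for name, value in result.items():
        print(f"{name}: {value:.4f}")
```

Under these assumptions, a compound far from the training partition would be expected to show a large "epistemic" term, whereas noisy measurements would mainly raise the "aleatoric" term, which is one reason the two may call for different follow-up actions.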