Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom.
J Chem Inf Model. 2019 Jul 22;59(7):3330-3339. doi: 10.1021/acs.jcim.9b00297. Epub 2019 Jun 26.
While the use of deep learning in drug discovery is attracting increasing attention, the lack of methods to compute reliable errors in prediction for Neural Networks prevents their application to guide decision making in domains where identifying unreliable predictions is essential, e.g., precision medicine. Here, we present a framework to compute reliable errors in prediction for Neural Networks using Test-Time Dropout and Conformal Prediction. Specifically, the algorithm consists of training a single Neural Network using dropout and then applying it repeatedly to both the validation and test sets, with dropout also enabled in this step. Thus, an ensemble of predictions is generated for each instance in the validation and test sets. The residuals and absolute errors in prediction for the validation set are then used to compute prediction errors for the test set instances using Conformal Prediction. Using 24 bioactivity data sets from ChEMBL 23, we show that Dropout Conformal Predictors are valid (i.e., the fraction of instances whose true value lies within the predicted interval strongly correlates with the confidence level) and efficient, as the predicted confidence intervals span a narrower set of values than those computed with Conformal Predictors generated using Random Forest (RF) models. Lastly, we show in retrospective virtual screening experiments that dropout and RF-based Conformal Predictors lead to comparable retrieval rates of active compounds. Overall, we propose a computationally efficient framework (as only extra forward passes are required in addition to training a single network) that harnesses Test-Time Dropout and Conformal Prediction and is generally applicable for generating reliable prediction errors for Deep Neural Networks in drug discovery and beyond.
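The calibration step described in the abstract can be illustrated with a minimal sketch. It assumes the repeated dropout forward passes have already been collected into arrays of shape (passes, instances), and it assumes a particular normalised nonconformity score (absolute error of the ensemble mean divided by the ensemble standard deviation); the abstract does not state the exact scaling, so this is one plausible choice rather than the authors' definition. The function name `dropout_conformal_intervals` and the synthetic data are hypothetical.

```python
import numpy as np


def dropout_conformal_intervals(val_preds, y_val, test_preds,
                                confidence=0.8, eps=1e-8):
    """Inductive conformal intervals from test-time dropout ensembles.

    val_preds, test_preds: shape (n_passes, n_instances), one row per
    stochastic forward pass with dropout enabled.
    """
    val_preds = np.asarray(val_preds, dtype=float)
    test_preds = np.asarray(test_preds, dtype=float)
    y_val = np.asarray(y_val, dtype=float)

    # Point prediction = ensemble mean; spread = ensemble standard deviation.
    val_mean, val_std = val_preds.mean(axis=0), val_preds.std(axis=0)
    test_mean, test_std = test_preds.mean(axis=0), test_preds.std(axis=0)

    # ASSUMED nonconformity score: absolute error scaled by ensemble spread.
    scores = np.sort(np.abs(y_val - val_mean) / (val_std + eps))

    # Finite-sample conformal quantile: the ceil((n + 1) * confidence)-th
    # smallest validation score.
    n = scores.size
    k = int(np.ceil((n + 1) * confidence))
    if k > n:
        raise ValueError("validation set too small for this confidence level")
    alpha = scores[k - 1]

    # Instances with a wider dropout spread receive a wider interval.
    half_width = alpha * (test_std + eps)
    return test_mean - half_width, test_mean + half_width


# Synthetic stand-in for dropout ensembles (no real network or ChEMBL data).
rng = np.random.default_rng(0)
n_passes = 50


def simulate(n):
    mu = rng.normal(size=n)                    # underlying signal
    s = rng.uniform(0.5, 2.0, size=n)          # instance-specific noise scale
    y = mu + s * rng.normal(size=n)            # observed values
    preds = mu + s * rng.normal(size=(n_passes, n))  # mock dropout passes
    return preds, y


val_preds, y_val = simulate(500)
test_preds, y_test = simulate(500)
lower, upper = dropout_conformal_intervals(val_preds, y_val, test_preds, 0.8)

# Empirical validity check: fraction of true values inside the interval.
coverage = np.mean((y_test >= lower) & (y_test <= upper))
```

On exchangeable validation and test data, `coverage` should land close to the requested confidence level, which is the notion of validity used in the abstract. In a real setting the prediction arrays would come from running the trained network several times with dropout left active, not from the simulator above.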