Ahmed Sabeen, Dera Dimah, Hassan Saud Ul, Bouaynaya Nidhal, Rasool Ghulam
Department of Electrical and Computer Engineering, Rowan University, Glassboro, NJ, United States.
University of Texas Rio Grande Valley, Brownsville, TX, United States.
Front Med Technol. 2022 Jul 22;4:919046. doi: 10.3389/fmedt.2022.919046. eCollection 2022.
Deep neural networks (DNNs) have started to find their role in the modern healthcare system. DNNs are being developed for diagnosis, prognosis, treatment planning, and outcome prediction for various diseases. As the number of DNN applications in modern healthcare grows, their trustworthiness and reliability become increasingly important. An essential aspect of trustworthiness is detecting the performance degradation and failure of deployed DNNs in medical settings. The softmax output values produced by DNNs are not a calibrated measure of model confidence: softmax probabilities are generally higher than the model's actual confidence, and this confidence-accuracy gap widens further for incorrect predictions and noisy inputs. We employ recently proposed Bayesian deep neural networks (BDNNs) to learn uncertainty in the model parameters. These models output predictions together with a measure of confidence in those predictions. By testing these models under various noisy conditions, we show that the learned predictive confidence is well calibrated. We use these reliable confidence values to monitor performance degradation and to detect failures in DNNs. We propose two failure detection methods. In the first, we define a fixed threshold value based on the behavior of the predictive confidence as the signal-to-noise ratio (SNR) of the test dataset changes. The second method learns the threshold value with a neural network. The proposed failure detection mechanisms abstain from making a decision when the confidence of the BDNN falls below the defined threshold and hold the sample for manual review. As a result, the accuracy of the models improves on unseen test samples. We tested our approach on three medical imaging datasets, PathMNIST, DermaMNIST, and OrganAMNIST, under different levels and types of noise. An increase in the noise of the test images increases the number of abstained samples.
BDNNs are inherently robust and show more than 10% accuracy improvement with the proposed failure detection methods. The increased number of abstained samples or an abrupt increase in the predictive variance indicates model performance degradation or possible failure. Our work has the potential to improve the trustworthiness of DNNs and enhance user confidence in the model predictions.
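The abstention mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the BDNN's predictive distribution is approximated by averaging softmax outputs over T stochastic forward passes (a common Monte Carlo approach), and the function name and threshold value are hypothetical.

```python
import numpy as np

def predict_with_abstention(mc_probs, threshold):
    """Abstain from low-confidence predictions (illustrative sketch).

    mc_probs : array of shape (T, N, C) holding softmax outputs from
        T stochastic forward passes of a BDNN over N samples, C classes.
    threshold : minimum predictive confidence required to emit a decision
        (fixed a priori or learned, as in the two proposed methods).

    Returns (predictions, confidences, abstain_mask); abstained samples
    are held for manual review instead of being decided automatically.
    """
    mean_probs = mc_probs.mean(axis=0)       # Monte Carlo predictive mean
    confidences = mean_probs.max(axis=1)     # confidence of the top class
    predictions = mean_probs.argmax(axis=1)
    abstain_mask = confidences < threshold   # below threshold -> abstain
    return predictions, confidences, abstain_mask
```

Under this scheme, reported accuracy improves because it is computed only on the retained (non-abstained) samples, while a rising abstention rate or a sudden jump in the variance across the T passes serves as the degradation signal.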