Faculty of Computer Science, University of Koblenz-Landau, 56070 Koblenz, Germany.
Fraunhofer Institute for Software and Systems Engineering ISST, 44227 Dortmund, Germany.
Sensors (Basel). 2022 Mar 14;22(6):2245. doi: 10.3390/s22062245.
Due to significant advances in Natural Language Processing and Computer Vision models, Visual Question Answering (VQA) systems are becoming increasingly intelligent and capable. However, they remain error-prone when dealing with relatively complex questions. It is therefore important to understand the behaviour of VQA models before adopting their results. In this paper, we introduce an interpretability approach for VQA models based on generating counterfactual images. Specifically, the generated image should differ minimally from the original image while leading the VQA model to give a different answer. In addition, our approach ensures that the generated image is realistic. Since quantitative metrics cannot be employed to evaluate the interpretability of the model, we carried out a user study to assess different aspects of our approach. Beyond interpreting the results of VQA models on single images, the obtained results and the accompanying discussion provide an extensive explanation of the behaviour of VQA models.
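To make the idea concrete, the following is a minimal sketch of the kind of objective such a counterfactual generator could optimise: an answer-flip term that lowers the probability of the model's original answer, a minimality term that keeps the counterfactual close to the original image, and a realism term supplied by a discriminator. All names here (vqa_model, discriminator, the lambda weights) are hypothetical illustrations, not the paper's actual formulation.

    import torch

    def counterfactual_loss(vqa_model, discriminator, x_orig, x_cf, question,
                            orig_answer_idx, lambda_dist=1.0, lambda_real=0.1):
        """Hypothetical objective for optimising a counterfactual image x_cf.

        vqa_model, discriminator, and the weights are illustrative assumptions;
        the paper's method may use a different generator and constraints.
        """
        # Answer-flip term: drive down the probability of the original answer.
        logits = vqa_model(x_cf, question)
        p_orig = torch.softmax(logits, dim=-1)[..., orig_answer_idx]
        flip_loss = -torch.log(1.0 - p_orig + 1e-8)

        # Minimality term: keep the counterfactual close to the original image.
        dist_loss = torch.norm(x_cf - x_orig, p=2)

        # Realism term: a discriminator score keeps the image plausible.
        real_loss = -torch.log(discriminator(x_cf) + 1e-8)

        return flip_loss.mean() + lambda_dist * dist_loss + lambda_real * real_loss.mean()

Under this sketch, gradient descent on x_cf would push the image toward the nearest realistic input that changes the model's answer, which matches the three properties the abstract names: minimal change, a flipped answer, and realism.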