Guo Yangyang, Jiao Fangkai, Shen Zhiqi, Nie Liqiang, Kankanhalli Mohan
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):10284-10296. doi: 10.1109/TPAMI.2024.3437288. Epub 2024 Nov 6.
Teaching Visual Question Answering (VQA) models to refrain from answering unanswerable questions is necessary for building a trustworthy AI system. Although existing studies have explored various aspects of VQA, they have largely overlooked this particular attribute. This paper aims to bridge the research gap by contributing a comprehensive dataset, called UNK-VQA. The dataset is specifically designed to address the challenge of questions whose answers the models do not know. To this end, we first augment the existing data via deliberate perturbations on either the image or the question. Specifically, we carefully ensure that the question-image semantics remain close to the original unperturbed distribution. As a result, identifying unanswerable questions becomes challenging, setting our dataset apart from others that rely on mere image replacement. We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models and uncover their significant limitations when applied to our dataset. In addition, we propose a straightforward method to tackle these unanswerable questions. We believe this dataset will serve as a valuable benchmark for enhancing the abstention capability of VQA models, thereby increasing the trustworthiness of AI systems. We have made the dataset available to facilitate further exploration in this area.