IEEE Trans Image Process. 2023;32:3367-3382. doi: 10.1109/TIP.2023.3276570. Epub 2023 Jun 19.
Text-based Visual Question Answering (TextVQA) aims to answer questions about images that contain multiple scene texts. In most cases, scene texts are naturally attached to the surfaces of objects, so spatial reasoning between texts and objects is crucial for TextVQA. However, existing approaches are constrained to the 2D spatial information learned from the input images and rely on transformer-based architectures to reason implicitly during the fusion process. Under this setting, 2D spatial reasoning cannot distinguish the fine-grained spatial relations between visual objects and scene texts on the same image plane, which impairs both the interpretability and the performance of TextVQA models. In this paper, we introduce 3D geometric information into the spatial reasoning process to capture the contextual knowledge of key objects step by step. Specifically, (i) we propose a relation prediction module for accurately locating the region of interest of critical objects, and (ii) we design a depth-aware attention calibration module for calibrating the attention over OCR tokens according to the critical objects. Extensive experiments show that our method achieves state-of-the-art performance on the TextVQA and ST-VQA datasets. More encouragingly, our model surpasses others by clear margins of 5.7% and 12.1% on the spatial-reasoning questions of the TextVQA and ST-VQA validation splits. We further verify the generalizability of our model on the text-based image captioning task.
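To make the idea of depth-aware attention calibration concrete, the minimal sketch below shows one plausible form it could take: biasing attention over OCR tokens toward those whose estimated depth is close to that of a critical object, reflecting the observation that texts attach to object surfaces. The function name, tensor shapes, and the use of a simple depth-gap bias are illustrative assumptions and are not taken from the paper's implementation.

```python
# Hypothetical sketch of depth-aware attention calibration (not the paper's code).
# Assumes per-OCR-token depth estimates and a depth estimate for the critical object.
import torch
import torch.nn.functional as F


def depth_aware_calibration(attn_logits, ocr_depth, object_depth, temperature=1.0):
    """Calibrate attention over OCR tokens using depth proximity to a critical object.

    attn_logits  : (batch, n_ocr)  raw attention scores over OCR tokens
    ocr_depth    : (batch, n_ocr)  estimated depth of each OCR token region
    object_depth : (batch, 1)      estimated depth of the critical object region
    Returns attention weights that favor OCR tokens whose depth is close to the
    critical object's depth (i.e., texts likely attached to its surface).
    """
    depth_gap = (ocr_depth - object_depth).abs()   # (batch, n_ocr)
    bias = -depth_gap / temperature                # smaller gap -> larger bias
    return F.softmax(attn_logits + bias, dim=-1)


if __name__ == "__main__":
    # Toy usage with random tensors.
    logits = torch.randn(2, 5)
    ocr_d = torch.rand(2, 5)
    obj_d = torch.rand(2, 1)
    print(depth_aware_calibration(logits, ocr_d, obj_d))
```

The design choice here is simply an additive bias on the attention logits before the softmax, which leaves the rest of a transformer-based fusion pipeline unchanged; the actual module in the paper may compute and apply depth information differently.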