Shanghai Institute of Technical Physics of the Chinese Academy of Sciences, Shanghai 200083, China.
School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China.
Sensors (Basel). 2022 Feb 17;22(4):1575. doi: 10.3390/s22041575.
Collaborative reasoning for knowledge-based visual question answering is challenging but vital for understanding the features of images and questions. Previous methods either jointly fuse all kinds of features with an attention mechanism or use handcrafted rules to generate a layout for compositional reasoning; both approaches lack an explicit visual reasoning process and introduce a large number of parameters for predicting the correct answer. To conduct visual reasoning on arbitrary image-question pairs, we propose a novel reasoning model, a question-guided tree structure with a knowledge base (QGTSKB), that addresses these problems. Our model consists of four neural module networks: an attention model that locates attended regions from the image features and question embeddings via an attention mechanism, a gated reasoning model that forgets and updates the fused features, a fusion reasoning model that mines high-level semantics from the attended visual features and the knowledge base, and a knowledge-based fact model that compensates for missing visual and textual information with external knowledge. Our model therefore performs visual analysis and reasoning based on tree structures, a knowledge base, and the four neural module networks. Experimental results show that our model outperforms existing methods on the VQA v2.0 and CLEVR datasets, and visual reasoning experiments demonstrate the interpretability of the model.
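Below is a minimal PyTorch-style sketch of how the four neural module networks described in the abstract could be composed at one node of a question-guided tree. All class names, interfaces, feature dimensions, and the composition order are illustrative assumptions inferred from this abstract alone, not the authors' implementation of QGTSKB.

import torch
import torch.nn as nn


class AttentionModule(nn.Module):
    """Locates attended regions from image features guided by the question embedding."""
    def __init__(self, v_dim, q_dim, hid_dim):
        super().__init__()
        self.proj_v = nn.Linear(v_dim, hid_dim)
        self.proj_q = nn.Linear(q_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, v, q):                              # v: (B, R, v_dim), q: (B, q_dim)
        joint = torch.tanh(self.proj_v(v) + self.proj_q(q).unsqueeze(1))
        alpha = torch.softmax(self.score(joint), dim=1)   # attention weights over R regions
        return (alpha * v).sum(dim=1)                     # attended visual feature (B, v_dim)


class GatedReasoningModule(nn.Module):
    """Forgets and updates the fused features with a GRU-style gate."""
    def __init__(self, dim):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)

    def forward(self, fused, state):
        return self.cell(fused, state)


class FusionReasoningModule(nn.Module):
    """Mines higher-level semantics from attended visual features and a retrieved fact."""
    def __init__(self, v_dim, k_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(v_dim + k_dim, out_dim), nn.ReLU())

    def forward(self, v_att, fact):
        return self.mlp(torch.cat([v_att, fact], dim=-1))


class KnowledgeFactModule(nn.Module):
    """Embeds an external knowledge-base fact to supplement visual/textual information."""
    def __init__(self, fact_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(fact_dim, out_dim)

    def forward(self, fact):
        return torch.relu(self.proj(fact))


# Toy composition at one tree node (batch size, region count, and dimensions are arbitrary).
B, R = 2, 36
v = torch.randn(B, R, 2048)            # region-level image features
q = torch.randn(B, 1024)               # question embedding
fact = torch.randn(B, 300)             # retrieved knowledge-base fact embedding

att = AttentionModule(2048, 1024, 512)
kb = KnowledgeFactModule(300, 2048)
fuse = FusionReasoningModule(2048, 2048, 1024)
gate = GatedReasoningModule(1024)

state = torch.zeros(B, 1024)           # node state passed up the question-guided tree
state = gate(fuse(att(v, q), kb(fact)), state)

The point of the sketch is only to show the division of labor implied by the abstract: attention selects regions, the knowledge-fact module injects external evidence, fusion combines the two, and the gated module decides what to keep or forget before the result is propagated along the tree.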