IEEE Trans Pattern Anal Mach Intell. 2021 Mar;43(3):887-901. doi: 10.1109/TPAMI.2019.2943456. Epub 2021 Feb 4.
Collaborative reasoning for understanding image-question pairs is a critical but underexplored topic in interpretable visual question answering systems. Although recent studies have attempted to use explicit compositional processes to assemble the multiple subtasks embedded in a question, their models rely heavily on annotations or handcrafted rules to obtain valid reasoning processes, which leads either to heavy annotation workloads or to poor performance on compositional reasoning. In this paper, to better align the image and language domains in diverse and unrestricted cases, we propose a novel neural network model that performs global reasoning on a dependency tree parsed from the question; accordingly, our model is called a parse-tree-guided reasoning network (PTGRN). This network consists of three collaborative modules: i) an attention module that exploits the local visual evidence of each word parsed from the question, ii) a gated residual composition module that composes the previously mined evidence, and iii) a parse-tree-guided propagation module that passes the mined evidence along the parse tree. PTGRN is thus capable of building an interpretable visual question answering (VQA) system that gradually derives image cues following question-driven parse-tree reasoning. Experiments on relational datasets demonstrate the superiority of PTGRN over current state-of-the-art VQA methods, and the visualization results highlight the explainable capability of our reasoning system.
Title: An Interpretable Visual Question Answering System Based on Parse-Tree-Guided Reasoning
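The three collaborative modules described above can be sketched in plain NumPy. This is a minimal illustration under simplified assumptions, not the authors' implementation: the dot-product attention, elementwise sigmoid gate, and mean-pooling of child evidence are stand-ins for the learned components, and all names (`attend`, `gated_residual_compose`, `propagate`) are hypothetical.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(word_vec, regions):
    # Attention module: weight image-region features by their
    # similarity to the parsed word, returning local visual evidence.
    scores = regions @ word_vec          # (R,) dot-product scores
    weights = softmax(scores)            # attention distribution over regions
    return weights @ regions             # (D,) attended evidence vector

def gated_residual_compose(node_feat, child_feats):
    # Gated residual composition module: merge a node's local evidence
    # with evidence already mined at its children (simplified gate).
    if not child_feats:
        return node_feat                 # leaf: nothing to compose
    pooled = np.mean(child_feats, axis=0)
    gate = 1.0 / (1.0 + np.exp(-(node_feat * pooled)))  # elementwise sigmoid
    return node_feat + gate * pooled     # residual connection, gated

def propagate(tree, node, word_vecs, regions):
    # Parse-tree-guided propagation module: post-order traversal so
    # evidence flows from the leaves of the dependency tree to the root.
    child_feats = [propagate(tree, c, word_vecs, regions)
                   for c in tree.get(node, [])]
    local = attend(word_vecs[node], regions)
    return gated_residual_compose(local, child_feats)

# Toy example: a two-word dependency tree rooted at node 0,
# e.g. "color" governing "bag" in "What color is the bag?"
rng = np.random.default_rng(0)
tree = {0: [1], 1: []}                   # node 0 has one child, node 1
word_vecs = rng.normal(size=(2, 4))      # one embedding per parsed word
regions = rng.normal(size=(5, 4))        # five image-region features
answer_feat = propagate(tree, 0, word_vecs, regions)
print(answer_feat.shape)                 # (4,)
```

The root's output vector aggregates evidence from the whole parse tree and would, in a full system, be fed to an answer classifier; the intermediate attention weights at each node are what make the reasoning process inspectable.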