Su Ke, Su Hang, Li Jianguo, Zhu Jun
THBI Lab, Department of Computer Science and Technology, BNRist Center, Institute for AI, Tsinghua University, Beijing, China.
Intel Labs China, Beijing, China.
Front Robot AI. 2020 Aug 21;7:109. doi: 10.3389/frobt.2020.00109. eCollection 2020.
Visual reasoning is a critical stage in visual question answering (VQA) (Antol et al., 2015), but most state-of-the-art methods treat VQA as a classification problem without taking the reasoning process into account. Various approaches have been proposed to solve this multi-modal task, which requires both comprehension and reasoning. The recently proposed neural module network (Andreas et al., 2016b), which assembles the model from a few primitive modules, can perform spatial or arithmetical reasoning over the input image to answer questions. Nevertheless, its performance is unsatisfactory, especially on real-world datasets (e.g., VQA 1.0 and 2.0), due to its limited set of primitive modules and suboptimal layouts. To address these issues, we propose the Dual-Path Neural Module Network, which can implement complex visual reasoning by forming a more flexible layout regularized by a pairwise loss. Specifically, we first use a region proposal network to generate both visual and spatial information, which helps the model perform spatial reasoning. We then process a pair of different images together with the same question simultaneously, termed a "complementary pair," which encourages the model to learn a more reasonable layout by suppressing overfitting to language priors. The model jointly learns the parameters of the primitive modules and the layout generation policy, which is further boosted by a novel pairwise reward. Extensive experiments show that our approach significantly improves the performance of neural module networks, especially on real-world datasets.
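The abstract does not give the form of the pairwise reward, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes that one layout is shared by both images of a complementary pair, that the two images have different ground-truth answers to the same question, and that the reward is the mean log-probability of the two correct answers. The function name `pairwise_reward` and all probability values are hypothetical.

```python
import math


def pairwise_reward(probs_a, probs_b, answer_a, answer_b):
    """Hypothetical pairwise reward for a layout shared by a complementary pair.

    probs_a / probs_b map each candidate answer to the probability the model
    assigns to it for image A / image B under the same question and layout.
    The reward is the mean log-probability of the two ground-truth answers
    (an assumed form, chosen only for illustration).
    """
    return 0.5 * (math.log(probs_a[answer_a]) + math.log(probs_b[answer_b]))


# Same question for both images; the ground-truth answers differ (assumed).
answer_a, answer_b = "yes", "no"

# A model that relies on the language prior gives the same output for both images.
prior_only = pairwise_reward(
    {"yes": 0.9, "no": 0.1}, {"yes": 0.9, "no": 0.1}, answer_a, answer_b
)

# A model whose output depends on the image content.
grounded = pairwise_reward(
    {"yes": 0.8, "no": 0.2}, {"yes": 0.3, "no": 0.7}, answer_a, answer_b
)

print(f"prior-only reward: {prior_only:.3f}")
print(f"grounded reward:   {grounded:.3f}")
```

Under these assumptions, a layout whose prediction ignores the image cannot score well on both halves of the pair, which is one way a pairwise signal could discourage overfitting to language priors.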