Department of Computer Science, Kyonggi University, Suwon 16227, Korea.
Sensors (Basel). 2021 Jan 30;21(3):931. doi: 10.3390/s21030931.
Visual dialog demonstrates several important aspects of multimodal artificial intelligence; however, it is hindered by visual grounding and visual coreference resolution problems. To overcome these problems, we propose a novel neural module network for visual dialog (NMN-VD). NMN-VD is an efficient question-customized modular network that analyzes each input question and assembles only the modules required to determine the answer. In particular, the model includes a module that uses a reference pool to effectively locate the visual region indicated by a pronoun, thereby addressing visual coreference resolution, an important challenge in visual dialog. In addition, the proposed NMN-VD model includes a method for distinguishing impersonal pronouns, which do not require visual coreference resolution, from general pronouns and handling them accordingly. Furthermore, the model contains a new module that effectively handles the comparison questions found in visual dialogs, as well as a module that applies a triple-attention mechanism to resolve the visual grounding problem between the question and the image. The results of various experiments conducted on a large-scale benchmark dataset verify the efficacy and high performance of the proposed NMN-VD model.
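The abstract does not specify the internals of the triple-attention grounding module, but the general idea of attending across question, dialog history, and image regions can be illustrated with a minimal sketch. The fusion step (a simple sum), the feature dimensions, and the function names below are all illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    """Scaled dot-product attention of one query vector over a set of keys/values."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = softmax(scores)
    return weights @ values

def triple_attention(question, history, image_regions):
    """Hypothetical three-way grounding sketch: the question first attends to
    the dialog history, then the history-aware question attends to image
    regions to produce a grounded visual feature."""
    # 1) the question summarizes the relevant history turns
    hist_ctx = attend(question, history, history)
    # 2) fuse the question with its history context (sum used as a placeholder)
    fused = question + hist_ctx
    # 3) the fused query attends over image-region features
    return attend(fused, image_regions, image_regions)

# toy usage: 4 history turns, 6 image regions, 8-dimensional features
rng = np.random.default_rng(0)
q = rng.normal(size=8)          # question embedding
h = rng.normal(size=(4, 8))     # dialog-history embeddings
v = rng.normal(size=(6, 8))     # image-region features
out = triple_attention(q, h, v)
print(out.shape)
```

In the actual model, such a grounded feature would feed the downstream answer-decision modules; here the sketch only shows how three modalities can be chained through two attention steps.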