IEEE Trans Pattern Anal Mach Intell. 2022 Oct;44(10):7190-7204. doi: 10.1109/TPAMI.2021.3093360. Epub 2022 Sep 14.
Current vision-and-language tasks usually take complete visual data (e.g., raw images or videos) as input; however, practical scenarios often involve situations where part of the visual information becomes inaccessible for various reasons, e.g., a restricted view from a fixed camera or intentional occlusion for security concerns. As a step toward such more practical application scenarios, we introduce a novel task that aims to describe a video, given incomplete visual data, using the natural language dialog between two agents as a supplementary information source. Unlike most existing vision-language tasks, in which AI systems have full access to images or video clips that may reveal sensitive information such as recognizable human faces or voices, we intentionally limit the visual input to the AI system and seek a more secure and transparent information medium, i.e., natural language dialog, to supplement the missing visual information. Specifically, one of the intelligent agents, Q-BOT, is given two semantically segmented frames from the beginning and the end of the video, together with a finite number of opportunities to ask relevant natural language questions before describing the unseen video. A-BOT, the other agent, which has access to the entire video, assists Q-BOT in accomplishing this goal by answering the questions. We introduce two experimental settings with either a generative (i.e., agents generate questions and answers freely) or a discriminative (i.e., agents select questions and answers from candidates) internal dialog generation process. With the proposed unified QA-Cooperative networks, we experimentally demonstrate the knowledge transfer process between the two dialog agents and the effectiveness of natural language dialog as a supplement for incomplete visual information.
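To make the two-agent protocol concrete, the following is a minimal Python sketch of the QA loop described in the abstract: Q-BOT sees only the first and last semantically segmented frames, spends a fixed budget of questions, and then describes the unseen video, while A-BOT answers from the full video. It is not the authors' QA-Cooperative implementation; all class and function names (QBot, ABot, cooperative_dialog, etc.) are hypothetical placeholders, and the question, answer, and description steps are stubs standing in for learned models.

# Minimal sketch of the Q-BOT / A-BOT dialog protocol (hypothetical names).
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class DialogState:
    # Accumulated (question, answer) pairs shared by both agents.
    history: List[Tuple[str, str]] = field(default_factory=list)

class QBot:
    """Sees only two semantically segmented frames (start and end of the video)."""
    def __init__(self, seg_frames):
        self.seg_frames = seg_frames  # [first_frame_seg, last_frame_seg]

    def ask(self, state: DialogState) -> str:
        # Generative setting: produce a free-form question conditioned on the
        # segmented frames and the dialog history (placeholder logic).
        return f"question_{len(state.history) + 1}"

    def describe(self, state: DialogState) -> str:
        # After the question budget is exhausted, generate the video description
        # from the partial visual input plus the accumulated dialog.
        return "generated video description"

class ABot:
    """Has access to the entire video and answers Q-BOT's questions."""
    def __init__(self, video):
        self.video = video

    def answer(self, question: str, state: DialogState) -> str:
        # Placeholder for an answer grounded in the full video.
        return f"answer_to({question})"

def cooperative_dialog(q_bot: QBot, a_bot: ABot, num_rounds: int = 5) -> str:
    state = DialogState()
    for _ in range(num_rounds):        # finite number of question opportunities
        q = q_bot.ask(state)
        a = a_bot.answer(q, state)
        state.history.append((q, a))   # dialog supplements the missing visuals
    return q_bot.describe(state)

In the discriminative setting, ask() and answer() would instead score and select from a set of candidate questions and answers rather than generating them freely.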