Saying the Unseen: Video Descriptions via Dialog Agents.

Publication information

IEEE Trans Pattern Anal Mach Intell. 2022 Oct;44(10):7190-7204. doi: 10.1109/TPAMI.2021.3093360. Epub 2022 Sep 14.

DOI:10.1109/TPAMI.2021.3093360

Abstract

Current vision and language tasks usually take complete visual data (e.g., raw images or videos) as input. In practical scenarios, however, part of the visual information may be inaccessible for various reasons, e.g., a restricted view from a fixed camera or intentional blocking of vision for security reasons. As a step towards more practical application scenarios, we introduce a novel task that aims to describe a video given incomplete visual data, using the natural language dialog between two agents as a supplementary information source. Unlike most existing vision-language tasks, in which AI systems have full access to images or video clips that may reveal sensitive information such as recognizable human faces or voices, we intentionally limit the visual input for AI systems and seek a more secure and transparent information medium, i.e., natural language dialog, to supplement the missing visual information. Specifically, one of the intelligent agents, Q-BOT, is given two semantically segmented frames from the beginning and the end of the video, as well as a finite number of opportunities to ask relevant natural language questions before describing the unseen video. A-BOT, the other agent, has access to the entire video and assists Q-BOT in accomplishing the goal by answering the questions asked. We introduce two different experimental settings with either a generative (i.e., agents generate questions and answers freely) or a discriminative (i.e., agents select the questions and answers from candidates) internal dialog generation process. With the proposed unified QA-Cooperative networks, we experimentally demonstrate the knowledge transfer process between the two dialog agents and the effectiveness of using natural language dialog as a supplement for incomplete implicit visions.
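
The interaction described in the abstract can be pictured as a fixed-length question-answer loop followed by a description step. The following is only a minimal sketch of that protocol in Python, not the authors' QA-Cooperative networks: the names `QBotState`, `run_dialog`, `toy_ask`, `toy_answer`, and `toy_describe`, along with the candidate questions and video facts, are hypothetical placeholders chosen for illustration. The toy question picker loosely mirrors the discriminative setting, where questions are selected from candidates.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class QBotState:
    """What Q-BOT can see: two segmented frames plus the dialog so far."""
    first_frame: str  # placeholder for the segmented first frame
    last_frame: str   # placeholder for the segmented last frame
    history: List[Tuple[str, str]] = field(default_factory=list)


def run_dialog(
    state: QBotState,
    ask: Callable[[QBotState], str],
    answer: Callable[[str], str],
    describe: Callable[[QBotState], str],
    num_rounds: int,
) -> str:
    """Run a fixed number of question-answer rounds, then describe the video."""
    for _ in range(num_rounds):
        question = ask(state)    # Q-BOT asks, using only its limited view
        reply = answer(question)  # A-BOT answers with access to the full video
        state.history.append((question, reply))
    return describe(state)


# Hypothetical toy data standing in for learned models and a real video.
CANDIDATE_QUESTIONS: List[str] = ["who is in the video?", "what are they doing?"]
VIDEO_FACTS: Dict[str, str] = {
    "who is in the video?": "a man",
    "what are they doing?": "washing dishes",
}


def toy_ask(state: QBotState) -> str:
    """Pick the first candidate question that has not been asked yet."""
    asked = {q for q, _ in state.history}
    remaining = [q for q in CANDIDATE_QUESTIONS if q not in asked]
    return remaining[0] if remaining else CANDIDATE_QUESTIONS[0]


def toy_answer(question: str) -> str:
    """Look the answer up in the facts only A-BOT has access to."""
    return VIDEO_FACTS.get(question, "unknown")


def toy_describe(state: QBotState) -> str:
    """Summarize the collected answers as a stand-in for a video description."""
    return "; ".join(reply for _, reply in state.history)


state = QBotState(first_frame="seg_first", last_frame="seg_last")
description = run_dialog(state, toy_ask, toy_answer, toy_describe, num_rounds=2)
print(description)
```

In this sketch the only channel from the full video to Q-BOT is the text of the answers, which is the property the task is built around; in a generative setting, `toy_ask` and `toy_answer` would instead produce free-form text.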

Similar articles

Saying the Unseen: Video Descriptions via Dialog Agents.

IEEE Trans Pattern Anal Mach Intell. 2022 Oct;44(10):7190-7204. doi: 10.1109/TPAMI.2021.3093360. Epub 2022 Sep 14.

SemBioNLQA: A semantic biomedical question answering system for retrieving exact and ideal answers to natural language questions.

Artif Intell Med. 2020 Jan;102:101767. doi: 10.1016/j.artmed.2019.101767. Epub 2019 Nov 28.

Knowledge graph assisted end-to-end medical dialog generation.

Artif Intell Med. 2023 May;139:102535. doi: 10.1016/j.artmed.2023.102535. Epub 2023 Mar 23.

A dataset for medical instructional video classification and question answering.

Sci Data. 2023 Mar 22;10(1):158. doi: 10.1038/s41597-023-02036-y.

Visual Dialog.

IEEE Trans Pattern Anal Mach Intell. 2018 Apr 19. doi: 10.1109/TPAMI.2018.2828437.

Compositional Attention Networks with Two-Stream Fusion for Video Question Answering.

IEEE Trans Image Process. 2019 Sep 16. doi: 10.1109/TIP.2019.2940677.

An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition.

BMC Bioinformatics. 2015 Apr 30;16:138. doi: 10.1186/s12859-015-0564-6.

Adaptive Spatio-Temporal Graph Enhanced Vision-Language Representation for Video QA.

IEEE Trans Image Process. 2021;30:5477-5489. doi: 10.1109/TIP.2021.3076556. Epub 2021 Jun 11.

Concept based auto-assignment of healthcare questions to domain experts in online Q&A communities.

Int J Med Inform. 2020 May;137:104108. doi: 10.1016/j.ijmedinf.2020.104108. Epub 2020 Mar 6.

Talk-to-Edit: Fine-Grained 2D and 3D Facial Editing via Dialog.

IEEE Trans Pattern Anal Mach Intell. 2024 May;46(5):3692-3706. doi: 10.1109/TPAMI.2023.3347299. Epub 2024 Apr 3.

PMID:34185637
