Das Abhishek, Kottur Satwik, Gupta Khushi, Singh Avi, Yadav Deshraj, Lee Stefan, Moura Jose, Parikh Devi, Batra Dhruv
IEEE Trans Pattern Anal Mach Intell. 2018 Apr 19. doi: 10.1109/TPAMI.2018.2828437.
We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from the history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task to serve as a general test of machine intelligence, while being sufficiently grounded in vision to allow objective evaluation of individual responses and to benchmark progress. We develop a novel two-person real-time chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). VisDial v0.9 has been released and consists of dialog question-answer pairs from 10-round, human-human dialogs grounded in images from the COCO dataset.
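The task inputs described in the abstract (an image, a dialog history, and a follow-up question) can be pictured as a small data structure. The following is a minimal sketch only: the class names, field names, and the helper `queries_from_dialog` are hypothetical illustrations and do not reflect the actual VisDial file format or the authors' code. It assumes that the history at a given round is simply the question-answer pairs of all earlier rounds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DialogRound:
    """One question-answer exchange (hypothetical structure)."""
    question: str
    answer: str


@dataclass
class VisualDialogQuery:
    """What the agent sees when asked to answer at one round."""
    image_id: str               # identifier of the image (assumed string id)
    history: List[DialogRound]  # all earlier rounds of the same dialog
    question: str               # the question to be answered now


def queries_from_dialog(image_id: str, rounds: List[DialogRound]) -> List[VisualDialogQuery]:
    """Expand one dialog into one query per round.

    Round t gets rounds 0..t-1 as history, so the first round has
    an empty history (an assumption made for this sketch).
    """
    return [
        VisualDialogQuery(image_id, list(rounds[:t]), rounds[t].question)
        for t in range(len(rounds))
    ]


# Illustrative 10-round dialog with placeholder text.
rounds = [DialogRound(f"question {i}", f"answer {i}") for i in range(10)]
queries = queries_from_dialog("example_image", rounds)
```

Under these assumptions, a 10-round dialog yields ten queries, each paired with the ground-truth answer of its round for evaluation.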