Towards Visual-Prompt Temporal Answer Grounding in Instructional Video.

Author Information

Li Shutao, Li Bin, Sun Bin, Weng Yixuan

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8836-8853. doi: 10.1109/TPAMI.2024.3411045. Epub 2024 Nov 6.

Abstract

Temporal answer grounding in instructional video (TAGV) is a new task naturally derived from temporal sentence grounding in general video (TSGV). Given an untrimmed instructional video and a text question, the task aims to locate the frame span in the video that semantically answers the question, i.e., the visual answer. Existing methods tend to solve the TAGV problem with a visual span-based predictor, which uses visual information to predict the start and end frames in the video. However, due to the weak correlation between the semantic features of the textual question and the visual answer, current methods built on the visual span-based predictor do not perform well on the TAGV task. In this paper, we propose a visual-prompt text span localization (VPTSL) method, which introduces timestamped subtitles for a text span-based predictor. Specifically, the visual prompt is a learnable feature embedding that brings visual knowledge into the pre-trained language model. Meanwhile, the text span-based predictor learns joint semantic representations from the input text question, the video subtitles, and the visual prompt features with the pre-trained language model. TAGV is thus reformulated as visual-prompt subtitle span localization for the visual answer. Extensive experiments on five instructional video datasets, namely MedVidQA, TutorialVQA, VehicleVQA, CrossTalk, and Coin, show that the proposed method outperforms several state-of-the-art (SOTA) methods by a large margin in terms of mIoU score, demonstrating the effectiveness of the proposed visual prompt and text span-based predictor.
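The abstract describes two components: learnable visual prompt embeddings that inject video knowledge into a pre-trained language model, and a text span-based predictor that scores start/end positions over the subtitle tokens. The following is a minimal PyTorch sketch of that idea, not the authors' released implementation; the class name, the attention-pooling step, and all hyperparameters (`num_prompts`, `visual_dim`, the choice of `bert-base-uncased`) are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class VisualPromptSpanPredictor(nn.Module):
    """Sketch of a text span-based predictor with learnable visual prompts.

    Assumed design for illustration; the pooling mechanism and sizes are
    not taken from the paper.
    """

    def __init__(self, lm_name="bert-base-uncased", num_prompts=8, visual_dim=1024):
        super().__init__()
        self.lm = AutoModel.from_pretrained(lm_name)
        hidden = self.lm.config.hidden_size
        # Project per-frame features from a video encoder into the LM space.
        self.visual_proj = nn.Linear(visual_dim, hidden)
        # Learnable queries that pool visual knowledge into prompt embeddings.
        self.prompt_queries = nn.Parameter(torch.randn(num_prompts, hidden) * 0.02)
        # Span head: start/end logits over subtitle tokens.
        self.span_head = nn.Linear(hidden, 2)

    def forward(self, input_ids, attention_mask, visual_feats):
        # input_ids: (B, L) tokenized "question [SEP] timestamped subtitles"
        # visual_feats: (B, T, visual_dim) per-frame video features
        v = self.visual_proj(visual_feats)                              # (B, T, H)
        q = self.prompt_queries.unsqueeze(0).expand(v.size(0), -1, -1)  # (B, P, H)
        # Attention-pool the video features into P visual prompt embeddings.
        attn = torch.softmax(q @ v.transpose(1, 2) / v.size(-1) ** 0.5, dim=-1)
        prompts = attn @ v                                              # (B, P, H)
        # Prepend the visual prompts to the LM's token embeddings.
        tok = self.lm.get_input_embeddings()(input_ids)                 # (B, L, H)
        mask = torch.cat(
            [torch.ones(prompts.shape[:2], dtype=attention_mask.dtype,
                        device=attention_mask.device), attention_mask], dim=1)
        out = self.lm(inputs_embeds=torch.cat([prompts, tok], dim=1),
                      attention_mask=mask)
        # Score only the text positions, dropping the prompt slots.
        logits = self.span_head(out.last_hidden_state[:, prompts.size(1):])
        start_logits, end_logits = logits.unbind(dim=-1)                # (B, L) each
        return start_logits, end_logits
```

At inference, the argmax start and end positions select a span of subtitle tokens; because each subtitle carries timestamps, that text span maps directly back to the start and end frames of the visual answer, which is what makes the subtitle-span reformulation of TAGV work.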

