Towards Visual-Prompt Temporal Answer Grounding in Instructional Video.

Author information

Li Shutao, Li Bin, Sun Bin, Weng Yixuan

Publication information

IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8836-8853. doi: 10.1109/TPAMI.2024.3411045. Epub 2024 Nov 6.

DOI: 10.1109/TPAMI.2024.3411045
PMID: 38848233
Abstract

Temporal answer grounding in instructional video (TAGV) is a new task naturally derived from temporal sentence grounding in general video (TSGV). Given an untrimmed instructional video and a text question, this task aims at locating the frame span from the video that can semantically answer the question, i.e., visual answer. Existing methods tend to solve the TAGV problem with a visual span-based predictor, taking visual information to predict the start and end frames in the video. However, due to the weak correlations between the semantic features of the textual question and visual answer, current methods using the visual span-based predictor do not work well in the TAGV task. In this paper, we propose a visual-prompt text span localization (VPTSL) method, which introduces the timestamped subtitles for a text span-based predictor. Specifically, the visual prompt is a learnable feature embedding, which brings visual knowledge to the pre-trained language model. Meanwhile, the text span-based predictor learns joint semantic representations from the input text question, video subtitles, and visual prompt feature with the pre-trained language model. Thus, the TAGV is reformulated as the task of the visual-prompt subtitle span localization for the visual answer. Extensive experiments on five instructional video datasets, namely MedVidQA, TutorialVQA, VehicleVQA, CrossTalk and Coin, show that the proposed method outperforms several state-of-the-art (SOTA) methods by a large margin in terms of mIoU score, which demonstrates the effectiveness of the proposed visual prompt and text span-based predictor.
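The decoding step implied by the abstract (a text span-based predictor scores each timestamped subtitle token as a possible start or end, and the chosen token span is mapped back to video time to give the visual answer) can be sketched as follows. This is a minimal illustrative sketch: `locate_answer_span`, its arguments, and the example scores are hypothetical, not the authors' released code.

```python
def locate_answer_span(start_scores, end_scores, token_times):
    """Pick the subtitle-token span (s, e), with s <= e, that maximizes
    start_scores[s] + end_scores[e], then map it back to video time.

    token_times[i] = (t_start, t_end) of subtitle token i, in seconds.
    Hypothetical helper showing the span-to-timestamp mapping only.
    """
    best = float("-inf")
    best_span = (0, 0)
    # Track the best start score seen at or before each candidate end index,
    # which enforces s <= e in a single linear pass.
    best_start, best_start_idx = start_scores[0], 0
    for e in range(len(end_scores)):
        if start_scores[e] > best_start:
            best_start, best_start_idx = start_scores[e], e
        if best_start + end_scores[e] > best:
            best = best_start + end_scores[e]
            best_span = (best_start_idx, e)
    s, e = best_span
    return token_times[s][0], token_times[e][1]

# Toy example: four subtitle tokens covering 0-5 s of video.
start_scores = [0.1, 2.0, 0.3, 0.0]
end_scores = [0.0, 0.2, 1.5, 0.4]
token_times = [(0.0, 1.0), (1.0, 2.5), (2.5, 4.0), (4.0, 5.0)]
print(locate_answer_span(start_scores, end_scores, token_times))  # (1.0, 4.0)
```

The linear scan keeps the running best start index so the predicted end can never precede the start, the usual constraint in span-based question answering.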

Similar articles

1. Towards Visual-Prompt Temporal Answer Grounding in Instructional Video.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8836-8853. doi: 10.1109/TPAMI.2024.3411045. Epub 2024 Nov 6.
2. Text-Based Localization of Moments in a Video Corpus.
IEEE Trans Image Process. 2021;30:8886-8899. doi: 10.1109/TIP.2021.3120038. Epub 2021 Oct 28.
3. Temporal Sentence Grounding in Videos: A Survey and Future Directions.
IEEE Trans Pattern Anal Mach Intell. 2023 Aug;45(8):10443-10465. doi: 10.1109/TPAMI.2023.3258628. Epub 2023 Jun 30.
4. Event-Oriented State Alignment Network for Weakly Supervised Temporal Language Grounding.
Entropy (Basel). 2024 Aug 27;26(9):730. doi: 10.3390/e26090730.
5. Natural Language Video Localization: A Revisit in Span-Based Question Answering Framework.
IEEE Trans Pattern Anal Mach Intell. 2022 Aug;44(8):4252-4266. doi: 10.1109/TPAMI.2021.3060449. Epub 2022 Jul 1.
6. Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding.
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12601-12617. doi: 10.1109/TPAMI.2023.3274139. Epub 2023 Sep 5.
7. Dual modality prompt learning for visual question-grounded answering in robotic surgery.
Vis Comput Ind Biomed Art. 2024 Apr 22;7(1):9. doi: 10.1186/s42492-024-00160-z.
8. SDN: Semantic Decoupling Network for Temporal Language Grounding.
IEEE Trans Neural Netw Learn Syst. 2024 May;35(5):6598-6612. doi: 10.1109/TNNLS.2022.3211850. Epub 2024 May 2.
9. A dataset for medical instructional video classification and question answering.
Sci Data. 2023 Mar 22;10(1):158. doi: 10.1038/s41597-023-02036-y.
10. Local Correspondence Network for Weakly Supervised Temporal Sentence Grounding.
IEEE Trans Image Process. 2021;30:3252-3262. doi: 10.1109/TIP.2021.3058614. Epub 2021 Mar 2.