IEEE Trans Pattern Anal Mach Intell. 2023 Aug;45(8):10443-10465. doi: 10.1109/TPAMI.2023.3258628. Epub 2023 Jun 30.
Temporal sentence grounding in videos (TSGV), also known as natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve the temporal moment in an untrimmed video that semantically corresponds to a language query. Connecting computer vision and natural language, TSGV has drawn significant attention from researchers in both communities. This survey summarizes the fundamental concepts of TSGV, the current state of research, and future research directions. As background, we present, in tutorial style, a common structure of the functional components in TSGV: from feature extraction from the raw video and the language query to answer prediction of the target moment. We then review techniques for multimodal understanding and interaction, the key focus of TSGV, which enable effective alignment between the two modalities. We construct a taxonomy of TSGV techniques and elaborate on the methods in each category, along with their strengths and weaknesses. Lastly, we discuss issues with current TSGV research and share our insights about promising research directions.
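To make the task definition concrete, the following is a minimal sketch of the input and output of a TSGV system, not a method taken from the survey. It assumes that per-clip video features and a query feature have already been extracted into a shared embedding space, and it swaps in a deliberately naive baseline: every contiguous window of clips is mean-pooled and scored against the query by cosine similarity, and the best-scoring window is returned as the predicted moment. The names `ground_query`, `clip_len_s`, and `max_clips` are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pool(clips: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty list of clip feature vectors into one vector."""
    n = len(clips)
    return [sum(column) / n for column in zip(*clips)]


def ground_query(
    clip_feats: Sequence[Sequence[float]],
    query_feat: Sequence[float],
    clip_len_s: float = 1.0,
    max_clips: Optional[int] = None,
) -> Tuple[float, float, float]:
    """Return (start_s, end_s, score) of the best-matching contiguous window.

    Illustrative assumption: video and query features live in the same space,
    so a plain cosine similarity is a meaningful matching score.
    """
    n = len(clip_feats)
    if n == 0:
        raise ValueError("clip_feats must contain at least one clip")
    limit = n if max_clips is None else max_clips

    best_score = -math.inf
    best_start, best_end = 0, 1
    # Enumerate every candidate moment [start, end) measured in clips.
    for start in range(n):
        for end in range(start + 1, min(n, start + limit) + 1):
            score = cosine(mean_pool(clip_feats[start:end]), query_feat)
            if score > best_score:
                best_score, best_start, best_end = score, start, end

    # Convert clip indices to timestamps in seconds.
    return best_start * clip_len_s, best_end * clip_len_s, best_score


if __name__ == "__main__":
    # Toy 2-D features: the query aligns best with clips 1 and 2 taken together.
    video = [[1.0, 0.0], [0.2, 1.0], [-0.2, 1.0], [1.0, 0.0]]
    query = [0.0, 1.0]
    print(ground_query(video, query, clip_len_s=2.0))
```

The exhaustive window enumeration is quadratic in the number of clips and is used here only because it keeps the task's contract visible: an untrimmed video and a sentence go in, a start and end timestamp come out.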