

Temporal Sentence Grounding in Videos: A Survey and Future Directions.

Publication Info

IEEE Trans Pattern Anal Mach Intell. 2023 Aug;45(8):10443-10465. doi: 10.1109/TPAMI.2023.3258628. Epub 2023 Jun 30.

DOI: 10.1109/TPAMI.2023.3258628
PMID: 37030852
Abstract

Temporal sentence grounding in videos (TSGV), a.k.a., natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that semantically corresponds to a language query from an untrimmed video. Connecting computer vision and natural language, TSGV has drawn significant attention from researchers in both communities. This survey attempts to provide a summary of fundamental concepts in TSGV and current research status, as well as future research directions. As the background, we present a common structure of functional components in TSGV, in a tutorial style: from feature extraction from raw video and language query, to answer prediction of the target moment. Then we review the techniques for multimodal understanding and interaction, which is the key focus of TSGV for effective alignment between the two modalities. We construct a taxonomy of TSGV techniques and elaborate the methods in different categories with their strengths and weaknesses. Lastly, we discuss issues with the current TSGV research and share our insights about promising research directions.
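The component pipeline the abstract outlines (feature extraction from video and query, cross-modal matching, then answer prediction of the target moment) can be sketched minimally. The sketch below is an illustrative toy only, not any method from the survey: the hand-made frame/query vectors, the cosine matcher, and the threshold-based span head are all invented for the example; a real system would use a video backbone and a text encoder for the features.

```python
def cosine(a, b):
    """Cosine similarity between two plain-list feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def ground_moment(frame_feats, query_feat, threshold=0.7):
    """Score each frame against the query, then return the longest
    contiguous run of frames scoring above the threshold, as
    (start, end) with end exclusive -- a toy stand-in for the
    'answer prediction' stage the survey describes."""
    scores = [cosine(f, query_feat) for f in frame_feats]
    best_span, start = None, None
    # Sentinel -inf closes any run still open at the final frame.
    for i, s in enumerate(scores + [float("-inf")]):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            if best_span is None or i - start > best_span[1] - best_span[0]:
                best_span = (start, i)
            start = None
    return best_span, scores

# Toy "untrimmed video" of 6 frames; the query matches frames 2-4.
frames = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1],
          [0.95, 0.05], [0.9, 0.2], [0.1, 0.9]]
query = [1.0, 0.0]
span, scores = ground_moment(frames, query)
print(span)  # → (2, 5): frames 2..4 inclusive
```

In practice the matching stage is the interesting part: surveyed methods replace this fixed cosine score with learned multimodal interaction (e.g. attention) between the two feature sequences, which is the alignment problem the survey's taxonomy is organized around.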


Similar Articles

1. Temporal Sentence Grounding in Videos: A Survey and Future Directions.
   IEEE Trans Pattern Anal Mach Intell. 2023 Aug;45(8):10443-10465. doi: 10.1109/TPAMI.2023.3258628. Epub 2023 Jun 30.
2. Natural Language Video Localization: A Revisit in Span-Based Question Answering Framework.
   IEEE Trans Pattern Anal Mach Intell. 2022 Aug;44(8):4252-4266. doi: 10.1109/TPAMI.2021.3060449. Epub 2022 Jul 1.
3. Text-Based Localization of Moments in a Video Corpus.
   IEEE Trans Image Process. 2021;30:8886-8899. doi: 10.1109/TIP.2021.3120038. Epub 2021 Oct 28.
4. Towards Visual-Prompt Temporal Answer Grounding in Instructional Video.
   IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8836-8853. doi: 10.1109/TPAMI.2024.3411045. Epub 2024 Nov 6.
5. Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos.
   IEEE Trans Image Process. 2021;30:8265-8277. doi: 10.1109/TIP.2021.3113791. Epub 2021 Sep 30.
6. Event-Oriented State Alignment Network for Weakly Supervised Temporal Language Grounding.
   Entropy (Basel). 2024 Aug 27;26(9):730. doi: 10.3390/e26090730.
7. Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos.
   IEEE Trans Pattern Anal Mach Intell. 2022 May;44(5):2725-2741. doi: 10.1109/TPAMI.2020.3038993. Epub 2022 Apr 1.
8. Zero-Shot Video Grounding With Pseudo Query Lookup and Verification.
   IEEE Trans Image Process. 2024;33:1643-1654. doi: 10.1109/TIP.2024.3365249. Epub 2024 Feb 27.
9. Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding.
   IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12601-12617. doi: 10.1109/TPAMI.2023.3274139. Epub 2023 Sep 5.
10. M2DCapsN: Multimodal, Multichannel, and Dual-Step Capsule Network for Natural Language Moment Localization.
    IEEE Trans Neural Netw Learn Syst. 2024 Aug;35(8):11448-11462. doi: 10.1109/TNNLS.2023.3261927. Epub 2024 Aug 5.