
Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding

Authors

Li Juncheng, Tang Siliang, Zhu Linchao, Zhang Wenqiao, Yang Yi, Chua Tat-Seng, Wu Fei, Zhuang Yueting

Publication Info

IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12601-12617. doi: 10.1109/TPAMI.2023.3274139. Epub 2023 Sep 5.

DOI: 10.1109/TPAMI.2023.3274139
PMID: 37155378
Abstract

Temporal grounding is the task of locating a specific segment in an untrimmed video according to a query sentence. The task has gained significant traction in the computer vision community because it enables activity grounding beyond pre-defined activity classes by exploiting the semantic diversity of natural language descriptions. This diversity is rooted in the linguistic principle of compositionality: novel semantics can be systematically described by combining known words in novel ways (compositional generalization). However, existing temporal grounding datasets are not designed to evaluate compositional generalizability. To benchmark it systematically, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, Charades-CG and ActivityNet-CG. We empirically find that existing temporal grounding models fail to generalize to queries containing novel combinations of seen words. We argue that the compositional structure inherent in videos and language (i.e., the composition constituents and their relationships) is the crucial factor for achieving compositional generalization. Based on this insight, we propose a variational cross-graph reasoning framework that explicitly decomposes video and language into hierarchical semantic graphs and learns fine-grained semantic correspondence between the two graphs. We further introduce a novel adaptive structured semantics learning approach that derives structure-informed, domain-generalizable graph representations, which facilitate the fine-grained correspondence reasoning between the two graphs. To further probe understanding of compositional structure, we also introduce a more challenging setting in which one component of the novel composition is unseen. This requires a more sophisticated understanding of compositional structure: the model must infer the potential semantics of the unseen word from the other learned composition constituents appearing in the video and language context, and from their relationships. Extensive experiments validate the superior compositional generalizability of our approach, demonstrating its ability to handle queries with novel combinations of seen words as well as novel words in the testing composition.
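The compositional re-splitting idea described above can be illustrated with a minimal sketch (hypothetical data and helper names, not the authors' released code): test queries are restricted to words that all appear in training, but with at least one verb-noun pairing never observed together during training.

```python
from itertools import chain

def compositional_split(annotations, held_out_pairs):
    """Partition query annotations by (verb, noun) composition.

    annotations:     list of dicts with 'query', 'verb', 'noun' fields.
    held_out_pairs:  set of (verb, noun) combinations reserved for testing.

    A query goes to the test split only if its composition is held out
    AND both of its words individually occur in the training split, so
    the test set probes novel combinations of *seen* words.
    """
    train = [a for a in annotations
             if (a["verb"], a["noun"]) not in held_out_pairs]
    seen = set(chain.from_iterable((a["verb"], a["noun"]) for a in train))
    test = [a for a in annotations
            if (a["verb"], a["noun"]) in held_out_pairs
            and a["verb"] in seen and a["noun"] in seen]
    return train, test

annotations = [
    {"query": "person opens a door",    "verb": "open",  "noun": "door"},
    {"query": "person closes a window", "verb": "close", "noun": "window"},
    {"query": "person opens a window",  "verb": "open",  "noun": "window"},
]
# "open" and "window" are each seen in training, but never together.
train, test = compositional_split(annotations, {("open", "window")})
```

In this toy case, "person opens a window" lands in the test split because both "open" and "window" occur in training queries, just never as a pair, which is exactly the condition the Charades-CG and ActivityNet-CG splits are built around.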


Similar Articles

1
Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding.
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12601-12617. doi: 10.1109/TPAMI.2023.3274139. Epub 2023 Sep 5.
2
Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos.
IEEE Trans Image Process. 2021;30:8265-8277. doi: 10.1109/TIP.2021.3113791. Epub 2021 Sep 30.
3
SDN: Semantic Decoupling Network for Temporal Language Grounding.
IEEE Trans Neural Netw Learn Syst. 2024 May;35(5):6598-6612. doi: 10.1109/TNNLS.2022.3211850. Epub 2024 May 2.
4
Event-Oriented State Alignment Network for Weakly Supervised Temporal Language Grounding.
Entropy (Basel). 2024 Aug 27;26(9):730. doi: 10.3390/e26090730.
5
Text-Based Localization of Moments in a Video Corpus.
IEEE Trans Image Process. 2021;30:8886-8899. doi: 10.1109/TIP.2021.3120038. Epub 2021 Oct 28.
6
Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos.
IEEE Trans Pattern Anal Mach Intell. 2022 May;44(5):2725-2741. doi: 10.1109/TPAMI.2020.3038993. Epub 2022 Apr 1.
7
Moment Retrieval via Cross-Modal Interaction Networks with Query Reconstruction.
IEEE Trans Image Process. 2020 Jan 17. doi: 10.1109/TIP.2020.2965987.
8
Path-based knowledge reasoning with textual semantic information for medical knowledge graph completion.
BMC Med Inform Decis Mak. 2021 Nov 29;21(Suppl 9):335. doi: 10.1186/s12911-021-01622-7.
9
Interactive Natural Language Grounding via Referring Expression Comprehension and Scene Graph Parsing.
Front Neurorobot. 2020 Jun 25;14:43. doi: 10.3389/fnbot.2020.00043. eCollection 2020.
10
Event Graph Guided Compositional Spatial-Temporal Reasoning for Video Question Answering.
IEEE Trans Image Process. 2024;33:1109-1121. doi: 10.1109/TIP.2024.3358726. Epub 2024 Feb 5.