Li Juncheng, Tang Siliang, Zhu Linchao, Zhang Wenqiao, Yang Yi, Chua Tat-Seng, Wu Fei, Zhuang Yueting
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):12601-12617. doi: 10.1109/TPAMI.2023.3274139. Epub 2023 Sep 5.
Temporal grounding is the task of locating a specific segment in an untrimmed video according to a query sentence. This task has gained significant momentum in the computer vision community, as it enables activity grounding beyond pre-defined activity classes by exploiting the semantic diversity of natural language descriptions. This semantic diversity is rooted in the principle of compositionality in linguistics: novel semantics can be systematically described by combining known words in novel ways (compositional generalization). However, existing temporal grounding datasets are not carefully designed to evaluate compositional generalizability. To systematically benchmark the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG. We empirically find that existing temporal grounding models fail to generalize to queries with novel combinations of seen words. We argue that the inherent compositional structure (i.e., the composition constituents and their relationships) inside the videos and language is the crucial factor in achieving compositional generalization. Based on this insight, we propose a variational cross-graph reasoning framework that explicitly decomposes video and language into hierarchical semantic graphs and learns fine-grained semantic correspondence between the two graphs. Meanwhile, we introduce a novel adaptive structured semantics learning approach to derive structure-informed and domain-generalizable graph representations, which facilitate the fine-grained semantic correspondence reasoning between the two graphs. To further evaluate the understanding of compositional structure, we also introduce a more challenging setting in which one of the components in the novel composition is unseen.
This requires a more sophisticated understanding of the compositional structure, in order to infer the potential semantics of the unseen word from the other learned composition constituents appearing in both the video and the language context, and from their relationships. Extensive experiments validate the superior compositional generalizability of our approach, demonstrating its ability to handle queries with novel combinations of seen words as well as novel words in the testing composition.
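To make the notion of compositional generalization concrete, the following toy sketch builds a minimal "hierarchical semantic graph" for a query and checks that a held-out query recombines only constituents seen during training. This is purely illustrative: the node levels, the rule-based decomposition, and the `build_query_graph` helper are assumptions for this example, not the paper's actual framework, which learns the decomposition and the cross-graph correspondence.

```python
# Illustrative sketch only (NOT the authors' implementation): a toy
# three-level semantic graph (event -> action -> entity) for a query,
# built with a hand-written rule instead of a learned decomposition.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str           # word or phrase at this node
    level: str           # "event", "action", or "entity"
    children: list = field(default_factory=list)

def build_query_graph(verb: str, obj: str) -> Node:
    """Compose a query into a three-level hierarchy."""
    entity = Node(obj, "entity")
    action = Node(verb, "action", [entity])
    return Node(f"{verb} {obj}", "event", [action])

def constituents(node: Node) -> set:
    """Collect all (level, label) pairs in the graph."""
    out = {(node.level, node.label)}
    for child in node.children:
        out |= constituents(child)
    return out

# Two training-time compositions ...
g1 = build_query_graph("open", "door")
g2 = build_query_graph("close", "window")
# ... and a novel recombination of seen words (the compositional split):
g3 = build_query_graph("open", "window")

seen = constituents(g1) | constituents(g2)
# Every leaf constituent of the novel query was seen in training;
# only their composition ("open window" as an event) is new.
print(("action", "open") in seen)          # True
print(("entity", "window") in seen)        # True
print(("event", "open window") in seen)    # False
```

The compositional splits Charades-CG and ActivityNet-CG operationalize exactly this kind of separation at the dataset level: test queries whose words are individually seen but whose combinations are not.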