Zhang Zongmeng, Han Xianjing, Song Xuemeng, Yan Yan, Nie Liqiang
IEEE Trans Image Process. 2021;30:8265-8277. doi: 10.1109/TIP.2021.3113791. Epub 2021 Sep 30.
This paper focuses on the problem of temporal language localization in videos, which aims to identify the start and end points of the moment described by a natural language sentence in an untrimmed video. The task is non-trivial, since it requires not only a comprehensive understanding of the video and the sentence query, but also accurate capture of the semantic correspondence between them. Existing efforts mainly explore the sequential relations among video clips and among query words to reason over the video and the sentence query, neglecting other intra-modal relations (e.g., the semantic similarity among video clips and the syntactic dependencies among query words). Towards this end, we propose a Multi-modal Interaction Graph Convolutional Network (MIGCN), which jointly exploits the complex intra-modal relations and inter-modal interactions residing in the video and the sentence query to facilitate their understanding and the capture of the semantic correspondence between them. In addition, we devise an adaptive context-aware localization method, in which context information is incorporated into the candidate moments and multi-scale fully connected layers are designed to rank the coarse candidate moments of different lengths and adjust their boundaries. Extensive experiments on the Charades-STA and ActivityNet datasets demonstrate the promising performance and superior efficiency of our model.
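To make the idea of a joint graph over both modalities concrete, the following is a minimal sketch (not the authors' implementation) of a single multi-modal graph-convolution step in the spirit of MIGCN: video clips and query words become nodes of one graph whose adjacency mixes intra-modal relations (clip-clip and word-word similarity) with inter-modal interactions (clip-word similarity). The feature dimensions, the helper `build_adjacency`, and the use of cosine similarity for all edges are illustrative assumptions.

```python
# Hedged sketch of a joint clip/word graph convolution; dimensions and the
# similarity-based adjacency are assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_adjacency(clip_feats, word_feats):
    """Hypothetical adjacency: non-negative cosine similarities within and across modalities."""
    nodes = torch.cat([clip_feats, word_feats], dim=0)            # (N_c + N_w, d)
    normed = F.normalize(nodes, dim=-1)
    adj = F.relu(normed @ normed.t())                              # dense similarity graph
    adj = adj + torch.eye(adj.size(0))                             # self-loops
    deg_inv_sqrt = adj.sum(-1).clamp(min=1e-6).pow(-0.5)
    return deg_inv_sqrt.unsqueeze(-1) * adj * deg_inv_sqrt.unsqueeze(0)


class MultiModalGCNLayer(nn.Module):
    """One graph-convolution layer over the joint clip/word graph."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, clip_feats, word_feats):
        n_clips = clip_feats.size(0)
        adj = build_adjacency(clip_feats, word_feats)
        nodes = torch.cat([clip_feats, word_feats], dim=0)
        updated = F.relu(adj @ self.proj(nodes))                   # message passing + transform
        return updated[:n_clips], updated[n_clips:]                # split back per modality


if __name__ == "__main__":
    clips = torch.randn(16, 256)   # 16 video clips, 256-d features (assumed)
    words = torch.randn(9, 256)    # 9 query words, 256-d features (assumed)
    layer = MultiModalGCNLayer(256)
    new_clips, new_words = layer(clips, words)
    print(new_clips.shape, new_words.shape)
```

In this sketch a single dense adjacency carries both intra- and inter-modal edges; the paper's candidate-moment ranking and boundary adjustment stages are not shown.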