Xu Zhe, Chen Da, Wei Kun, Deng Cheng, Xue Hui
IEEE Trans Image Process. 2022;31:5178-5188. doi: 10.1109/TIP.2022.3191841. Epub 2022 Aug 4.
Video Temporal Grounding (VTG) aims to locate the time interval in a video that is semantically relevant to a language query. Existing VTG methods let the query interact with entangled video features and treat the instances in a dataset independently; as a result, intra-video entanglement and inter-video connections are rarely considered, leading to mismatches between video and language. To this end, we propose a novel method, dubbed Hierarchically Semantic Associating (HiSA), which aims to precisely align video with language and obtain discriminative representations for subsequent location regression. Specifically, action factors and background factors are disentangled from adjacent video segments, enforcing precise multimodal interaction and alleviating intra-video entanglement. In addition, cross-guided contrast is carefully designed to capture inter-video connections, which benefits multimodal understanding when locating the time interval. Extensive experiments on three benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods. The project page is available at https://github.com/zhexu1997/HiSA.
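The abstract does not give the exact form of the cross-guided contrast; as a rough illustration of the inter-video contrastive idea only, the PyTorch sketch below pulls each video embedding toward its paired query and pushes it away from the queries of other instances in the batch. The function name, batch-level positives, temperature, and symmetric bidirectional loss are assumptions for this sketch, not the paper's actual HiSA formulation.

    import torch
    import torch.nn.functional as F

    def inter_video_contrastive_loss(video_emb, text_emb, temperature=0.07):
        # Illustrative InfoNCE-style loss over a batch of paired
        # video/query embeddings (assumed shape (B, D)). Each video is
        # attracted to its own query (diagonal positives) and repelled
        # from the queries of other instances, and symmetrically for
        # the query-to-video direction.
        video_emb = F.normalize(video_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # (B, B) cosine-similarity matrix scaled by temperature;
        # entry (i, j) compares video i with query j.
        logits = video_emb @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)

        # Symmetric cross-entropy over both matching directions.
        loss_v2t = F.cross_entropy(logits, targets)
        loss_t2v = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_v2t + loss_t2v)

    # Usage with dummy pooled embeddings (batch of 32, dimension 256):
    loss = inter_video_contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))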