Qi Mengshi, Qin Jie, Yang Yi, Wang Yunhong, Luo Jiebo
IEEE Trans Image Process. 2021;30:2989-3004. doi: 10.1109/TIP.2020.3048680. Epub 2021 Feb 18.
With the current exponential growth of video-based social networks, video retrieval using natural language is receiving ever-increasing attention. Most existing approaches tackle this task by extracting individual frame-level spatial features to represent the whole video, while ignoring visual pattern consistencies and intrinsic temporal relationships across different frames. Furthermore, the semantic correspondence between natural language queries and person-centric actions in videos has not been fully explored. To address these problems, we propose a novel binary representation learning framework, named Semantics-aware Spatial-temporal Binaries (S²Bin), which simultaneously considers spatial-temporal context and semantic relationships for cross-modal video retrieval. By exploiting the semantic relationships between the two modalities, S²Bin can efficiently and effectively generate binary codes for both videos and texts. In addition, we adopt an iterative optimization scheme to learn deep encoding functions with attribute-guided stochastic training. We evaluate our model on three video datasets, and the experimental results demonstrate that S²Bin outperforms state-of-the-art methods on various cross-modal video retrieval tasks.
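At a high level, the abstract describes a cross-modal hashing pipeline: videos and texts are encoded into a shared binary (Hamming) space, and retrieval ranks items by code distance. The PyTorch sketch below illustrates only that general setup, not the authors' S²Bin model; every name and dimension in it (HashEncoder, CODE_LENGTH, the 2048-d video and 300-d text features) is an assumption made for illustration.

    # Illustrative sketch of cross-modal hashing as described at a high level
    # in the abstract. This is NOT the authors' S²Bin architecture; all names
    # and dimensions here are assumptions.
    import torch
    import torch.nn as nn

    CODE_LENGTH = 64  # assumed binary code length

    class HashEncoder(nn.Module):
        """Maps one modality's features to [-1, 1]^k; sign() yields binary codes."""
        def __init__(self, in_dim: int, code_len: int = CODE_LENGTH):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, 512),
                nn.ReLU(),
                nn.Linear(512, code_len),
                nn.Tanh(),  # continuous relaxation of sign() used during training
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

        @torch.no_grad()
        def encode(self, x: torch.Tensor) -> torch.Tensor:
            # Binarize at retrieval time: sign of the relaxed output.
            return torch.sign(self.net(x))

    def hamming_distance(codes_a: torch.Tensor, codes_b: torch.Tensor) -> torch.Tensor:
        """Pairwise Hamming distances between {-1, +1}^k codes.
        For k-bit codes, d_H(a, b) = (k - <a, b>) / 2."""
        k = codes_a.size(1)
        return (k - codes_a @ codes_b.t()) / 2

    # Toy text-to-video retrieval: rank a gallery of video codes for one query.
    video_enc = HashEncoder(in_dim=2048)  # e.g. pooled CNN frame features (assumed)
    text_enc = HashEncoder(in_dim=300)    # e.g. sentence embedding (assumed)

    video_codes = video_enc.encode(torch.randn(100, 2048))  # gallery of 100 videos
    query_code = text_enc.encode(torch.randn(1, 300))       # one text query

    ranking = hamming_distance(query_code, video_codes).argsort(dim=1)
    print(ranking[0, :5])  # indices of the 5 nearest videos in Hamming space

The tanh layer is a standard continuous relaxation of the non-differentiable sign function, allowing gradients to flow during training; hard binarization is applied only when codes are stored and compared at retrieval time.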