Dual Encoding for Video Retrieval by Text

Authors

Dong Jianfeng, Li Xirong, Xu Chaoxi, Yang Xun, Yang Gang, Wang Xun, Wang Meng

Publication information

IEEE Trans Pattern Anal Mach Intell. 2022 Aug;44(8):4065-4080. doi: 10.1109/TPAMI.2021.3059295. Epub 2022 Jul 1.

Abstract

This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described exclusively in the form of a natural-language sentence, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is crucial. To that end, the two modalities need to be first encoded into real-valued vectors and then projected into a common space. In this paper we achieve this by proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding that represents the rich content of both modalities in a coarse-to-fine fashion. Second, different from a conventional common space learning algorithm which is either concept based or latent space based, we introduce hybrid space learning which combines the high performance of the latent space and the good interpretability of the concept space. Dual encoding is conceptually simple, practically effective and end-to-end trained with hybrid space learning. Extensive experiments on four challenging video datasets show the viability of the new method. Code and data are available at https://github.com/danieljf24/hybrid_space.
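The matching step the abstract describes (encode each modality into a real-valued vector, project both into a common space, compare them there) can be illustrated with a toy sketch. This is not the paper's network: the fixed matrices `W_TEXT` and `W_VIDEO` are hypothetical stand-ins for learned encoders, the helper names are invented for illustration, and neither multi-level encoding nor the concept space is modeled. Cosine similarity is assumed as the matching score.

```python
import math

def project(matrix, vector):
    """Apply a linear projection: (rows x cols) matrix times a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_videos(query_vec, video_vecs, w_text, w_video):
    """Project the query and every video into the common space, then sort by similarity."""
    q = project(w_text, query_vec)
    scored = [(name, cosine(q, project(w_video, v))) for name, v in video_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical projections: 3-d text features and 2-d video features into a 2-d common space.
W_TEXT = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
W_VIDEO = [[1.0, 0.0],
           [0.0, 1.0]]

# Toy feature vectors standing in for encoder outputs.
videos = {"video_a": [1.0, 0.0], "video_b": [0.0, 1.0]}
ranking = rank_videos([0.9, 0.1, 0.5], videos, W_TEXT, W_VIDEO)

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In the actual method the projections would be learned end to end; the sketch only shows how a shared space turns retrieval into a similarity ranking.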
