IEEE Trans Image Process. 2018 Jul;27(7):3210-3221. doi: 10.1109/TIP.2018.2814344.
Existing video hash functions are built on three isolated stages: frame pooling, relaxed learning, and binarization. This pipeline does not adequately explore the temporal order of video frames in a joint binary optimization model, resulting in severe information loss. In this paper, we propose a novel unsupervised video hashing framework, dubbed self-supervised video hashing (SSVH), which captures the temporal nature of videos in an end-to-end learning-to-hash fashion. We specifically address two central problems: 1) how to design an encoder-decoder architecture to generate binary codes for videos and 2) how to equip the binary codes with the ability to support accurate video retrieval. We design a hierarchical binary auto-encoder that models the temporal dependencies in videos at multiple granularities and embeds the videos into binary codes with fewer computations than a stacked architecture. We then encourage the binary codes to simultaneously reconstruct both the visual content and the neighborhood structure of the videos. Experiments on two real-world datasets show that SSVH significantly outperforms the state-of-the-art methods and achieves the current best performance on the task of unsupervised video retrieval.
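The retrieval side of learning-to-hash can be illustrated with a minimal sketch: real-valued embeddings (here hand-made values standing in for an encoder's output) are thresholded into binary codes, and lookup is done by Hamming distance. This is a hypothetical illustration of the general technique, not the authors' SSVH architecture; the function names and toy data are assumptions for the example.

```python
import numpy as np

def binarize(embeddings):
    """Threshold real-valued embeddings into {0, 1} binary codes."""
    return (np.asarray(embeddings) > 0).astype(np.uint8)

def hamming(query, database):
    """Hamming distance from one binary code to every database code."""
    return np.count_nonzero(query != database, axis=1)

# Toy database of three "videos", each with a 4-bit code.
db_embed = np.array([[-0.3,  0.8, -1.2,  0.5],
                     [ 0.7, -0.4,  0.9, -0.1],
                     [ 0.2,  0.6, -0.8,  0.3]])
db_codes = binarize(db_embed)            # [[0,1,0,1], [1,0,1,0], [1,1,0,1]]

# A query embedding close to video 2 yields the same code, so it ranks first.
query = binarize([0.1, 0.9, -0.5, 0.4])  # [1,1,0,1]
dists = hamming(query, db_codes)         # [1, 3, 0] -> nearest is video 2
print(dists.tolist(), int(np.argmin(dists)))
```

Because comparison and ranking reduce to XOR-and-popcount operations on compact codes, this style of retrieval scales to large video databases far better than nearest-neighbor search over real-valued features.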