Shanghai Engineering Research Center of Assistive Devices, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China.
School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China.
Sensors (Basel). 2021 Apr 29;21(9):3094. doi: 10.3390/s21093094.
Recently, with the popularization of camera tools such as mobile phones and the rise of various short-video platforms, large numbers of videos are uploaded to the Internet at all times, so a video retrieval system with fast retrieval speed and high precision is highly necessary. Content-based video retrieval (CBVR) has therefore aroused the interest of many researchers. A typical CBVR system mainly contains two essential parts: video feature extraction and similarity comparison. Feature extraction from video is very challenging; previous video retrieval methods are mostly based on extracting features from single video frames, which results in the loss of temporal information in the videos. Hashing methods are extensively used in multimedia information retrieval because of their retrieval efficiency, but most of them are currently applied only to image retrieval. To address these problems in video retrieval, we build an end-to-end framework called deep supervised video hashing (DSVH), which employs a 3D convolutional neural network (CNN) to obtain spatial-temporal features of videos and then trains a set of hash functions by supervised hashing to transfer the video features into binary space and obtain compact binary codes of the videos. Finally, we use a triplet loss for network training. We conduct extensive experiments on three public video datasets, UCF-101, JHMDB, and HMDB-51, and the results show that the proposed method has advantages over many state-of-the-art video retrieval methods. Compared with the DVH method, the mAP value on the UCF-101 dataset is improved by 9.3%, and even the smallest improvement, on the JHMDB dataset, is 0.3%. We also demonstrate the stability of the algorithm on the HMDB-51 dataset.
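To make the hashing pipeline described above concrete, the following is a minimal NumPy sketch of its two generic ingredients: a triplet loss that pulls same-class video embeddings together and pushes different-class embeddings apart, and sign thresholding that converts real-valued network outputs into compact binary codes compared by Hamming distance. This is an illustrative toy, not the paper's implementation; the 3D CNN, the learned hash functions, the margin value, and the 48-bit code length here are all assumptions for demonstration.

```python
import numpy as np

def triplet_hash_loss(anchor, positive, negative, margin=2.0):
    """Standard triplet loss on real-valued embeddings:
    max(0, d(anchor, positive) - d(anchor, negative) + margin),
    averaged over the batch (squared Euclidean distances)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

def binarize(embeddings):
    """Sign thresholding: real-valued outputs -> binary hash codes."""
    return (embeddings > 0).astype(np.uint8)

def hamming_distance(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.sum(a != b))

# Toy example: 48-bit codes for three "videos" (random stand-ins
# for 3D-CNN features; the positive is a perturbed copy of the anchor).
rng = np.random.default_rng(0)
anchor = rng.normal(size=(1, 48))
positive = anchor + rng.normal(scale=0.1, size=(1, 48))
negative = rng.normal(size=(1, 48))

loss = triplet_hash_loss(anchor, positive, negative)
codes = binarize(np.vstack([anchor, positive, negative]))
# A similar video should land at a smaller Hamming distance than a dissimilar one.
print(loss)
print(hamming_distance(codes[0], codes[1]), hamming_distance(codes[0], codes[2]))
```

In a full system, the loss would be backpropagated through the 3D CNN during training, and binarization would be applied only at indexing/query time, with retrieval ranked by Hamming distance over the stored codes.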