Choi Young Rok, Kil Rhee Man
IEEE Trans Image Process. 2021;30:1015-1029. doi: 10.1109/TIP.2020.3040847. Epub 2020 Dec 9.
This paper presents a novel framework for extracting highly compact and discriminative features for face video retrieval using a deep convolutional neural network (CNN). The face video retrieval task is to find, in a database, the videos that contain the face of a specific person, given a face image or a face video of that person as the query. A key challenge is to extract discriminative features that require little storage space from face videos with large intra-class variations caused by differences in angle, illumination, and facial expression. In recent years, CNN-based binary hashing and metric learning methods have shown notable progress in image and video retrieval tasks. However, existing CNN-based binary hashing suffers from inevitable information loss, and existing CNN-based metric learning suffers from storage inefficiency. To address these problems, the proposed framework consists of two parts. First, a novel loss function based on a radial basis function kernel (RBF Loss) is introduced to train a neural network to generate compact and discriminative high-level features. Second, an optimized quantization based on a logistic function (Logistic Quantization) is proposed to convert a real-valued feature to a 1-byte integer with minimal information loss. Face video retrieval experiments on a challenging TV series data set (ICT-TV) demonstrate that the proposed framework outperforms existing state-of-the-art feature extraction methods. Furthermore, the effectiveness of RBF Loss is also demonstrated through image classification and retrieval experiments on the CIFAR-10 and Fashion-MNIST data sets with LeNet-5.
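The abstract does not give the exact form of the quantizer, so the following is only a minimal sketch of the general idea of mapping a real-valued feature to a 1-byte integer through a logistic function. The function names and the `center` and `scale` parameters are assumptions made for illustration; the optimized parameters used in the paper are not stated here.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_quantize(x, center=0.0, scale=1.0):
    """Map a real value to an integer in [0, 255] via a logistic curve.

    `center` and `scale` are hypothetical parameters; in practice they
    would be chosen to fit the distribution of the feature values.
    """
    p = _sigmoid((x - center) / scale)
    return min(255, max(0, round(255 * p)))


def logistic_dequantize(q, center=0.0, scale=1.0):
    """Approximate inverse: recover a real value from the stored byte."""
    # Clamp away from 0 and 1 so the logit stays finite.
    p = min(max(q / 255.0, 0.5 / 255.0), 1.0 - 0.5 / 255.0)
    return center + scale * math.log(p / (1.0 - p))


# Illustrative feature vector (made-up values, not from the paper).
feature = [-2.3, -0.4, 0.0, 0.7, 3.1]
codes = bytes(logistic_quantize(v) for v in feature)   # 1 byte per dimension
restored = [logistic_dequantize(q) for q in codes]
```

Under this sketch each feature dimension occupies one byte instead of the four bytes of a 32-bit float, and the logistic curve spends most of the 256 levels on values near `center`, where features are assumed to be densest.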