
Video-Based Person Re-Identification by an End-To-End Learning Architecture with Hybrid Deep Appearance-Temporal Feature.

Affiliation

School of Computer Science and Information Engineering, Hefei University of Technology, Feicui Road 420, Hefei 230000, China.

Publication Information

Sensors (Basel). 2018 Oct 29;18(11):3669. doi: 10.3390/s18113669.

Abstract

Video-based person re-identification is an important task in multi-camera visual sensor networks, with challenges including lighting variation, low-resolution images, background clutter, occlusion, and similarity of human appearance. In this paper, we propose a video-based person re-identification method called the end-to-end learning architecture with hybrid deep appearance-temporal feature. It can learn the appearance features of pivotal frames, the temporal features, and an independent distance metric for the different features. The architecture consists of a two-stream deep feature structure and two Siamese networks. For the first stream, we propose the Two-branch Appearance Feature (TAF) sub-structure to obtain the appearance information of persons, and use one of the two Siamese networks to learn the similarity of the appearance features of a pairwise person. To utilize the temporal information, we designed the second stream, consisting of the Optical flow Temporal Feature (OTF) sub-structure and the other Siamese network, to learn the person's temporal features and the distances between pairwise features. In addition, we select the pivotal frames of each video as inputs to the Inception-V3 network in the Two-branch Appearance Feature sub-structure, and employ a salience-learning fusion layer to fuse the learned global and local appearance features. Extensive experimental results on the PRID2011, iLIDS-VID, and Motion Analysis and Re-identification Set (MARS) datasets show that the proposed architecture reaches 79%, 59%, and 72% at Rank-1, respectively, outperforming state-of-the-art algorithms, and improves the feature representation ability for persons.
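The paper itself publishes no code; as a rough illustration of two ideas in the abstract, the salience-weighted fusion of global and local appearance features and the Siamese-style distance between a pairwise person's fused features, here is a minimal NumPy sketch. The feature dimension (128), the scalar blend weight `w`, and both function names are hypothetical assumptions, not the authors' actual learned layers:

```python
import numpy as np

def salience_fusion(global_feat, local_feat, w):
    # Blend global and local appearance features with a salience weight w in [0, 1].
    # In the paper this weighting is learned by a fusion layer; here w is fixed.
    return w * global_feat + (1.0 - w) * local_feat

def siamese_distance(feat_a, feat_b):
    # Euclidean distance between two fused feature vectors, as a stand-in for
    # the metric a Siamese branch would learn for pairwise comparison.
    return float(np.linalg.norm(feat_a - feat_b))

rng = np.random.default_rng(0)
g_a, l_a = rng.normal(size=128), rng.normal(size=128)  # person A: global, local features
g_b, l_b = rng.normal(size=128), rng.normal(size=128)  # person B: global, local features

fused_a = salience_fusion(g_a, l_a, w=0.6)
fused_b = salience_fusion(g_b, l_b, w=0.6)
d = siamese_distance(fused_a, fused_b)  # smaller distance -> more likely same identity
```

In the full architecture this distance would be computed per stream (appearance and optical-flow temporal), with each Siamese network learning its own metric rather than using a fixed Euclidean norm.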


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4481/6263398/0bf32ba6003d/sensors-18-03669-g001.jpg
