Gait-ViT: Gait Recognition with Vision Transformer.

Affiliation

Faculty of Information Science and Technology, Multimedia University, Melaka 75450, Malaysia.

Publication

Sensors (Basel). 2022 Sep 28;22(19):7362. doi: 10.3390/s22197362.

Abstract

Identifying an individual based on their physical or behavioral characteristics is known as biometric recognition. Gait is one of the most reliable biometrics due to its advantages, such as being perceivable at a long distance and difficult to replicate. Existing works mostly leverage Convolutional Neural Networks (CNNs) for gait recognition. CNNs perform well in image recognition tasks; however, they lack an attention mechanism to emphasize the significant regions of the image. The attention mechanism encodes information in the image patches, which enables the model to learn the substantial features in specific regions. In light of this, this work employs the Vision Transformer (ViT), with its attention mechanism, for gait recognition; the resulting model is referred to as Gait-ViT. In the proposed Gait-ViT, the gait energy image is first obtained by averaging the series of silhouette images over the gait cycle. The image is then split into patches and transformed into a sequence by flattening and patch embedding. Position embeddings are applied to the sequence of patch embeddings to retain the positional information of the patches. Subsequently, the sequence of vectors is fed to the Transformer encoder to produce the final gait representation. For classification, the first element of the output sequence is sent to a multi-layer perceptron to predict the class label. The proposed method obtained 99.93% accuracy on CASIA-B, 100% on OU-ISIR D, and 99.51% on OU-LP, demonstrating the ability of the Vision Transformer model to outperform state-of-the-art methods.
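
The gait energy image (GEI) step described above is straightforward to express in code. The following is a minimal NumPy sketch, assuming pre-segmented, pre-aligned binary silhouette frames for one gait cycle; silhouette extraction and gait-cycle detection are assumed to happen upstream, and the function name and array shapes here are illustrative, not from the paper.

```python
import numpy as np

def gait_energy_image(silhouettes):
    """Average a sequence of binary silhouette frames over one gait cycle.

    silhouettes: array of shape (T, H, W) with foreground pixels set to 1,
    where T is the number of frames in the cycle. Returns an (H, W) float
    image whose intensity reflects how often each pixel is foreground.
    """
    frames = np.asarray(silhouettes, dtype=np.float32)
    return frames.mean(axis=0)

# Example: 30 aligned 64x64 silhouette frames from one gait cycle.
cycle = np.random.randint(0, 2, size=(30, 64, 64))
gei = gait_energy_image(cycle)  # (64, 64) gait energy image
```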
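The patch splitting, flattening, patch embedding, position embedding, and class-token classification steps can likewise be sketched end to end. Below is a minimal PyTorch illustration under assumed hyperparameters (64x64 single-channel GEIs, 8x8 patches, 6 encoder layers, and 124 classes, matching the number of CASIA-B subjects); the class name GaitViT and all dimensions are assumptions for illustration, not the paper's exact architecture or training setup.

```python
import torch
import torch.nn as nn

class GaitViT(nn.Module):
    """Minimal ViT-style classifier over gait energy images (illustrative only)."""

    def __init__(self, image_size=64, patch_size=8, dim=256,
                 depth=6, heads=8, mlp_dim=512, num_classes=124):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        patch_dim = patch_size * patch_size  # single-channel GEI

        self.patch_size = patch_size
        # Linear patch embedding: flatten each patch, project to `dim`.
        self.patch_embed = nn.Linear(patch_dim, dim)
        # Learnable [class] token and position embeddings (patches + class token).
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, dim) * 0.02)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # MLP head applied to the first (class-token) element of the sequence.
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

    def forward(self, gei):                            # gei: (B, 1, H, W)
        b = gei.shape[0]
        p = self.patch_size
        # Split the image into non-overlapping p x p patches and flatten each.
        patches = gei.unfold(2, p, p).unfold(3, p, p)  # (B, 1, H/p, W/p, p, p)
        patches = patches.reshape(b, -1, p * p)        # (B, N, p*p)
        tokens = self.patch_embed(patches)             # (B, N, dim)
        cls = self.cls_token.expand(b, -1, -1)         # (B, 1, dim)
        # Prepend class token; add position embeddings to restore patch order.
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)                 # (B, N+1, dim)
        return self.head(encoded[:, 0])                # classify via class token

model = GaitViT()
logits = model(torch.randn(2, 1, 64, 64))  # two GEIs -> (2, 124) class logits
```

Using the first (class-token) output for classification mirrors the original ViT design: that token attends to all patch tokens through the encoder layers and so aggregates a whole-image gait representation.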


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0f2c/9572525/f3fb1fda38de/sensors-22-07362-g001.jpg
