Zhuoyue Honors College, Hangzhou Dianzi University, Hangzhou, China.
College of Automation, Hangzhou Dianzi University, Hangzhou, China.
Math Biosci Eng. 2023 Jan;20(2):2110-2130. doi: 10.3934/mbe.2023098. Epub 2022 Nov 14.
In traditional person re-identification models, a CNN is usually used for feature extraction, and a large number of convolution operations are applied to shrink the feature map when converting it into a feature vector. In a CNN, the receptive field of each layer is obtained by convolving the feature map of the previous layer, so this local receptive field is limited in size and the computational cost is high. To address these problems, this article combines the self-attention mechanism of the Transformer and designs an end-to-end person re-identification model (twinsReID) that integrates feature information between levels. In a Transformer, the output of each layer encodes the correlation between every element of the previous layer and all the other elements. Because each element computes its correlation with all the others, this operation is equivalent to a global receptive field, and since the computation is simple, its cost is small. From these perspectives, the Transformer has certain advantages over the convolution operations of a CNN. This paper uses the Twins-SVT Transformer to replace the CNN backbone, combines the features extracted at two different stages and divides them into two branches. In the first branch, the feature map is convolved to obtain a fine-grained feature map; in the second branch, global adaptive average pooling is applied to obtain a feature vector. The feature map is then divided into two sections, and global adaptive average pooling is performed on each. The three resulting feature vectors are each fed to a triplet loss. After the feature vectors pass through the fully connected layer, the output is fed to a cross-entropy loss and a center loss. The model is verified on the Market-1501 dataset in the experiments: the mAP/rank-1 metrics reach 85.4%/93.7%, and 93.6%/94.9% after re-ranking. A parameter count shows that the model has fewer parameters than traditional CNN models.
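The branch head described above (one global pooled vector plus two vectors pooled from sections of the feature map) can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the split of the feature map into two equal halves along the spatial height is an assumption about how "divided into two sections" is realized.

```python
import numpy as np

def global_avg_pool(fmap):
    """Global adaptive average pooling: (C, H, W) -> (C,) feature vector."""
    return fmap.mean(axis=(1, 2))

def branch_feature_vectors(fmap):
    """Hypothetical sketch of the pooling branch: one vector from the
    whole map, plus one from each half (split along H is an assumption)."""
    c, h, w = fmap.shape
    g = global_avg_pool(fmap)                     # whole-map vector
    upper = global_avg_pool(fmap[:, :h // 2, :])  # upper-section vector
    lower = global_avg_pool(fmap[:, h // 2:, :])  # lower-section vector
    return g, upper, lower

# Toy feature map: 4 channels over an 8x4 spatial grid.
fmap = np.arange(4 * 8 * 4, dtype=float).reshape(4, 8, 4)
g, upper, lower = branch_feature_vectors(fmap)
```

Each of the three vectors has one entry per channel; in the paper these three vectors are the inputs to the triplet loss.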
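For the triplet loss that the three pooled vectors are sent to, a standard hinge-style formulation can be sketched as follows. This is a generic sketch, not the paper's code; the margin value 0.3 is an assumed hyperparameter (a common choice in re-identification work, not stated in the abstract).

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge triplet loss on feature vectors: penalizes the anchor for
    being closer to the negative than to the positive by less than the
    margin. Margin 0.3 is an assumed value."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to same identity
    d_neg = np.linalg.norm(anchor - negative)  # distance to other identity
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])
p = np.array([1.1, 0.1])  # same identity: near the anchor
n = np.array([0.0, 1.0])  # different identity: far from the anchor
easy = triplet_loss(a, p, n)  # well-separated triplet
hard = triplet_loss(a, n, p)  # roles swapped: violates the margin
```

In the model, this loss is applied to each of the three feature vectors, while the fully connected outputs feed the cross-entropy and center losses.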