College of Information Science and Engineering, Linyi University, Linyi 276000, China.
School of Physics and Electronic Engineering, Linyi University, Linyi 276005, China.
Sensors (Basel). 2022 Aug 31;22(17):6558. doi: 10.3390/s22176558.
Recently, the transformer model has progressed from visual classification to target tracking. Its main use there is to replace the cross-correlation operation in the Siamese tracker, while the backbone of the network remains a convolutional neural network (CNN). However, existing transformer-based trackers simply reshape the features extracted by the CNN into patches and feed them into the transformer encoder. Each patch contains a single element of the spatial dimension of the extracted features and is input to the transformer structure, which uses cross-attention in place of cross-correlation. This paper proposes a reconstruction patch strategy that combines multiple elements of the spatial dimension of the extracted features into a new patch. The reconstruction operation has the following advantages: (1) the correlation between adjacent elements is well combined, and the features extracted by the CNN are usable for classification and regression; (2) using the performer operation reduces the amount of network computation and the dimension of the patch sent to the transformer, thereby sharply reducing the network parameters and improving the tracking speed of the model.
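The difference between single-element patches and reconstructed patches can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the window size `k`, the `stride`, and the toy feature-map shape are all assumptions made for illustration, since the abstract does not specify them. The sketch covers only the grouping of adjacent spatial elements; the performer operation and any projection that reduces the patch dimension are not shown.

```python
import numpy as np


def single_element_patches(feat):
    """Flatten a (C, H, W) feature map into H*W tokens of dimension C.

    Each token holds exactly one spatial element, as in the existing
    transformer-based trackers described above.
    """
    c, h, w = feat.shape
    return feat.reshape(c, h * w).T  # shape (H*W, C)


def reconstructed_patches(feat, k=2, stride=2):
    """Group each k x k spatial neighbourhood into a single token.

    k and stride are hypothetical values chosen for illustration only.
    Each token has dimension C*k*k and covers several adjacent elements.
    """
    c, h, w = feat.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    tokens = np.empty((out_h * out_w, c * k * k), dtype=feat.dtype)
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = feat[:, top:top + k, left:left + k]  # (C, k, k)
            tokens[i * out_w + j] = window.reshape(-1)
    return tokens


# Toy CNN feature map: 4 channels on a 6 x 6 spatial grid (assumed sizes).
feat = np.arange(4 * 6 * 6, dtype=np.float32).reshape(4, 6, 6)

print(single_element_patches(feat).shape)  # (36, 4)
print(reconstructed_patches(feat).shape)   # (9, 16)
```

With these assumed settings the token count drops from 36 to 9 while each token now spans a 2 x 2 neighbourhood, so adjacent elements reach the attention layers together. Since attention cost grows with the number of tokens, fewer tokens also means less attention computation in this sketch.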