Wang Wuwei, Zhang Ke, Su Yu, Wang Jingyu, Wang Qi
IEEE Trans Neural Netw Learn Syst. 2024 Nov;35(11):15156-15169. doi: 10.1109/TNNLS.2023.3282905. Epub 2024 Oct 29.
In the past few years, visual tracking methods based on convolutional neural networks (CNNs) have gained great popularity and success. However, the convolution operation of CNNs struggles to relate spatially distant information, which limits the discriminative power of trackers. Very recently, several Transformer-assisted tracking approaches have emerged to alleviate the above issue by combining CNNs with Transformers to enhance the feature representation. In contrast to the methods mentioned above, this article explores a pure Transformer-based model with a novel semi-Siamese architecture. Both the time-space self-attention module used to construct the feature extraction backbone and the cross-attention discriminator used to estimate the response map leverage attention alone, without convolution. Inspired by recent vision transformers (ViTs), we propose multistage alternating time-space Transformers (ATSTs) to learn robust feature representations. Specifically, temporal and spatial tokens at each stage are alternately extracted and encoded by separate Transformers. Subsequently, a cross-attention discriminator is proposed to directly generate response maps of the search region without additional prediction heads or correlation filters. Experimental results show that our ATST-based model attains favorable results against state-of-the-art convolutional trackers. Moreover, it shows comparable performance to recent "CNN + Transformer" trackers on various benchmarks while requiring significantly less training data.
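The abstract describes two attention-only components: a backbone stage that alternates self-attention over temporal tokens and spatial tokens, and a cross-attention discriminator that scores search-region tokens against template tokens to form a response map. The sketch below is a minimal, hypothetical numpy illustration of that idea, not the authors' implementation; all function names, the single-head attention without learned projections, and the tensor layout `(frames, tokens, channels)` are assumptions for clarity.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention (single head, no learned projections).
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def alternating_time_space_block(x):
    """One ATST-style stage (sketch): attend over time, then over space.
    x: (T, N, D) -- T frames, N spatial tokens per frame, D channels."""
    # Temporal attention: tokens at the same spatial position attend across frames.
    xt = x.transpose(1, 0, 2)           # (N, T, D)
    xt = xt + attention(xt, xt, xt)     # residual temporal self-attention
    x = xt.transpose(1, 0, 2)           # back to (T, N, D)
    # Spatial attention: tokens within each frame attend to one another.
    x = x + attention(x, x, x)          # residual spatial self-attention
    return x

def cross_attention_response(search, template):
    """Cross-attention discriminator (sketch): search tokens query template
    tokens, yielding one matching score per search token (the response map).
    search: (M, D), template: (K, D) -> scores: (M,)"""
    attended = attention(search, template, template)   # (M, D)
    return (search * attended).sum(-1)                 # token-wise similarity
```

The key point the sketch illustrates is that long-range dependencies, which convolution handles poorly, are captured directly: every token can attend to every other token along the temporal or spatial axis, and the response map falls out of cross-attention without a separate prediction head or correlation filter.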