CTT: CNN Meets Transformer for Tracking.

Affiliations

Xi'an Institute of Optics and Precision Mechanics of CAS, Xi'an 710000, China.

University of Chinese Academy of Sciences, Beijing 100049, China.

Publication information

Sensors (Basel). 2022 Apr 22;22(9):3210. doi: 10.3390/s22093210.

DOI:10.3390/s22093210

PMID:35590900

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9105974/

Abstract

Siamese networks are one of the most popular directions in deep-learning-based visual object tracking. In a Siamese tracker, the feature pyramid network (FPN) performs feature fusion, and cross-correlation matches the features extracted from the template branch against those from the search branch. However, object tracking should also attend to global and contextual dependencies. Hence, we introduce a carefully designed residual transformer structure, an encoder-decoder built on self-attention, into the neck of our tracker. In this structure, the encoder promotes interaction between the low-level features that the CNN extracts from the target and search branches in order to obtain global attention information, while the decoder replaces cross-correlation and passes that global attention information to the head module. We also add a spatial and channel attention component to the target branch, which further improves the accuracy and robustness of the proposed model at low cost. Finally, we evaluate our tracker, CTT, in detail on the GOT-10k, VOT2019, OTB-100, LaSOT, NfS, UAV123 and TrackingNet benchmarks, where the proposed method obtains results competitive with state-of-the-art algorithms.
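To make the idea of attention standing in for cross-correlation more concrete, the following is a minimal sketch of residual scaled dot-product cross-attention in plain Python. It is not the CTT implementation: the abstract does not specify layer shapes, head counts, or learned projections, so the function name `cross_attention`, the use of search features as queries and template features as keys and values, and the toy 2-dimensional features are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(search, template):
    """Residual scaled dot-product cross-attention (hypothetical helper).

    search:   list of feature vectors, one per search-region location (queries)
    template: list of feature vectors, one per template location (keys and values)
    Learned query/key/value projections are omitted to keep the sketch short.
    """
    d = len(template[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in search:
        # Every search location attends to every template location,
        # so the matching is global rather than a local sliding window.
        weights = softmax([dot(q, k) * scale for k in template])
        attended = [
            sum(w * v[i] for w, v in zip(weights, template)) for i in range(d)
        ]
        # Residual connection: add the attended template information
        # back onto the original search feature.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


# Toy features: three search locations, two template locations.
search = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
template = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(search, template)
```

In this toy setting, a search feature that resembles one template feature receives more of that template feature in its fused output, whereas a search feature equally similar to both receives an even mix. A real tracker would typically use learned projections and multiple heads on top of this basic operation.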
