Wang Jian, Song Yueming, Song Ce, Tian Haonan, Zhang Shuai, Sun Jinghui
Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China.
University of Chinese Academy of Sciences, Beijing 101408, China.
Sensors (Basel). 2024 Jan 3;24(1):274. doi: 10.3390/s24010274.
Most single-object trackers currently employ either a convolutional neural network (CNN) or a vision Transformer as the backbone. In CNNs, convolutional operations excel at extracting local features but struggle to capture global representations. Vision Transformers, on the other hand, use cascaded self-attention modules to capture long-range feature dependencies but may overlook local feature details. To address these limitations, we propose a target-tracking algorithm called CVTrack, which leverages a parallel dual-branch backbone network combining a CNN and a Transformer for feature extraction and fusion. First, CVTrack extracts local and global features from the input image with a parallel dual-branch network consisting of a CNN branch and a Transformer branch. Through bidirectional information-interaction channels, the local features from the CNN branch and the global features from the Transformer branch interact and fuse effectively. Second, deep cross-correlation operations and Transformer-based methods are employed to fuse the template and search-region features, enabling comprehensive interaction between them. The fused features are then fed into the prediction module to accomplish the object-tracking task. Our tracker achieves state-of-the-art performance on five benchmark datasets while maintaining real-time execution speed. Finally, we conduct ablation studies to demonstrate the efficacy of each module in the parallel dual-branch feature extraction backbone network.
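To make the described architecture concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' code) of the two ideas in the abstract: a parallel CNN/Transformer block with bidirectional information exchange between branches, and a cross-correlation between template and search-region features. The depthwise form of the correlation, the module names, channel sizes, and layer choices are all illustrative assumptions rather than the paper's actual design.

```python
# Hypothetical sketch of a parallel CNN/Transformer dual-branch block with
# bidirectional interaction, plus depthwise cross-correlation for template/search
# fusion. All names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualBranchBlock(nn.Module):
    """One stage of a parallel CNN + Transformer backbone with two-way interaction."""

    def __init__(self, channels=256, num_heads=8):
        super().__init__()
        # CNN branch: local feature extraction.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Transformer branch: global self-attention over flattened tokens.
        self.attn = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True
        )
        # Bidirectional interaction channels: 1x1 projections that exchange
        # information between the two branches (a simple stand-in).
        self.cnn_to_trans = nn.Conv2d(channels, channels, kernel_size=1)
        self.trans_to_cnn = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x_cnn, x_trans):
        b, c, h, w = x_cnn.shape
        # Local features from the CNN branch.
        local = self.conv(x_cnn)
        # Global features from the Transformer branch (tokens of shape B x HW x C).
        tokens = x_trans.flatten(2).transpose(1, 2)
        global_feat = self.attn(tokens).transpose(1, 2).reshape(b, c, h, w)
        # Two-way fusion: each branch receives a projection of the other.
        x_cnn_out = local + self.trans_to_cnn(global_feat)
        x_trans_out = global_feat + self.cnn_to_trans(local)
        return x_cnn_out, x_trans_out


def depthwise_xcorr(search, template):
    """Depthwise cross-correlation: the template acts as a per-channel kernel
    slid over the search-region features (groups == batch * channels)."""
    b, c, h, w = search.shape
    kernel = template.reshape(b * c, 1, *template.shape[2:])
    out = F.conv2d(search.reshape(1, b * c, h, w), kernel, groups=b * c)
    return out.reshape(b, c, *out.shape[2:])


if __name__ == "__main__":
    block = DualBranchBlock(channels=256)
    feat = torch.randn(1, 256, 16, 16)
    cnn_f, trans_f = block(feat, feat)
    # Template features (8x8) correlated against fused search features (16x16).
    response = depthwise_xcorr(cnn_f + trans_f, torch.randn(1, 256, 8, 8))
    print(response.shape)  # torch.Size([1, 256, 9, 9])
```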