Pan Fei, Zhao Lianyu, Wang Chenglin
School of Computer Science and Engineering, Tianjin University of Technology, Liqizhuang Street, Tianjin, 300384, China.
School of Mechanical Engineering, Tianjin University of Technology, Liqizhuang Street, Tianjin, 300384, China.
Sci Rep. 2024 May 28;14(1):12256. doi: 10.1038/s41598-024-63028-5.
Transformer-based Siamese networks have excelled in object tracking. However, they typically rely on ResNet as the backbone, which captures global information poorly and limits feature representation. These trackers also struggle to attend to target-relevant information within the search region using multi-head self-attention (MSA), are prone to robustness problems during online tracking, and tend to have high model complexity. To address these limitations, we propose a novel tracker named ASACTT, which consists of a backbone network, a feature fusion network, and a prediction head. First, we improve Swin-Transformer-Tiny to enhance its global information extraction. Second, we propose adaptive sparse attention (ASA) to focus on target-specific details within the search region. Third, we leverage position encoding and historical candidate data to develop a dynamic template updater (DTU), which preserves the integrity of the initial frame while adapting to variations in the target's appearance. Finally, we optimize the network model to maintain accuracy while minimizing complexity. Experiments on five benchmark datasets demonstrate that ASACTT is competitive with other state-of-the-art methods. Notably, on the GOT-10K evaluation, our tracker achieves a success score of 75.3% at 36 FPS, significantly surpassing other trackers with a comparable number of model parameters.
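The abstract does not say how ASA decides which positions to attend to. As a generic illustration of sparse attention, the sketch below assumes a simple top-k rule: each query keeps only its k highest-scoring keys and renormalizes over them. The function name `topk_sparse_attention`, the top-k rule, and the plain-list vector representation are all assumptions made for illustration; this is not the authors' ASA.

```python
import math


def topk_sparse_attention(queries, keys, values, k):
    """Scaled dot-product attention where each query keeps only its k largest scores.

    queries: list of d-dimensional vectors (e.g. search-region tokens)
    keys, values: lists of equal length (e.g. template tokens)
    The top-k selection is an illustrative assumption, not the ASA rule from the paper.
    """
    if not 1 <= k <= len(keys):
        raise ValueError("k must be between 1 and the number of keys")
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
        # Indices of the k highest-scoring keys; every other key gets zero weight.
        kept = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        peak = max(scores[i] for i in kept)  # subtract the max for numerical stability
        exps = {i: math.exp(scores[i] - peak) for i in kept}
        total = sum(exps.values())
        weights = [exps.get(i, 0.0) / total for i in range(len(keys))]
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
        all_weights.append(weights)
    return outputs, all_weights
```

With k equal to the number of keys this reduces to ordinary dense softmax attention. Smaller k forces each query to ignore low-scoring keys entirely, which is the general idea behind concentrating attention on target-relevant positions.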