Luo Yang, Guo Xiqing, Dong Mingtao, Yu Jin
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China.
University of Chinese Academy of Sciences, Beijing 100049, China.
Sensors (Basel). 2023 Jul 22;23(14):6609. doi: 10.3390/s23146609.
RGB-T tracking uses images from both the visible and thermal modalities. The primary objective is to adaptively leverage whichever modality is relatively dominant under varying conditions, achieving more robust tracking than single-modality approaches. This paper proposes an RGB-T tracker based on a mixed-attention mechanism to achieve complementary fusion of the modalities (referred to as MACFT). In the feature extraction stage, separate transformer backbone branches extract modality-specific and modality-shared information. By performing mixed-attention operations in the backbone to enable information interaction and self-enhancement between the template and search images, a robust feature representation is constructed that better captures the high-level semantic features of the target. In the feature fusion stage, a modality shared-specific feature interaction structure is designed, also based on mixed attention, which effectively suppresses noise from the low-quality modality while enhancing information from the dominant one. Evaluation on multiple public RGB-T datasets demonstrates that the proposed tracker outperforms other RGB-T trackers on standard evaluation metrics and also adapts well to long-term tracking scenarios.
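The abstract's core idea, mixing self-attention (within a modality) with cross-attention (between modalities) so the dominant modality can reinforce the weaker one, can be illustrated with a minimal sketch. This is not the paper's MACFT implementation; the function names, the single-head attention, and the simple averaging fusion are all illustrative assumptions standing in for the tracker's learned shared-specific interaction structure.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention over token matrices (tokens x dim)."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def mixed_attention_fuse(rgb_tokens, tir_tokens):
    """Illustrative mixed-attention fusion of RGB and thermal feature tokens.

    Self-attention enhances each modality's own representation;
    cross-attention lets each modality query the other, so information
    from the stronger modality flows into the fused features.
    (Simple averaging replaces the paper's learned fusion.)
    """
    # Self-enhancement within each modality
    rgb_self = attention(rgb_tokens, rgb_tokens, rgb_tokens)
    tir_self = attention(tir_tokens, tir_tokens, tir_tokens)
    # Cross-modal interaction: each modality attends to the other
    rgb_cross = attention(rgb_tokens, tir_tokens, tir_tokens)
    tir_cross = attention(tir_tokens, rgb_tokens, rgb_tokens)
    # Naive fusion of the four streams
    return (rgb_self + rgb_cross + tir_self + tir_cross) / 4.0
```

In the actual tracker these operations run inside transformer backbone branches over template and search-image tokens, with learned projections and gating rather than the fixed averaging shown here.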