Li Bo, Peng Fengguang, Hui Tianrui, Wei Xiaoming, Wei Xiaolin, Zhang Lijun, Shi Hang, Liu Si
IEEE Trans Pattern Anal Mach Intell. 2025 Jan;47(1):634-649. doi: 10.1109/TPAMI.2024.3475472. Epub 2024 Dec 4.
The goal of RGB-Thermal (RGB-T) tracking is to utilize the synergistic and complementary strengths of the RGB and thermal infrared (TIR) modalities to enhance tracking in diverse situations, with cross-modal interaction being a crucial element. Earlier methods often simply combine the features of the RGB and TIR search frames, leading to a coarse interaction that also introduces unnecessary background noise. Many other approaches sample candidate boxes from search frames and apply different fusion techniques to individual pairs of RGB and TIR boxes, which confines cross-modal interaction to local areas and results in insufficient context modeling. Mining video temporal contexts is also under-explored in RGB-T tracking. To alleviate these limitations, we propose a novel Template-Bridged Search region Interaction (TBSI) module that exploits templates as the medium to bridge the cross-modal interaction between RGB and TIR search regions by gathering and distributing target-relevant object and environment contexts. An Illumination Guided Fusion (IGF) module is designed to adaptively fuse RGB and TIR search region tokens with a global illumination factor. Furthermore, for the inference stage, we propose an efficient Target-Preserved Template Updating (TPTU) strategy that leverages the temporal context within video sequences to accommodate the target's appearance change. Our proposed modules are integrated into a ViT backbone for joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance.
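The abstract describes IGF only as fusing RGB and TIR search region tokens "with a global illumination factor" and gives no formula. As a rough illustration of what such a fusion could look like, the sketch below weights the two modalities by a single scalar. Everything in it is an assumption made for illustration and not the authors' design: the function names `illumination_factor` and `fuse_tokens`, the sigmoid of mean brightness used as the factor, and the convex combination used for fusion.

```python
import math


def illumination_factor(rgb_pixels):
    """Hypothetical global illumination score in (0, 1).

    Assumes pixel intensities are already scaled to [0, 1]. The score is a
    sigmoid of the mean brightness centered at 0.5; the paper's actual
    factor is not specified in the abstract.
    """
    if not rgb_pixels:
        raise ValueError("rgb_pixels must not be empty")
    mean = sum(rgb_pixels) / len(rgb_pixels)
    return 1.0 / (1.0 + math.exp(-10.0 * (mean - 0.5)))


def fuse_tokens(rgb_tokens, tir_tokens, alpha):
    """Fuse two equally shaped token lists as alpha * RGB + (1 - alpha) * TIR.

    This convex combination is an assumed stand-in for the IGF module:
    a bright scene (alpha near 1) favors RGB, a dark scene favors TIR.
    """
    if len(rgb_tokens) != len(tir_tokens):
        raise ValueError("token lists must have the same length")
    return [
        [alpha * r + (1.0 - alpha) * t for r, t in zip(rgb_tok, tir_tok)]
        for rgb_tok, tir_tok in zip(rgb_tokens, tir_tokens)
    ]
```

Under these assumptions, a dark frame yields a small factor, so calling `fuse_tokens(rgb, tir, illumination_factor(pixels))` leans on the TIR tokens, while a well-lit frame leans on the RGB tokens.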