Zhang Tianlu, Jiao Qiang, Zhang Qiang, Han Jungong
IEEE Trans Image Process. 2024;33:4303-4318. doi: 10.1109/TIP.2024.3428316. Epub 2024 Jul 30.
In RGB-T tracking, multi-modal data contain rich spatial relationships between the target and its background, and these relationships remain largely consistent across successive frames; both properties are crucial for boosting tracking performance. However, most existing RGB-T trackers overlook such multi-modal spatial relationships and temporal consistencies in RGB-T videos, which hinders robust tracking and practical application in complex scenarios. In this paper, we propose a novel Multi-modal Spatial-Temporal Context (MMSTC) network for RGB-T tracking, which employs a Transformer architecture to construct reliable multi-modal spatial context and to effectively propagate temporal context. Specifically, a Multi-modal Transformer Encoder (MMTE) is designed to encode reliable multi-modal spatial contexts and to fuse multi-modal features. Furthermore, a Quality-aware Transformer Decoder (QATD) is proposed to propagate tracking cues from historical frames to the current frame, which facilitates the object search process. Moreover, the proposed MMSTC network can be easily extended to various tracking frameworks. New state-of-the-art results on five prevalent RGB-T tracking benchmarks demonstrate the superiority of our trackers over existing ones.
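The abstract only names the two modules (MMTE and QATD); the sketch below illustrates one plausible reading of that encoder-decoder design in PyTorch. The module names come from the paper, but every implementation detail (token concatenation of the two modalities, the sigmoid quality gate, tensor shapes, layer counts) is an assumption for illustration, not the authors' actual architecture.

# Minimal sketch, assuming RGB and thermal features arrive as token sequences.
import torch
import torch.nn as nn

class MMTE(nn.Module):
    # Multi-modal Transformer Encoder: self-attention over the concatenated
    # RGB and thermal tokens, jointly modeling spatial context and fusion
    # (concatenation strategy assumed).
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, rgb_tokens, tir_tokens):
        # (B, N, C) + (B, N, C) -> (B, 2N, C) fused multi-modal spatial context
        return self.encoder(torch.cat([rgb_tokens, tir_tokens], dim=1))

class QATD(nn.Module):
    # Quality-aware Transformer Decoder: the current frame attends to cues
    # from historical frames, down-weighted by a predicted quality score
    # (gating scheme assumed).
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=layers)
        self.quality = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, current_feats, history_cues):
        q = self.quality(history_cues)            # (B, M, 1), score in [0, 1]
        return self.decoder(current_feats, q * history_cues)

# Toy usage with random features (shapes are illustrative only).
B, N, M, C = 2, 64, 16, 256
rgb, tir = torch.randn(B, N, C), torch.randn(B, N, C)
history = torch.randn(B, M, C)
fused = MMTE()(rgb, tir)        # (B, 2N, C) multi-modal spatial context
out = QATD()(fused, history)    # (B, 2N, C) current features enriched with temporal cues

In this reading, the encoder builds the multi-modal spatial context for a single frame, while the decoder is the mechanism that propagates temporal context: reliable historical cues receive high quality scores and contribute strongly to the cross-attention, whereas degraded ones are suppressed.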