Liu Jiaming, Wu Yue, Miao Qiguang, Gong Maoguo, Kong Linghe
IEEE Trans Pattern Anal Mach Intell. 2025 Sep;47(9):8148-8164. doi: 10.1109/TPAMI.2025.3581381.
3D Single Object Tracking (SOT) plays an important role in real-world visual applications such as autonomous driving and planning. Effective 3D SOT remains challenging because its data carrier, the point cloud, is sparse and because tracking is affected by many complex factors. Inspired by the long-range modeling ability of transformers, we propose a Versatile Point Tracking Transformer (VPTT) for 3D SOT, in which the template point cloud provides object guidance to the search-area point cloud under the Siamese tracking paradigm. Specifically, VPTT employs self- and cross-attention mechanisms and extends them into four matching operations that leverage the contextual information of consecutive frames to improve tracking results. We construct a deep network, VerFormer, consisting of four successive transformer layers that perform, from shallow to deep, the matching operations of fusional transformation, separative discrimination, intersectional interaction, and unidirectional propagation. Because the tracking task involves multiple processes, VPTT also learns to forecast intermediate outputs at each stage, including mask probability, trailing distance, and heading angle. This specialized design allows VPTT to revisit the end-to-end training paradigm used for 3D tracking while providing a versatile transformer that is well suited to the 3D SOT task. Experiments on three benchmarks, KITTI, nuScenes, and Waymo, show that VPTT achieves state-of-the-art performance among Siamese-based trackers while running at about 62 FPS.
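The abstract describes guidance flowing from the template point cloud to the search-area point cloud through cross-attention. The following is a minimal sketch of that general idea only, not the authors' VerFormer layer: it uses plain scaled dot-product attention, omits the learned query/key/value projections, and runs on made-up toy features. The function names `softmax` and `cross_attention` and all feature values are assumptions introduced for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(search_feats, template_feats):
    """Scaled dot-product attention where search points query template points.

    Assumption: learned projections are replaced by the identity, so this
    shows only the data flow (template -> search), not a trained layer.
    Returns (updated search features, attention weights).
    """
    d = len(search_feats[0])
    scale = 1.0 / math.sqrt(d)

    # One row of attention weights per search point, one column per template point.
    weights = []
    for q in search_feats:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in template_feats]
        weights.append(softmax(scores))

    # Each search point receives a weighted average of the template features.
    fused = [
        [sum(w * v[j] for w, v in zip(w_row, template_feats)) for j in range(d)]
        for w_row in weights
    ]
    return fused, weights


# Toy per-point features (hypothetical values): 2 template points, 3 search points.
template = [[1.0, 0.0], [0.0, 1.0]]
search = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused, attn = cross_attention(search, template)
```

Each row of `attn` sums to 1, so every search point ends up with a convex combination of template features. In a real tracker the features would come from a point-cloud backbone and the projections would be learned.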