Liu Hongmin, Zhang Canbin, Fan Bin, Xu Jinglin
IEEE Trans Image Process. 2024;33:6508-6520. doi: 10.1109/TIP.2024.3494600.
Multi-object tracking (MOT) aims to estimate the bounding boxes and identity labels of objects in videos. A central challenge is alleviating the competition between the detection and tracking subtasks during learning. Two-stage Tracking-By-Detection (TBD) methods address this by optimizing the two subtasks separately, while single-stage Joint Detection and Tracking (JDT) methods finely tune complex network architectures within an end-to-end pipeline. In this paper, we propose a new MOT method, Proposal Propagation via Diffusion Models (Pro2Diff), which integrates a diffusion model into proposal propagation for multi-object tracking and focuses on the model training process rather than on complex network design. Specifically, Pro2Diff takes a generative approach: in the forward process it generates a large number of noisy proposals for the tracking image sequence, then learns the discrepancies between these noisy proposals and the actual bounding boxes of the tracked objects, gradually refining the proposals to obtain the initial sequence of real tracked objects. Introducing the denoising diffusion process into multi-object tracking leads to three further findings: 1) generative methods can effectively handle multi-object tracking tasks; 2) the self-conditional proposal propagation we propose improves model performance during inference without modifying the model structure; 3) adjusting the numbers of proposals and iterations appropriately for different tracking sequences yields the best model performance. Extensive experiments on the MOT17 and DanceTrack datasets demonstrate that Pro2Diff outperforms current end-to-end multi-object tracking methods. We achieve 61.9 HOTA on DanceTrack and 57.6 HOTA on MOT17, results that are competitive with JDT approaches.
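The forward process described in the abstract, which turns the boxes of tracked objects into a fixed-size set of noisy proposals, can be illustrated with a minimal sketch. The abstract does not give the noise schedule, box parameterization, or padding strategy, so everything concrete here is an assumption: a standard DDPM-style Gaussian forward process with a cosine schedule, boxes as normalized `(cx, cy, w, h)`, and random padding boxes. The names `cosine_alpha_bar` and `noisy_proposals` are hypothetical and are not taken from the paper.

```python
import math
import random


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal-retention factor at step t (cosine schedule, assumed)."""
    def f(u):
        return math.cos(((u / num_steps) + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def noisy_proposals(gt_boxes, num_proposals, t, num_steps=1000, seed=0):
    """Pad boxes to a fixed count, then apply Gaussian forward noise at step t.

    Boxes are (cx, cy, w, h) normalized to [0, 1]; this layout is an assumption.
    """
    rng = random.Random(seed)
    boxes = [list(b) for b in gt_boxes]
    # Pad with random boxes so every frame carries the same number of proposals.
    while len(boxes) < num_proposals:
        boxes.append([rng.random() for _ in range(4)])
    boxes = boxes[:num_proposals]

    a_bar = cosine_alpha_bar(t, num_steps)
    signal = math.sqrt(a_bar)        # weight on the clean box coordinates
    noise = math.sqrt(1.0 - a_bar)   # weight on the Gaussian noise
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in box]
            for box in boxes]


if __name__ == "__main__":
    gt = [(0.5, 0.5, 0.2, 0.4)]
    for step in (0, 500, 1000):
        print(step, noisy_proposals(gt, num_proposals=4, t=step)[0])
```

At `t = 0` the proposals equal the padded input boxes, and as `t` approaches `num_steps` they become close to pure Gaussian noise. A denoising model would then be trained to map such noisy proposals back toward the actual boxes; that network is outside the scope of this sketch.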