SimpleTrackV2: Rethinking the Timing Characteristics for Multi-Object Tracking

Authors

Ding Yan, Ling Yuchen, Zhang Bozhi, Li Jiaxin, Guo Lingxi, Yang Zhe

Affiliations

Key Laboratory of Dynamics and Control of Flight Vehicle, Ministry of Education, School of Aerospace Engineering, Beijing Institute of Technology, Beijing 100081, China.

Science and Technology on Space Physics Laboratory, Beijing 100076, China.

Publication

Sensors (Basel). 2024 Sep 17;24(18):6015. doi: 10.3390/s24186015.

DOI:10.3390/s24186015

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11436168/

Abstract

Multi-object tracking assigns a unique trajectory identifier to each target across video frames. Most detection-based tracking methods predict trajectories with a Kalman filter and update them directly from the features of the associated targets. However, this approach often fails under the camera jitter and transient target loss that occur in real-world scenarios. To address these issues, this paper rethinks state prediction and fusion based on target temporal features and proposes the SimpleTrackV2 algorithm, building on the previously designed SimpleTrack. First, to address the poor prediction performance of linear motion models in complex scenes, we designed LSTM-MP, a target state prediction algorithm based on long short-term memory (LSTM). It encodes the target's historical motion information with an LSTM and decodes it with a multilayer perceptron (MLP) to predict the target state. Second, to mitigate the effect of occlusion on target state saliency, we designed TSA-FF, a spatiotemporal attention-based target appearance feature fusion algorithm. TSA-FF calculates adaptive fusion coefficients to enhance target state fusion, thereby improving the accuracy of subsequent data association. To demonstrate the effectiveness of the proposed method, we compared SimpleTrackV2 with the baseline model SimpleTrack on the MOT17 dataset. We also conducted ablation experiments on TSA-FF and LSTM-MP, exploring the optimal number of fusion frames and the impact of different loss functions on model performance. The experimental results show that SimpleTrackV2 handles camera jitter and target occlusion better than SimpleTrack, improving MOTA, IDF1, and HOTA by 1.6%, 3.2%, and 6.1%, respectively.
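The two ideas named in the abstract can be illustrated with a minimal, untrained sketch in plain Python. It is not the authors' implementation. The class `TinyLSTMPredictor`, the function `fuse_appearance`, the layer sizes, the bias-free gates, the choice to predict an offset from the last box, and the use of scaled dot-product attention over time only (with a caller-supplied query feature) are all assumptions made for illustration; the abstract does not specify any of these details.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    # Multiply matrix W (list of rows) by vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TinyLSTMPredictor:
    """Hypothetical LSTM encoder + MLP decoder with random (untrained) weights."""

    def __init__(self, in_dim=4, hid=8, seed=0):
        rng = random.Random(seed)
        self.hid = hid
        # One bias-free weight matrix per gate, applied to the concatenation [x; h].
        self.gates = {g: rand_matrix(hid, in_dim + hid, rng) for g in "ifgo"}
        self.W1 = rand_matrix(hid, hid, rng)     # MLP hidden layer
        self.W2 = rand_matrix(in_dim, hid, rng)  # MLP output layer

    def encode(self, boxes):
        # Run the LSTM over the box history and return the final hidden state.
        h = [0.0] * self.hid
        c = [0.0] * self.hid
        for x in boxes:
            z = list(x) + h
            i = [sigmoid(v) for v in matvec(self.gates["i"], z)]    # input gate
            f = [sigmoid(v) for v in matvec(self.gates["f"], z)]    # forget gate
            g = [math.tanh(v) for v in matvec(self.gates["g"], z)]  # cell candidate
            o = [sigmoid(v) for v in matvec(self.gates["o"], z)]    # output gate
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h

    def predict(self, boxes):
        # Decode the encoded history with a ReLU MLP into an offset (assumption)
        # that is added to the most recent box.
        hidden = [max(0.0, v) for v in matvec(self.W1, self.encode(boxes))]
        delta = matvec(self.W2, hidden)
        return [b + d for b, d in zip(boxes[-1], delta)]


def fuse_appearance(track_feats, query):
    """Fuse per-frame appearance features with softmax attention weights.

    The weights play the role of adaptive fusion coefficients: frames whose
    feature is more similar to the query contribute more to the fused feature.
    """
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(f, query)) / scale for f in track_feats]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [sum(w * f[k] for w, f in zip(weights, track_feats))
             for k in range(len(query))]
    return fused, weights
```

For example, `TinyLSTMPredictor().predict([[0, 0, 10, 20], [1, 0, 10, 20]])` returns a four-element box estimate, and `fuse_appearance(feats, query)` returns the fused feature together with weights that sum to 1. In a real tracker the weights would be learned from data, which is where the loss-function comparison mentioned in the abstract would apply.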

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8c18/11436168/c0182e711250/sensors-24-06015-g001.jpg