STMMOT: Advancing multi-object tracking through spatiotemporal memory networks and multi-scale attention pyramids.

Author information

Mukhtar Hamza, Khan Muhammad Usman Ghani

Affiliations

Department of Computer Science, University of Engineering and Technology Lahore, G.T. Road, Lahore, 54890, Punjab, Pakistan; Intelligent Criminology Lab, National Center of Artificial Intelligence, AlKhawarizmi Institute of Computer Science, University of Engineering and Technology, G.T. Road, Lahore, 54890, Punjab, Pakistan.

Publication information

Neural Netw. 2023 Nov;168:363-379. doi: 10.1016/j.neunet.2023.09.047. Epub 2023 Sep 29.

DOI:10.1016/j.neunet.2023.09.047

PMID:37801917

Abstract

Multi-object tracking (MOT) is important in human surveillance, sports analytics, autonomous driving, and cooperative robots. Current MOT methods perform poorly under non-uniform motion, occlusion, and scenarios in which objects disappear and reappear. We introduce an MOT method that merges object detection and identity linkage within an end-to-end trainable framework and is designed to maintain object links over long periods. The proposed model, STMMOT, is built around four key modules: (1) a candidate proposal creation network, which generates object proposals with a vision-Transformer encoder-decoder architecture; (2) a scale variant pyramid, a progressive pyramid structure that learns self-scale and cross-scale similarities in multi-scale feature maps; (3) a spatio-temporal memory encoder, which extracts the essential information from the memory associated with each tracked object; and (4) a spatio-temporal memory decoder, which solves object detection and identity association for MOT simultaneously. The system relies on a spatio-temporal memory module that retains extensive historical observations of object states and encodes them with an attention-based aggregator. STMMOT is distinctive in representing objects as dynamic query embeddings that are updated continuously, which allows object states to be predicted with an attention mechanism and eliminates the need for post-processing. Experimental results show that STMMOT achieves 79.8 and 78.4 IDF1, 79.3 and 74.1 MOTA, 73.2 and 69.0 HOTA, and 61.2 and 61.5 AssA, with 1529 and 1264 ID switches, on MOT17 and MOT20, respectively. Compared with the previous best method, TransMOT, STMMOT improves IDF1 by about 4.58% and 4.25% and reduces ID switches by about 5.79% and 21.05% on MOT17 and MOT20, respectively.
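
The abstract describes a per-object memory whose stored states are encoded by an attention-based aggregator. The sketch below illustrates that general idea in plain Python, assuming scaled dot-product attention between a track's current query embedding and a fixed-length buffer of its past embeddings. It is not the authors' implementation: the names TrackMemory and aggregate_memory, the buffer length, and the 2-dimensional embeddings are all hypothetical choices made for illustration.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_memory(query, memory):
    """Attention-pool stored state vectors into one vector (illustrative).

    query:  current embedding of the track (list of floats)
    memory: non-empty collection of past embeddings of the same track
    """
    memory = list(memory)
    scale = math.sqrt(len(query))
    # Scaled dot-product score between the query and each stored state.
    scores = [sum(q * m for q, m in zip(query, past)) / scale
              for past in memory]
    weights = softmax(scores)
    # Weighted sum of stored states, one output value per dimension.
    return [sum(w * past[d] for w, past in zip(weights, memory))
            for d in range(len(query))]


class TrackMemory:
    """Fixed-length history of one object's embeddings (hypothetical helper)."""

    def __init__(self, max_len=4):
        # A bounded deque drops the oldest entry once max_len is reached.
        self.buffer = deque(maxlen=max_len)

    def update(self, embedding):
        self.buffer.append(list(embedding))

    def summary(self, query):
        return aggregate_memory(query, self.buffer)


# Toy usage: four observations, but only the three most recent are kept.
track = TrackMemory(max_len=3)
for observed in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]):
    track.update(observed)

pooled = track.summary(query=[1.0, 0.0])
print(len(track.buffer), pooled)
```

Stored states that resemble the query receive larger weights, so a single dissimilar observation (for example, one taken during an occlusion) contributes less to the pooled summary. A learned model would typically apply trained projections to the queries, keys, and values; those are omitted here to keep the sketch short.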
