Feng Wei, Wang Feifan, Han Ruize, Gan Yiyang, Qian Zekun, Hou Junhui, Wang Song
IEEE Trans Pattern Anal Mach Intell. 2025 Jan;47(1):351-368. doi: 10.1109/TPAMI.2024.3463966. Epub 2024 Dec 4.
Multi-view multi-human association and tracking (MvMHAT) is an emerging yet important problem in multi-person video surveillance. It aims to track a group of people over time in each view and to identify the same person across different views at the same time, unlike previous multi-object tracking (MOT) and multi-camera MOT tasks, which consider only over-time human tracking. As a result, MvMHAT videos require more complex annotations while also containing more information for self-learning. In this work, we tackle this problem with an end-to-end neural network trained in a self-supervised manner. Specifically, we propose to exploit the spatial-temporal self-consistency rationale through three properties: reflexivity, symmetry, and transitivity. Beyond the reflexivity property, which holds naturally, we design self-supervised learning losses based on the symmetry and transitivity properties, for both appearance feature learning and assignment matrix optimization, to associate multiple humans over time and across views. Furthermore, to promote research on MvMHAT, we build two new large-scale benchmarks for network training and for testing different algorithms. Extensive experiments on the proposed benchmarks verify the effectiveness of our method. We have released the benchmark and code to the public.
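The symmetry and transitivity consistencies mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is a simplified reading of the idea, not the authors' implementation: soft assignment matrices are built from appearance features, the symmetry term asks that matching view A to B and back recovers the identity mapping, and the transitivity term asks that chaining A to B to C agrees with matching A to C directly. The temperature value and loss forms here are assumptions for illustration.

```python
import numpy as np

def assignment_matrix(feats_a, feats_b, temperature=0.1):
    """Row-softmax over pairwise cosine similarities: a soft assignment
    from people detected in view/frame A to people in view/frame B.
    The temperature is an assumed hyperparameter, not from the paper."""
    fa = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    fb = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = fa @ fb.T
    e = np.exp(sim / temperature)
    return e / e.sum(axis=1, keepdims=True)

def symmetry_loss(S_ab, S_ba):
    """Matching A->B then B->A should return each person to itself,
    i.e. S_ab @ S_ba should be close to the identity matrix."""
    P = S_ab @ S_ba
    return -np.mean(np.log(np.diag(P) + 1e-8))

def transitivity_loss(S_ab, S_bc, S_ac):
    """Chained matching A->B->C should agree with direct matching A->C."""
    return np.mean((S_ab @ S_bc - S_ac) ** 2)
```

Note that both losses need no identity labels: they are computed purely from the network's own pairwise assignments across frames and views, which is what makes the training self-supervised.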