IEEE Trans Pattern Anal Mach Intell. 2021 Aug;43(8):2794-2808. doi: 10.1109/TPAMI.2020.2974726. Epub 2021 Jul 1.
Reliable markerless motion tracking of people participating in a complex group activity from multiple moving cameras is challenging due to frequent occlusions, strong viewpoint and appearance variations, and asynchronous video streams. To solve this problem, reliable association of the same person across distant viewpoints and temporal instances is essential. We present a self-supervised framework to adapt a generic person appearance descriptor to the unlabeled videos by exploiting motion tracking, mutual exclusion constraints, and multi-view geometry. The adapted discriminative descriptor is used in a tracking-by-clustering formulation. We validate the effectiveness of our descriptor learning on WILDTRACK (T. Chavdarova et al., "WILDTRACK: A multi-camera HD dataset for dense unscripted pedestrian detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5030-5039) and on three new complex social scenes captured by multiple cameras with up to 60 people "in the wild". We report significant improvement in association accuracy (up to 18 percent) and stable and coherent 3D human skeleton tracking (5 to 10 times) over the baseline. Using the reconstructed 3D skeletons, we cut the input videos into a multi-angle video where the image of a specified person is shown from the best visible front-facing camera. Our algorithm detects inter-human occlusion to determine the camera switching moment while still maintaining the flow of the action well. Website: http://www.cs.cmu.edu/~ILIM/projects/IM/Association4Tracking.
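As an illustration only, and not the authors' implementation, two of the self-supervision signals named in the abstract can be sketched as pair mining. The input format is an assumption: each detection is a hypothetical `(camera, frame, tracklet_id)` tuple, where `tracklet_id` comes from a single-view motion tracker. Detections on the same tracklet are treated as positive pairs, and detections on different tracklets in the same camera and frame are treated as negative pairs, since one person cannot appear twice in one image (mutual exclusion). The multi-view geometry constraint is omitted.

```python
from itertools import combinations


def mine_pairs(detections):
    """Return (positives, negatives) as lists of detection index pairs.

    detections: list of (camera, frame, tracklet_id) tuples.
    This tuple layout is a hypothetical format chosen for the sketch.
    """
    positives, negatives = [], []
    for i, j in combinations(range(len(detections)), 2):
        cam_i, frame_i, trk_i = detections[i]
        cam_j, frame_j, trk_j = detections[j]
        same_view = cam_i == cam_j
        if same_view and trk_i == trk_j:
            # Motion tracking: same tracklet implies same person.
            positives.append((i, j))
        elif same_view and frame_i == frame_j:
            # Mutual exclusion: two detections in one image are two people.
            negatives.append((i, j))
    return positives, negatives


detections = [
    ("cam0", 0, "a"),
    ("cam0", 1, "a"),
    ("cam0", 0, "b"),
    ("cam1", 0, "c"),
]
positives, negatives = mine_pairs(detections)
```

Pairs mined this way could feed a contrastive or triplet-style objective to adapt an appearance descriptor; the abstract does not state which loss is used, so that choice is left open here.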