Cheng Yu, Wang Bo, Tan Robby T
IEEE Trans Pattern Anal Mach Intell. 2023 Feb;45(2):1636-1651. doi: 10.1109/TPAMI.2022.3170353. Epub 2023 Jan 6.
Monocular 3D human pose estimation has made progress in recent years. Most methods focus on a single person and estimate the pose in person-centric coordinates, i.e., coordinates centered on the target person. Hence, these methods are inapplicable to multi-person 3D pose estimation, where absolute coordinates (e.g., the camera coordinates) are required. Moreover, multi-person pose estimation is more challenging than single-person pose estimation due to inter-person occlusion and close human interactions. Existing top-down multi-person methods rely on human detection and thus suffer from detection errors, so they cannot produce reliable pose estimates in multi-person scenes. Meanwhile, existing bottom-up methods do not use human detection and are therefore unaffected by detection errors, but because they process all persons in a scene at once, they are prone to errors, particularly for persons at small scales. To address these challenges, we propose integrating the top-down and bottom-up approaches to exploit their complementary strengths. Our top-down network estimates human joints for all persons in an image patch instead of only one, making it robust to possibly erroneous bounding boxes. Our bottom-up network incorporates human-detection-based normalized heatmaps, making it more robust to scale variations. Finally, the estimated 3D poses from the top-down and bottom-up networks are fed into our integration network to produce the final 3D poses. To address the common gap between training and testing data, we perform test-time optimization, refining the estimated 3D human poses using a high-order temporal constraint, a re-projection loss, and bone-length regularization. We also introduce a two-person pose discriminator that enforces natural two-person interactions. Lastly, we apply a semi-supervised method to overcome the scarcity of 3D ground-truth data. Our evaluations demonstrate the effectiveness of the proposed method and its individual components. Our code and pretrained models are available publicly: https://github.com/3dpose/3D-Multi-Person-Pose.
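To make the "human-detection-based normalized heatmap" idea for the bottom-up branch concrete, here is a minimal sketch, not taken from the linked repository: each joint's Gaussian target is widened or narrowed in proportion to its person's detected bounding-box height, so small and large persons yield comparably sized training targets. The function name, `base_sigma`, `ref_height`, and the max-merging convention are illustrative assumptions.

```python
# A minimal sketch (illustrative, not the authors' released code) of
# detection-normalized heatmaps: the Gaussian width per joint scales
# with the detected person height, mitigating scale variation.
import numpy as np

def normalized_heatmap(joints, box_heights, hw, base_sigma=2.0, ref_height=200.0):
    """joints: (N, 2) pixel coords; box_heights: (N,) detection heights.
    Returns one (H, W) heatmap with per-person scale-adaptive Gaussians."""
    H, W = hw
    ys, xs = np.mgrid[0:H, 0:W]               # pixel coordinate grids
    hm = np.zeros((H, W), dtype=np.float32)
    for (x, y), h in zip(joints, box_heights):
        sigma = base_sigma * h / ref_height   # assumed: sigma grows with person size
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        hm = np.maximum(hm, g)                # keep the strongest response per pixel
    return hm
```

Max-merging overlapping Gaussians is a common heatmap convention; it keeps each joint's peak intact when persons overlap, which matters in the close-interaction scenes the abstract targets.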
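The test-time optimization described above combines three differentiable terms, so it can be sketched directly. The following is a minimal sketch under assumed conventions, not the authors' released implementation: the bone pairs in `BONES`, the loss weights, and the choice of second-order finite differences as the "high-order temporal constraint" are all illustrative assumptions.

```python
# A minimal sketch (illustrative assumptions, not the authors' code) of
# test-time refinement: fit 3D poses to 2D observations via a
# reprojection loss, keep bone lengths stable across frames, and
# penalize acceleration as a high-order temporal constraint.
import torch

BONES = [(0, 1), (1, 2), (2, 3)]  # hypothetical parent-child joint pairs

def project(p3d, K):
    """Pinhole projection of (T, J, 3) camera-space joints with intrinsics K."""
    uv = p3d @ K.T                      # (T, J, 3) homogeneous image coords
    return uv[..., :2] / uv[..., 2:3]   # divide by depth

def refine(p3d_init, p2d_obs, K, steps=200, lr=1e-2, w_bone=1.0, w_temp=0.1):
    """Refine initial 3D poses (T >= 3 frames) against observed 2D joints."""
    p3d = p3d_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([p3d], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Reprojection loss: projected joints vs. observed 2D joints.
        loss = ((project(p3d, K) - p2d_obs) ** 2).mean()
        # Bone-length regularizer: lengths should not vary over time.
        lens = torch.stack([(p3d[:, a] - p3d[:, b]).norm(dim=-1)
                            for a, b in BONES], dim=-1)   # (T, num_bones)
        loss = loss + w_bone * lens.var(dim=0).mean()
        # Second-order temporal smoothness (finite-difference acceleration).
        accel = p3d[2:] - 2 * p3d[1:-1] + p3d[:-2]
        loss = loss + w_temp * (accel ** 2).mean()
        loss.backward()
        opt.step()
    return p3d.detach()
```

Penalizing second-order differences is one plausible reading of a "high-order temporal constraint": it discourages jittery motion without forcing the pose itself toward a fixed position, while the variance-based bone term keeps limb lengths consistent across the sequence.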