IEEE Trans Image Process. 2022;31:2782-2795. doi: 10.1109/TIP.2022.3161081. Epub 2022 Apr 4.
Human detection and pose estimation are essential for understanding human activities in images and videos. Mainstream multi-human pose estimation methods take a top-down approach: human detection is performed first, and each detected person bounding box is then fed into a pose estimation network. This top-down approach suffers from early commitment to the initial detections; in crowded scenes and other cases with ambiguities or occlusions, flawed detections lead to pose estimation failures. We propose DetPoseNet, an end-to-end multi-human detection and pose estimation framework organized as a unified three-stage network. Our method consists of a coarse-pose proposal extraction sub-net, a coarse-pose-based proposal filtering module, and a multi-scale pose refinement sub-net. The coarse-pose proposal sub-net extracts whole-body bounding boxes and body keypoint proposals in a single shot. The coarse-pose filtering step, based on the person and keypoint proposals, effectively rules out unlikely detections and thereby improves subsequent processing. The pose refinement sub-net performs cascaded pose estimation on each refined proposal region. Multi-scale supervision and multi-scale regression are used in the pose refinement sub-net to strengthen context feature learning. A structure-aware loss and keypoint masking are applied to further improve the robustness of pose refinement. Our framework is flexible: most existing top-down pose estimators can serve as the pose refinement sub-net. Experiments on the COCO and OCHuman datasets demonstrate the effectiveness of the proposed framework. The proposed method is computationally efficient (5-6x speedup), estimating multi-person poses with refined bounding boxes in under a second.
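The abstract describes a three-stage pipeline (coarse proposal extraction, coarse-pose-based filtering, multi-scale refinement) but no interface. The sketch below illustrates one possible way such stages could be wired together; all class and function names (CoarseProposal, extract_coarse_proposals, filter_proposals, refine_pose), shapes, and thresholds are illustrative assumptions, with dummy stand-ins in place of the learned sub-nets, not the authors' implementation.

```python
"""Minimal structural sketch of a DetPoseNet-style three-stage pipeline.

Assumptions: names, shapes, and thresholds are placeholders; the learned
sub-nets are replaced by dummy stand-ins for illustration only.
"""
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class CoarseProposal:
    box: np.ndarray        # (4,) person box [x1, y1, x2, y2]
    box_score: float       # person detection confidence
    keypoints: np.ndarray  # (K, 3) coarse keypoints [x, y, confidence]


def extract_coarse_proposals(image: np.ndarray) -> List[CoarseProposal]:
    """Stage 1 (stand-in): single-shot whole-body boxes + coarse keypoints.

    A real implementation would run detection and keypoint heads over shared
    backbone features; here we return random dummy proposals.
    """
    h, w = image.shape[:2]
    rng = np.random.default_rng(0)
    proposals = []
    for _ in range(5):
        x1, y1 = rng.uniform(0, w / 2), rng.uniform(0, h / 2)
        box = np.array([x1, y1, x1 + w / 4, y1 + h / 4])
        kpts = np.column_stack([
            rng.uniform(x1, x1 + w / 4, size=17),   # x coordinates
            rng.uniform(y1, y1 + h / 4, size=17),   # y coordinates
            rng.uniform(0.0, 1.0, size=17),         # keypoint confidences
        ])
        proposals.append(CoarseProposal(box, float(rng.uniform(0, 1)), kpts))
    return proposals


def filter_proposals(proposals: List[CoarseProposal],
                     box_thresh: float = 0.3,
                     min_visible_kpts: int = 4,
                     kpt_conf_thresh: float = 0.2) -> List[CoarseProposal]:
    """Stage 2: coarse-pose-based filtering.

    Keep a detection only if its person score is reasonable AND enough coarse
    keypoints are confidently detected, ruling out unlikely boxes early.
    """
    kept = []
    for p in proposals:
        visible = int((p.keypoints[:, 2] > kpt_conf_thresh).sum())
        if p.box_score >= box_thresh and visible >= min_visible_kpts:
            kept.append(p)
    return kept


def refine_pose(image: np.ndarray, proposal: CoarseProposal) -> np.ndarray:
    """Stage 3 (stand-in): cascaded multi-scale refinement on the proposal
    region. Here the coarse keypoints are simply passed through."""
    return proposal.keypoints


if __name__ == "__main__":
    image = np.zeros((480, 640, 3), dtype=np.uint8)
    coarse = extract_coarse_proposals(image)
    kept = filter_proposals(coarse)
    poses = [refine_pose(image, p) for p in kept]
    print(f"{len(coarse)} coarse proposals -> {len(kept)} kept -> "
          f"{len(poses)} refined poses")
```

In this sketch, the filtering stage is the only piece with real logic: it mirrors the abstract's idea of using person and keypoint proposals jointly to discard unlikely detections before the more expensive refinement stage, which is also how the framework can accept an existing top-down pose estimator as the Stage 3 component.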