The College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 201418, China.
Management School, Shanghai University of International Business and Economics, Shanghai 201620, China.
Sensors (Basel). 2022 Sep 25;22(19):7264. doi: 10.3390/s22197264.
High-performance, real-time pose detection and tracking will enable computers to develop a finer-grained and more natural understanding of human behavior. However, implementing real-time human pose estimation remains a challenge. On the one hand, tracking semantic keypoints in live video footage requires substantial computational resources and a large number of parameters, which limits the accuracy of pose estimation. On the other hand, some recently proposed transformer-based models achieve outstanding performance with far fewer parameters and FLOPs. However, the self-attention module in the transformer is not computationally friendly, which makes it difficult to apply these models to real-time tasks. To overcome these problems, we propose a transformer-like model named ShiftPose, which is a regression-based approach. ShiftPose does not contain any self-attention module. Instead, we replace the self-attention module with a parameter-free operation called the shift operator. Meanwhile, we adopt a bridge-branch connection, instead of a fully-branched connection as in HRNet, as our multi-resolution integration scheme. Specifically, each stage in the bottom half of our model adds the previous output to the output of the top-half stage with the corresponding resolution. Finally, the simple yet promising disentangled representation (SimDR) was used in our study to make the training process more stable. On the MPII dataset, the experimental results were 86.4 PCKh and 29.1 PCKh@0.1. On the COCO dataset, the results were 72.2 mAP and 91.5 AP50, at 255 fps on GPU, with 10.2M parameters and 1.6 GFLOPs. In addition, we tested our model on single-stage 3D human pose estimation and drew several useful and exploratory conclusions. These results show good performance, and this paper provides a new method for high-performance, real-time pose detection and tracking.
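The abstract describes the shift operator only as a parameter-free replacement for self-attention and does not define it. As a rough illustration, the sketch below assumes one common formulation of such an operator: a few channel groups are each shifted by one pixel in one of four spatial directions, vacated cells are zero-filled, and the remaining channels pass through unchanged. The names `shift_2d`, `shift_operator`, and `group_size` are hypothetical and are not taken from the paper; plain Python lists are used only to keep the example self-contained.

```python
from typing import List

Plane = List[List[float]]   # one channel, indexed as [row][col]
FeatureMap = List[Plane]    # indexed as [channel][row][col]


def shift_2d(plane: Plane, dy: int, dx: int) -> Plane:
    """Return a copy of `plane` shifted by (dy, dx); vacated cells become 0."""
    h, w = len(plane), len(plane[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx  # source cell that moves into (y, x)
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = plane[sy][sx]
    return out


def shift_operator(feat: FeatureMap, group_size: int = 1) -> FeatureMap:
    """Assumed parameter-free spatial shift (illustrative, not the paper's code).

    The first four groups of `group_size` channels are shifted one pixel
    right, left, down, and up respectively; all other channels are copied.
    No learnable weights are involved.
    """
    directions = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # (dy, dx) per group
    out: FeatureMap = []
    for c, plane in enumerate(feat):
        group = c // group_size
        if group < len(directions):
            dy, dx = directions[group]
            out.append(shift_2d(plane, dy, dx))
        else:
            out.append([row[:] for row in plane])  # untouched channel
    return out


if __name__ == "__main__":
    # Five identical 2x2 channels: four get shifted, the fifth is kept as-is.
    feat = [[[1, 2], [3, 4]] for _ in range(5)]
    for channel in shift_operator(feat):
        print(channel)
```

Because the operation only moves values between neighboring positions, it mixes spatial information across channels without any multiplications or learned parameters, which is the property the abstract contrasts with self-attention.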