Zhang Jianqiang, Hou Jing, He Qiusheng, Yuan Zhengwei, Xue Hao
School of Electronic Information Engineering, Taiyuan University of Science and Technology, Taiyuan 030024, China.
College of Modern Urban Construction Industry, Tianjin Chengjian University, Tianjin 300384, China.
Sensors (Basel). 2024 Dec 20;24(24):8158. doi: 10.3390/s24248158.
Human pose estimation is an important research direction in computer vision that aims to accurately identify the positions and poses of human body keypoints from images or videos. However, multi-person pose estimation suffers from false and missed detections in dense crowds, and small targets remain difficult to detect. In this paper, we propose a Mamba-based human pose estimation method. First, we design a GMamba structure as the backbone network to extract human keypoints. A gating mechanism is introduced into the linear layer of Mamba, allowing the model to dynamically adjust its weights according to the input image and thus locate human keypoints more precisely. Second, GMamba as the backbone network effectively handles the long-sequence problem, whereas direct convolutional downsampling reduces selectivity over the information flow at different stages. We therefore use slice downsampling (SD) to reduce the resolution of the feature map to half the original size and then fuse local features from four different locations; this fusion of multi-channel information helps the model obtain rich pose information. Finally, we introduce an adaptive threshold focal loss (ATFL) that dynamically adjusts the weights of different keypoints, assigning higher weights to error-prone keypoints to strengthen the model's attention to these points. This effectively improves the accuracy of keypoint identification under occlusion, complex backgrounds, etc., and significantly improves the overall performance and anti-interference ability of pose estimation. Experimental results show that the proposed algorithm achieves an AP of 72.2 and an AP50 of 92.6 on the COCO 2017 validation set, an improvement of 1.1 on AP50 over a typical baseline algorithm. The proposed method effectively detects human body keypoints and provides stronger robustness and accuracy for human pose estimation in complex scenes.
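The gating mechanism added to Mamba's linear layer can be illustrated with a minimal NumPy sketch. The abstract does not specify the exact form of the gate, so this assumes a common gated-linear design in which a sigmoid gate computed from the same input modulates the main projection elementwise (all names and shapes here are hypothetical, not taken from the paper):

```python
import numpy as np

def gated_linear(x, w_main, w_gate):
    """Gated linear layer: the output of the main projection is
    modulated elementwise by a sigmoid gate computed from the same
    input, so the effective weights adapt to each input image
    (assumed reading of the GMamba gate, not the paper's exact form)."""
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # sigmoid gate in (0, 1)
    return (x @ w_main) * gate

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))        # batch of 2 tokens, feature dim 8
w_main = rng.standard_normal((8, 16))  # main projection weights
w_gate = rng.standard_normal((8, 16))  # gate projection weights
y = gated_linear(x, w_main, w_gate)    # shape (2, 16)
```

Because the gate is input-dependent, two different images produce different effective weightings of the same learned projection, which is the dynamic-adjustment behavior the abstract describes.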
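Slice downsampling as described (halving the resolution, then fusing local features from four locations) matches a space-to-depth rearrangement: the four pixel phases of a 2x2 grid are sliced out and stacked along the channel axis, so no information is discarded. A minimal sketch under that assumption:

```python
import numpy as np

def slice_downsample(x):
    """Slice downsampling (SD) sketch: halve the spatial resolution by
    slicing the four 2x2 pixel phases and concatenating them along the
    channel axis (space-to-depth); every input value is preserved."""
    c, h, w = x.shape                    # x: (C, H, W), H and W even
    assert h % 2 == 0 and w % 2 == 0
    parts = [x[:, 0::2, 0::2], x[:, 0::2, 1::2],
             x[:, 1::2, 0::2], x[:, 1::2, 1::2]]
    return np.concatenate(parts, axis=0)  # (4C, H/2, W/2)

feat = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)
out = slice_downsample(feat)             # shape (12, 2, 2)
```

Unlike strided convolution, which mixes and discards pixels according to fixed kernel weights, this rearrangement keeps all four local positions as separate channels, leaving the subsequent fusion step free to weight them selectively.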
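The adaptive threshold focal loss (ATFL) can be sketched as a focal-style loss in which the threshold separating "hard" from "easy" keypoints is derived from the batch itself rather than fixed. The paper's exact formulation is not given in the abstract, so this NumPy sketch assumes the batch-mean confidence as the adaptive threshold and a simple extra weight on hard keypoints:

```python
import numpy as np

def adaptive_threshold_focal_loss(pred, target, gamma=2.0, hard_scale=2.0):
    """ATFL sketch (assumed form): focal loss whose threshold adapts to
    the batch. Keypoints whose confidence in the true state falls below
    the batch-mean confidence are treated as error-prone and up-weighted."""
    p = np.clip(pred, 1e-6, 1 - 1e-6)
    pt = np.where(target == 1, p, 1 - p)   # confidence in the true state
    tau = pt.mean()                        # adaptive threshold from this batch
    focal = (1 - pt) ** gamma              # standard focal modulation
    weight = np.where(pt < tau, hard_scale * focal, focal)
    return -(weight * np.log(pt)).mean()

target = np.array([1.0, 0.0, 1.0, 0.0])
loss = adaptive_threshold_focal_loss(np.array([0.9, 0.1, 0.8, 0.3]), target)
```

The `hard_scale` factor and the batch-mean threshold are illustrative choices; the key idea is that the weight assigned to a keypoint depends on how it compares with the rest of the batch, so attention shifts to whichever keypoints are currently hardest (e.g., occluded ones).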