Ma Chengming, Liu Qian, Dang Yaqi
College of Communication, Northwest Normal University, Lanzhou, China.
Front Psychol. 2021 Nov 8;12:769509. doi: 10.3389/fpsyg.2021.769509. eCollection 2021.
This paper provides an in-depth study and analysis of human artistic poses through intelligently enhanced multimodal artistic pose recognition. A complementary network model architecture for multimodal information based on motion energy is proposed. The network exploits both the rich appearance features provided by RGB data and the depth information provided by depth data, together with the latter's robustness to luminance and viewing angle. Multimodal fusion is accomplished through the complementary information characteristics of the two modalities. Moreover, to better model long-range temporal structure while accounting for action classes that share sub-actions, an energy-guided video segmentation method is employed. In the feature fusion stage, a cross-modal cross-fusion approach is proposed that enables the convolutional network not only to share local features of the two modalities in the shallow layers but also, by connecting the feature maps of multiple convolutional layers, to fuse global features in the deep convolutional layers. First, a Kinect camera is used to acquire the color image data of the human body, the depth image data, and the 3D coordinates of the skeletal points via the OpenPose open-source framework. Then, keyframes are automatically extracted based on the distance between the hand and the head; relative distance features are extracted from the keyframes to describe the action, local occupancy pattern features and HSV color space features are extracted to describe the object, and finally feature fusion is performed to complete the complex action recognition task.
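The hand-to-head distance rule for keyframe selection could be sketched as follows. This is a minimal illustration, not the paper's implementation: the local-minimum criterion and the `min_gap` de-duplication window are assumptions, and the joint coordinates are taken to be (T, 3) arrays such as Kinect/OpenPose skeletal points would provide.

```python
import numpy as np

def extract_keyframes(hand_xyz, head_xyz, min_gap=10):
    """Select frames where the hand-head distance reaches a local minimum,
    a simple proxy for distance-based keyframe extraction.

    hand_xyz, head_xyz: (T, 3) arrays of 3D skeletal-point coordinates.
    min_gap: minimum frame spacing, to suppress near-duplicate keyframes.
    """
    # Per-frame Euclidean distance between the two joints.
    dist = np.linalg.norm(hand_xyz - head_xyz, axis=1)
    keyframes = []
    for t in range(1, len(dist) - 1):
        # Local minimum of the distance curve marks a candidate keyframe.
        if dist[t] < dist[t - 1] and dist[t] <= dist[t + 1]:
            if not keyframes or t - keyframes[-1] >= min_gap:
                keyframes.append(t)
    return keyframes
```

Relative distance features for the action descriptor would then be computed only at the returned frame indices, rather than over the whole sequence.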
To solve the consistency problem of virtual-reality fusion, the mapping relationship between the hand joint coordinates and the virtual scene is determined in the augmented reality scene, and a coordinate consistency model between the natural hand and the virtual model is established. Finally, real-time interaction between hand gestures and the virtual model is realized; the average gesture recognition accuracy reaches 99.04%, improving the robustness and real-time performance of gesture recognition.
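One common way to establish such a camera-to-scene mapping is a least-squares affine fit over paired calibration points; the sketch below illustrates that general idea, not the paper's specific coordinate consistency model, and the calibration-pair setup is an assumption.

```python
import numpy as np

def fit_affine_map(cam_pts, scene_pts):
    """Fit an affine map from camera space to virtual-scene space.

    cam_pts, scene_pts: (N, 3) paired calibration points, N >= 4 and
    not coplanar. Solves scene ~= [cam, 1] @ A in the least-squares sense.
    """
    n = len(cam_pts)
    homo = np.hstack([cam_pts, np.ones((n, 1))])      # (N, 4) homogeneous coords
    A, *_ = np.linalg.lstsq(homo, scene_pts, rcond=None)  # (4, 3) affine matrix
    return A

def map_to_scene(A, hand_joint):
    """Map a single 3D hand joint position into the virtual scene."""
    return np.append(hand_joint, 1.0) @ A
```

With the fitted map, each tracked hand joint can be transformed per frame so the natural hand and the virtual model stay aligned during interaction.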