College of Electronic and Information Engineering, Shandong University of Science and Technology, Qingdao, 266590, Shandong, China.
Sci Rep. 2022 Oct 24;12(1):17812. doi: 10.1038/s41598-022-21636-z.
In sign language video, the hand region is small, the resolution is low, motion is fast, and the hands frequently occlude each other and blur. These factors strongly affect both the accuracy and the speed of sign language recognition and are major limits on its performance. To address them, this paper proposes an improved 3D-ResNet sign language recognition algorithm with enhanced hand features. The aim is to highlight the features of both hands, avoid the loss of useful information that occurs when relying on global features alone, and improve recognition accuracy.
The proposed method makes two improvements. First, the algorithm detects the left-hand and right-hand regions with an improved EfficientDet network, in which an improved Bi-FPN module and a dual channel and spatial attention module enhance the network's ability to detect small targets such as hands. Second, an improved residual module is used to modify the 3D-ResNet18 network that extracts sign language features. The global, left-hand and right-hand image sequences are processed in three separate branches for feature extraction, and the branch features are then fused. This strengthens attention to hand features, improves the representational power of the sign language features, and thereby raises recognition accuracy.
To verify the performance of the algorithm, a series of experiments is carried out on the CSL dataset. In the experiments on both the hand detection algorithm and the sign language recognition algorithm, performance indicators such as Top-N, mAP, FLOPs and Parm (parameter count) are used to find the optimal algorithm framework. The experimental results show that the Top1 recognition accuracy of the algorithm reaches 91.12%, more than 10% higher than that of the basic C3D, P3D and 3D-ResNet networks. Judged by Top-N, mAP, FLOPs, Parm and other indicators, the algorithm also outperforms several algorithms from the past three years, such as I3D+BLSTM, B3D ResNet and AM-ResC3D+RCNN. The results show that the proposed combination of a hand detection network with enhanced hand features and a three-dimensional convolutional neural network achieves higher sign language recognition accuracy.
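The Top-N indicator used in the experiments can be illustrated with a minimal sketch. It assumes the recognition network outputs one score per sign class for each video clip. The function name `top_n_accuracy` and the toy scores and labels are hypothetical and are not taken from the paper or from the CSL dataset.

```python
from typing import Sequence


def top_n_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   n: int = 1) -> float:
    """Fraction of samples whose true label is among the n highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Rank class indices by descending score and keep the best n.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:n]:
            hits += 1
    return hits / len(labels)


# Hypothetical scores for 4 clips over 3 sign classes (illustrative only).
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.3, 0.5],
]
labels = [0, 2, 0, 1]

top1 = top_n_accuracy(scores, labels, n=1)
top2 = top_n_accuracy(scores, labels, n=2)
```

Under this definition, Top1 counts a clip as correct only when the highest-scoring class is the true sign, so a reported Top1 accuracy of 91.12% would mean that about 91 of every 100 test clips are classified correctly on the first guess.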