Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea.
Sensors (Basel). 2022 Apr 12;22(8):2938. doi: 10.3390/s22082938.
Deep learning has spurred research on noise-robust automatic speech recognition (ASR), and the combination of cloud computing and artificial intelligence has significantly improved the performance of open cloud-based speech recognition application programming interfaces (OCSR APIs). Noise-robust ASR systems for use in different environments are being developed. This study proposes a noise-robust OCSR API approach based on an end-to-end lip-reading architecture for practical applications in various environments. Several OCSR APIs, including those of Google, Microsoft, Amazon, and Naver, were evaluated on the Google Voice Command Dataset v2 to identify the best-performing API. Based on these results, the Microsoft API was combined with Google's trained word2vec model to enrich the recognized keywords with more complete semantic information. The extracted word vector was then integrated with the proposed lip-reading architecture for audio-visual speech recognition. The lip-reading architecture uses three forms of convolutional neural network (3D CNN, 3D dense connection CNN, and multilayer 3D CNN). The vectors extracted from the API and from the visual stream were concatenated and then classified. The proposed architecture improved the average accuracy of the OCSR API by 14.42%, as measured with standard ASR evaluation measures along with the signal-to-noise ratio. The proposed model performs better across various noise settings, increasing the dependability of OCSR APIs for practical applications.
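The fusion step described above, in which the API-derived word vector and the visual feature vector are concatenated and then classified, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the feature sizes (`WORD_DIM`, `VISUAL_DIM`), the number of classes, the random stand-in vectors, and the single linear layer with softmax are all assumptions chosen to keep the example self-contained.

```python
import math
import random


def concat_features(word_vec, visual_vec):
    """Fuse the two modalities by concatenation: [word vector | visual vector]."""
    return list(word_vec) + list(visual_vec)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, biases):
    """Apply one linear layer plus softmax; return (class index, probabilities)."""
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs


# Hypothetical sizes, not taken from the paper.
WORD_DIM, VISUAL_DIM, NUM_CLASSES = 300, 64, 10

rng = random.Random(0)
# Random stand-ins for the word2vec embedding of the recognized keyword
# and for the feature vector produced by the lip-reading network.
word_vec = [rng.gauss(0.0, 1.0) for _ in range(WORD_DIM)]
visual_vec = [rng.gauss(0.0, 1.0) for _ in range(VISUAL_DIM)]

# Untrained, randomly initialized classifier parameters (illustration only).
weights = [
    [rng.gauss(0.0, 0.01) for _ in range(WORD_DIM + VISUAL_DIM)]
    for _ in range(NUM_CLASSES)
]
biases = [0.0] * NUM_CLASSES

fused = concat_features(word_vec, visual_vec)
label, probs = classify(fused, weights, biases)
```

In a trained system the classifier parameters would be learned jointly with the visual network; the sketch only shows how the two vectors are joined into one input before classification.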