
Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications.

Affiliation

Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea.

Publication

Sensors (Basel). 2022 Oct 12;22(20):7738. doi: 10.3390/s22207738.

Abstract

Speech is a commonly used interaction-recognition technique in edutainment-based systems and is a key technology for smooth educational learning and user-system interaction. However, its application in real environments is limited owing to various noise disruptions. In this study, a multimodal interaction system based on audio and visual information is proposed that enables speech-driven virtual aquarium systems to be robust to ambient noise. For audio-based speech recognition, the list of words recognized by a speech API is expressed as word vectors using a pretrained model. Meanwhile, vision-based speech recognition uses a composite end-to-end deep neural network. The vectors derived from the API and from vision are then concatenated and classified. The signal-to-noise ratio of the proposed system was determined using data from four types of noise environments, and its accuracy and efficiency were tested against existing single-mode strategies for visual feature extraction and audio speech recognition. The average recognition rate was 91.42% when only speech was used and improved by 6.7 percentage points to 98.12% when audio and visual information were combined. This method can be helpful in real-world settings where speech recognition is regularly used, such as cafés, museums, music halls, and kiosks.
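The late-fusion step described in the abstract, in which the API-derived word vector is concatenated with the visual feature vector before classification, can be sketched as follows. The dimensions, the linear softmax classifier, and all names here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_classify(audio_vec, visual_vec, W, b):
    """Concatenate per-modality feature vectors, then apply a linear
    softmax classifier over the fused representation."""
    fused = np.concatenate([audio_vec, visual_vec])   # shape (A + V,)
    logits = W @ fused + b                            # shape (C,)
    exp = np.exp(logits - logits.max())               # stable softmax
    return exp / exp.sum()                            # class probabilities

# Hypothetical sizes: 300-dim word embedding, 128-dim visual feature,
# 10 command words in the vocabulary.
A, V, C = 300, 128, 10
audio_vec = rng.normal(size=A)    # stand-in for the pretrained word vector
visual_vec = rng.normal(size=V)   # stand-in for the visual network output
W = rng.normal(size=(C, A + V))
b = np.zeros(C)

probs = fuse_and_classify(audio_vec, visual_vec, W, b)
```

In practice the classifier weights would be trained jointly on fused features; the sketch only shows how the two modality vectors are combined at inference time.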


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7901/9609693/5aacbfe910c0/sensors-22-07738-g0A1.jpg
