Suppr 超能文献

Core technology patent: CN118964589B
粤ICP备2023148730号-1 · Suppr © 2026


Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications.

Affiliation

Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea.

Publication Information

Sensors (Basel). 2022 Oct 12;22(20):7738. doi: 10.3390/s22207738.

DOI:10.3390/s22207738
PMID:36298089
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9609693/
Abstract

Speech is a commonly used interaction-recognition technique in edutainment-based systems and is a key technology for smooth educational learning and user-system interaction. However, its application to real environments is limited owing to the various noise disruptions in real environments. In this study, an audio and visual information-based multimode interaction system is proposed that enables virtual aquarium systems that use speech to interact to be robust to ambient noise. For audio-based speech recognition, a list of words recognized by a speech API is expressed as word vectors using a pretrained model. Meanwhile, vision-based speech recognition uses a composite end-to-end deep neural network. Subsequently, the vectors derived from the API and vision are classified after concatenation. The signal-to-noise ratio of the proposed system was determined based on data from four types of noise environments. Furthermore, it was tested for accuracy and efficiency against existing single-mode strategies for extracting visual features and audio speech recognition. Its average recognition rate was 91.42% when only speech was used, and improved by 6.7% to 98.12% when audio and visual information were combined. This method can be helpful in various real-world settings where speech recognition is regularly utilized, such as cafés, museums, music halls, and kiosks.

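The fusion step the abstract describes — expressing the speech API's recognized words as pretrained word vectors, embedding the lip region with a vision network, then concatenating the two vectors before classification — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the feature dimensions, the random stand-in inputs, and the linear softmax classifier are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper does not specify these exact dimensions.
AUDIO_DIM, VISUAL_DIM, NUM_CLASSES = 300, 512, 10

def fuse_and_classify(audio_vec, visual_vec, weights, bias):
    """Late fusion: concatenate the two modality vectors, then apply a
    linear classifier with a numerically stable softmax."""
    fused = np.concatenate([audio_vec, visual_vec])  # (AUDIO_DIM + VISUAL_DIM,)
    logits = weights @ fused + bias                  # (NUM_CLASSES,)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                           # class probabilities

# Stand-ins for the real features:
audio_vec = rng.standard_normal(AUDIO_DIM)    # word vector of the API's output
visual_vec = rng.standard_normal(VISUAL_DIM)  # lip-reading network embedding
W = rng.standard_normal((NUM_CLASSES, AUDIO_DIM + VISUAL_DIM)) * 0.01
b = np.zeros(NUM_CLASSES)

probs = fuse_and_classify(audio_vec, visual_vec, W, b)
print(probs.argmax(), probs.sum())
```

In practice `W` and `b` would be trained jointly with (or on top of) the two feature extractors; the point of the sketch is only the concatenate-then-classify structure of the late-fusion design.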

[Figures 1–15 and Appendix Figure A1 appeared here as image links; see the PMC record above.]

Similar Articles

1. Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications. Sensors (Basel). 2022 Oct 12;22(20):7738. doi: 10.3390/s22207738.
2. End-to-End Lip-Reading Open Cloud-Based Speech Architecture. Sensors (Basel). 2022 Apr 12;22(8):2938. doi: 10.3390/s22082938.
3. During Lipreading Training With Sentence Stimuli, Feedback Controls Learning and Generalization to Audiovisual Speech in Noise. Am J Audiol. 2022 Mar 3;31(1):57-77. doi: 10.1044/2021_AJA-21-00034. Epub 2021 Dec 29.
4. Vision-referential speech enhancement of an audio signal using mask information captured as visual data. J Acoust Soc Am. 2019 Jan;145(1):338. doi: 10.1121/1.5087563.
5. Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild. Sensors (Basel). 2023 Feb 7;23(4):1834. doi: 10.3390/s23041834.
6. A Hybrid Speech Enhancement Algorithm for Voice Assistance Application. Sensors (Basel). 2021 Oct 23;21(21):7025. doi: 10.3390/s21217025.
7. Incorporating Noise Robustness in Speech Command Recognition by Noise Augmentation of Training Data. Sensors (Basel). 2020 Apr 19;20(8):2326. doi: 10.3390/s20082326.
8. Robust audio-visual speech recognition under noisy audio-video conditions. IEEE Trans Cybern. 2014 Feb;44(2):175-84. doi: 10.1109/TCYB.2013.2250954.
9. Speech-In-Noise Comprehension is Improved When Viewing a Deep-Neural-Network-Generated Talking Face. Trends Hear. 2022 Jan-Dec;26:23312165221136934. doi: 10.1177/23312165221136934.
10. Correlation between audio-visual enhancement of speech in different noise environments and SNR: a combined behavioral and electrophysiological study. Neuroscience. 2013 Sep 5;247:145-51. doi: 10.1016/j.neuroscience.2013.05.007. Epub 2013 May 11.

Cited By

1. Audio-Visual Fusion Based on Interactive Attention for Person Verification. Sensors (Basel). 2023 Dec 15;23(24):9845. doi: 10.3390/s23249845.

References

1. Learning the Relative Dynamic Features for Word-Level Lipreading. Sensors (Basel). 2022 May 13;22(10):3732. doi: 10.3390/s22103732.
2. End-to-End Lip-Reading Open Cloud-Based Speech Architecture. Sensors (Basel). 2022 Apr 12;22(8):2938. doi: 10.3390/s22082938.
3. An Effective Conversion of Visemes to Words for High-Performance Automatic Lipreading. Sensors (Basel). 2021 Nov 26;21(23):7890. doi: 10.3390/s21237890.
4. 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell. 2013 Jan;35(1):221-31. doi: 10.1109/TPAMI.2012.59.
5. The processing of audio-visual speech: empirical and neural bases. Philos Trans R Soc Lond B Biol Sci. 2008 Mar 12;363(1493):1001-10. doi: 10.1098/rstb.2007.2155.
6. An audio-visual corpus for speech perception and automatic speech recognition. J Acoust Soc Am. 2006 Nov;120(5 Pt 1):2421-4. doi: 10.1121/1.2229005.
7. Brain activity during audiovisual speech perception: an fMRI study of the McGurk effect. Neuroreport. 2003 Jun 11;14(8):1129-33. doi: 10.1097/00001756-200306110-00006.
8. The contribution of fundamental frequency, amplitude envelope, and voicing duration cues to speechreading in normal-hearing subjects. J Acoust Soc Am. 1985 Feb;77(2):671-7. doi: 10.1121/1.392335.
9. Single-channel vibrotactile supplements to visual perception of intonation and stress. J Acoust Soc Am. 1989 Jan;85(1):397-405. doi: 10.1121/1.397690.
10. Hearing lips and seeing voices. Nature. 1976;264(5588):746-8. doi: 10.1038/264746a0.