

End-to-End Lip-Reading Open Cloud-Based Speech Architecture.

Affiliation

Center for Healthcare Robotics, Gwangju Institute of Science and Technology (GIST), School of Integrated Technology, Gwangju 61005, Korea.

Publication Information

Sensors (Basel). 2022 Apr 12;22(8):2938. doi: 10.3390/s22082938.

DOI: 10.3390/s22082938
PMID: 35458932
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9029225/
Abstract

Deep learning technology has encouraged research on noise-robust automatic speech recognition (ASR). The combination of cloud computing technologies and artificial intelligence has significantly improved the performance of open cloud-based speech recognition application programming interfaces (OCSR APIs). Noise-robust ASRs for application in different environments are being developed. This study proposes noise-robust OCSR APIs based on an end-to-end lip-reading architecture for practical applications in various environments. Several OCSR APIs, including Google, Microsoft, Amazon, and Naver, were evaluated using the Google Voice Command Dataset v2 to obtain the optimum performance. Based on performance, the Microsoft API was integrated with Google's trained word2vec model to enhance the keywords with more complete semantic information. The extracted word vector was integrated with the proposed lip-reading architecture for audio-visual speech recognition. Three forms of convolutional neural networks (3D CNN, 3D dense connection CNN, and multilayer 3D CNN) were used in the proposed lip-reading architecture. Vectors extracted from API and vision were classified after concatenation. The proposed architecture enhanced the OCSR API average accuracy rate by 14.42% using standard ASR evaluation measures along with the signal-to-noise ratio. The proposed model exhibits improved performance in various noise settings, increasing the dependability of OCSR APIs for practical applications.
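The fusion step the abstract describes (concatenating the word vector derived from the OCSR API transcript with the feature vector from the 3D-CNN lip-reading branch, then classifying the joint vector) can be sketched as follows. This is a minimal illustration, not the paper's actual model: the embedding size (300, conventional for word2vec), the visual feature size (512), the class count, and the linear classifier are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: 300-d word2vec embedding, 512-d visual feature,
# 35 keyword classes (roughly the Speech Commands v2 vocabulary).
AUDIO_DIM, VISUAL_DIM, NUM_CLASSES = 300, 512, 35

def fuse_and_classify(word_vec, visual_feat, W, b):
    """Late fusion: concatenate the API-side word vector with the
    lip-reading feature, then score the joint vector linearly."""
    joint = np.concatenate([word_vec, visual_feat])  # shape (812,)
    logits = W @ joint + b                           # shape (35,)
    return int(np.argmax(logits))

# Toy inputs standing in for real word2vec and 3D-CNN outputs.
word_vec = rng.standard_normal(AUDIO_DIM)
visual_feat = rng.standard_normal(VISUAL_DIM)
W = rng.standard_normal((NUM_CLASSES, AUDIO_DIM + VISUAL_DIM))
b = np.zeros(NUM_CLASSES)

pred = fuse_and_classify(word_vec, visual_feat, W, b)
assert 0 <= pred < NUM_CLASSES
```

In the paper's setting the visual feature would come from one of the three 3D CNN variants and the classifier would be trained, but the concatenate-then-classify structure is the same.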


Figures 1-13:
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e146/9029225/97539c90cd89/sensors-22-02938-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e146/9029225/50efc48b045f/sensors-22-02938-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e146/9029225/78d633acef74/sensors-22-02938-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e146/9029225/ceaf1283d789/sensors-22-02938-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e146/9029225/2408fa1742d9/sensors-22-02938-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e146/9029225/713af58bda12/sensors-22-02938-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e146/9029225/bfb1875424eb/sensors-22-02938-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e146/9029225/d21b21503422/sensors-22-02938-g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e146/9029225/c9cf1787abc6/sensors-22-02938-g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e146/9029225/ad8751d7c06e/sensors-22-02938-g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e146/9029225/0d06b34fd575/sensors-22-02938-g011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e146/9029225/9fedbf257d2c/sensors-22-02938-g012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e146/9029225/f9c77205be67/sensors-22-02938-g013.jpg

Similar Articles

1
End-to-End Lip-Reading Open Cloud-Based Speech Architecture.
Sensors (Basel). 2022 Apr 12;22(8):2938. doi: 10.3390/s22082938.
2
Accuracy of Cloud-Based Speech Recognition Open Application Programming Interface for Medical Terms of Korean.
J Korean Med Sci. 2022 May 9;37(18):e144. doi: 10.3346/jkms.2022.37.e144.
3
Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications.
Sensors (Basel). 2022 Oct 12;22(20):7738. doi: 10.3390/s22207738.
4
Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition.
Sensors (Basel). 2021 Dec 23;22(1):72. doi: 10.3390/s22010072.
5
Incorporating Noise Robustness in Speech Command Recognition by Noise Augmentation of Training Data.
Sensors (Basel). 2020 Apr 19;20(8):2326. doi: 10.3390/s20082326.
6
Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech.
Neural Netw. 2016 Jun;78:97-111. doi: 10.1016/j.neunet.2015.12.010. Epub 2015 Dec 30.
7
Discriminative analysis of lip motion features for speaker identification and speech-reading.
IEEE Trans Image Process. 2006 Oct;15(10):2879-91. doi: 10.1109/tip.2006.877528.
8
SNR-adaptive stream weighting for audio-MES ASR.
IEEE Trans Biomed Eng. 2008 Aug;55(8):2001-10. doi: 10.1109/TBME.2008.921094.
9
Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild.
Sensors (Basel). 2023 Feb 7;23(4):1834. doi: 10.3390/s23041834.
10
Transfer of Learning from Vision to Touch: A Hybrid Deep Convolutional Neural Network for Visuo-Tactile 3D Object Recognition.
Sensors (Basel). 2020 Dec 27;21(1):113. doi: 10.3390/s21010113.

Cited By

1
Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications.
Sensors (Basel). 2022 Oct 12;22(20):7738. doi: 10.3390/s22207738.

References

1
Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition.
Sensors (Basel). 2021 Dec 23;22(1):72. doi: 10.3390/s22010072.
2
3D convolutional neural networks for human action recognition.
IEEE Trans Pattern Anal Mach Intell. 2013 Jan;35(1):221-31. doi: 10.1109/TPAMI.2012.59.
3
The processing of audio-visual speech: empirical and neural bases.
Philos Trans R Soc Lond B Biol Sci. 2008 Mar 12;363(1493):1001-10. doi: 10.1098/rstb.2007.2155.
4
An audio-visual corpus for speech perception and automatic speech recognition.
J Acoust Soc Am. 2006 Nov;120(5 Pt 1):2421-4. doi: 10.1121/1.2229005.
5
Brain activity during audiovisual speech perception: an fMRI study of the McGurk effect.
Neuroreport. 2003 Jun 11;14(8):1129-33. doi: 10.1097/00001756-200306110-00006.
6
The contribution of fundamental frequency, amplitude envelope, and voicing duration cues to speechreading in normal-hearing subjects.
J Acoust Soc Am. 1985 Feb;77(2):671-7. doi: 10.1121/1.392335.
7
Single-channel vibrotactile supplements to visual perception of intonation and stress.
J Acoust Soc Am. 1989 Jan;85(1):397-405. doi: 10.1121/1.397690.
8
Hearing lips and seeing voices.
Nature. 1976;264(5588):746-8. doi: 10.1038/264746a0.
9
The role of vision in the perception of speech.
Perception. 1977;6(1):31-40. doi: 10.1068/p060031.