Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT.

Affiliation

AuDIaS - Audio, Data Intelligence and Speech, Universidad Autónoma de Madrid, Madrid, Spain.

Publication information

PLoS One. 2018 Oct 10;13(10):e0205355. doi: 10.1371/journal.pone.0205355. eCollection 2018.

DOI:10.1371/journal.pone.0205355
PMID:30304055
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6179252/
Abstract

Speech analysis for Automatic Speech Recognition (ASR) systems typically starts with a Short-Time Fourier Transform (STFT), which implies selecting a fixed point in the time-frequency resolution trade-off. This approach, combined with a Mel-frequency scaled filterbank and a Discrete Cosine Transform, gives rise to the Mel-Frequency Cepstral Coefficients (MFCC), which have been the most common features in speech processing for the last few decades. These features were particularly well suited to the previous state of the art in ASR, Hidden Markov Models/Gaussian Mixture Models (HMM/GMM). In particular, they produced highly uncorrelated features of small dimensionality (typically 13 coefficients plus deltas and double deltas), which was very convenient for diagonal covariance GMMs, for dealing with the curse of dimensionality, and for the limited computing resources of a decade ago. Currently most ASR systems use Deep Neural Networks (DNN) instead of GMMs for modeling the acoustic features, which provides more flexibility in how the features are defined. In particular, acoustic features can be highly correlated and can be much larger in size because DNNs are very powerful at processing high-dimensionality inputs. Also, computing hardware has evolved to the point where computational cost in speech processing is a less relevant issue. In this context we have decided to revisit the problem of time-frequency resolution in speech analysis, and in particular to check whether multi-resolution speech analysis (both in time and frequency) can help improve acoustic modeling using DNNs. Our experiments start with several Kaldi baseline systems for the well-known TIMIT corpus and modify them by adding multi-resolution speech representations, obtained by concatenating spectra computed at different time-frequency resolutions and by concatenating post-processed and speaker-adapted features computed at different time-frequency resolutions.
Our experiments show that using a multi-resolution speech representation tends to improve over results obtained with the baseline single-resolution speech representation, which seems to confirm our main hypothesis. However, combining multi-resolution analysis with the highly post-processed and speaker-adapted features, which provide the best results in Kaldi for TIMIT, yields only very modest improvements.
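
The concatenation idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' Kaldi pipeline: the function names, the window lengths (10, 25, and 50 ms), the 10 ms hop, and the use of raw log-magnitude spectra without a Mel filterbank are all assumptions chosen for illustration. Each resolution is computed on the same frame grid so that the per-frame vectors can be stacked side by side.

```python
import numpy as np

def log_spectrogram(signal, sample_rate, win_ms, hop_ms=10.0):
    """Log-magnitude STFT using a Hamming window of win_ms milliseconds."""
    win = int(round(sample_rate * win_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    # Pad half a window on each side so that frame t is centred on sample
    # t * hop for every window length; this keeps resolutions time-aligned.
    padded = np.pad(signal, (win // 2, win - win // 2))
    n_frames = 1 + len(signal) // hop
    starts = hop * np.arange(n_frames)
    frames = np.stack([padded[s:s + win] for s in starts])
    spectrum = np.abs(np.fft.rfft(frames * np.hamming(win), axis=1))
    return np.log(spectrum + 1e-8)  # small floor avoids log(0)

def multi_resolution_features(signal, sample_rate,
                              win_ms_list=(10.0, 25.0, 50.0)):
    """Concatenate per-frame spectra computed with several window lengths.

    Short windows give fine time resolution and coarse frequency resolution;
    long windows give the opposite. The window lengths are hypothetical.
    """
    parts = [log_spectrogram(signal, sample_rate, w) for w in win_ms_list]
    return np.concatenate(parts, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    one_second = rng.standard_normal(16000)  # stand-in for 16 kHz speech
    features = multi_resolution_features(one_second, 16000)
    print(features.shape)
```

With a 16 kHz signal the three windows span 160, 400, and 800 samples, giving 81, 201, and 401 frequency bins per frame, so each frame vector has 683 dimensions. That is far larger than a 39-dimensional MFCC vector (13 coefficients plus deltas and double deltas), which is the kind of high-dimensional, correlated input the abstract argues a DNN can handle.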


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b3b0/6179252/967b5255f5e8/pone.0205355.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b3b0/6179252/761bde69929a/pone.0205355.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b3b0/6179252/fcd9b86dbc17/pone.0205355.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b3b0/6179252/a27b2fe0b848/pone.0205355.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b3b0/6179252/2d2620725865/pone.0205355.g005.jpg

Similar articles

1
Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT.
PLoS One. 2018 Oct 10;13(10):e0205355. doi: 10.1371/journal.pone.0205355. eCollection 2018.
2
Statistical modeling of speech Poincaré sections in combination of frequency analysis to improve speech recognition performance.
Chaos. 2010 Sep;20(3):033106. doi: 10.1063/1.3463722.
3
Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech.
Neural Netw. 2016 Jun;78:97-111. doi: 10.1016/j.neunet.2015.12.010. Epub 2015 Dec 30.
4
Predicting fundamental frequency from mel-frequency cepstral coefficients to enable speech reconstruction.
J Acoust Soc Am. 2005 Aug;118(2):1134-43. doi: 10.1121/1.1953269.
5
Phonetic variability constrained bottleneck features for joint speaker recognition and physical task stress detection.
J Acoust Soc Am. 2020 Nov;148(5):2912. doi: 10.1121/10.0002455.
6
Analysis and prediction of acoustic speech features from mel-frequency cepstral coefficients in distributed speech recognition architectures.
J Acoust Soc Am. 2008 Dec;124(6):3989-4000. doi: 10.1121/1.2997436.
7
Deep neural architectures for dialect classification with single frequency filtering and zero-time windowing feature representations.
J Acoust Soc Am. 2022 Feb;151(2):1077. doi: 10.1121/10.0009405.
8
Analysis of COVID-19 Resulting Cough Using Formants and Automatic Speech Recognition System.
J Voice. 2023 Nov;37(6):971.e9-971.e16. doi: 10.1016/j.jvoice.2021.05.015. Epub 2021 Jun 15.
9
Recognizing Whispered Speech Produced by an Individual with Surgically Reconstructed Larynx Using Articulatory Movement Data.
Workshop Speech Lang Process Assist Technol. 2016 Sep;2016:80-86. doi: 10.21437/SLPAT.2016-14.
10
An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition.
PLoS One. 2017 Aug 10;12(8):e0182580. doi: 10.1371/journal.pone.0182580. eCollection 2017.

Cited by

1
Neural representation of consciously seen and unseen information.
Sci Rep. 2025 Mar 6;15(1):7888. doi: 10.1038/s41598-025-92490-y.
2
Cognitive Computing-Based CDSS in Medical Practice.
Health Data Sci. 2021 Jul 22;2021:9819851. doi: 10.34133/2021/9819851. eCollection 2021.
3
Progress in diagnosis of bone metastasis of prostate cancer.
