


Developing sequentially trained robust Punjabi speech recognition system under matched and mismatched conditions.

Authors

Bawa Puneet, Kadyan Virender, Tripathy Abinash, Singh Thipendra P

Affiliations

Centre of Excellence for Speech and Multimodal Laboratory, Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India.

Speech and Language Research Centre, School of Computer Science, University of Petroleum and Energy Studies (UPES), Energy Acres, Bidholi, Dehradun, Uttarakhand 248007 India.

Publication

Complex Intell Systems. 2023;9(1):1-23. doi: 10.1007/s40747-022-00651-7. Epub 2022 Jun 2.

DOI: 10.1007/s40747-022-00651-7
PMID: 35668730
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9160864/
Abstract

Developing a robust native-language ASR framework is very challenging and remains an active area of research: effective front-end and back-end approaches are needed to tackle environmental differences, large training complexity, and inter-speaker variability. In this paper, four front-end approaches are investigated to generate distinctive and robust feature vectors at different SNR values: mel-frequency cepstral coefficients (MFCC), Gammatone frequency cepstral coefficients (GFCC), relative spectral perceptual linear prediction (RASTA-PLP), and power-normalized cepstral coefficients (PNCC). Furthermore, to handle the large training-data complexity, parameter optimization is performed with sequence-discriminative training techniques: maximum mutual information (MMI), minimum phone error (MPE), boosted MMI (bMMI), and state-level minimum Bayes risk (sMBR), selecting optimal parameter values through lattice generation and learning-rate adjustment. In the proposed framework, four systems are tested by varying the feature-extraction approach (with or without speaker normalization of the test set via Vocal Tract Length Normalization, VTLN) and the classification strategy (with or without artificial extension of the training set). To compare system performance, matched (adult train and test, S1; child train and test, S2) and mismatched (adult train and child test, S3; adult + child train and child test, S4) systems are demonstrated on a large adult and a very small Punjabi clean-speech corpus. Gender-based in-domain data augmentation is then used to moderate acoustic and phonetic variation between adult and children's speech under mismatched conditions.
The experimental results show that an effective framework built on the PNCC + VTLN front-end with a TDNN-sMBR-based model and parameter optimization yields relative improvements (RI) of 40.18%, 47.51%, and 49.87% for the matched, mismatched, and gender-based in-domain augmented systems, respectively, under typical clean and noisy conditions.
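Evaluating features "at different SNR values", as the abstract describes, is typically done by mixing clean speech with noise scaled to a target signal-to-noise ratio. A minimal sketch of that mixing step, using NumPy only; the 1 kHz tone and white noise here are synthetic stand-ins, not the paper's Punjabi corpus:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`,
    then add it to `speech` (equal-length 1-D float arrays assumed)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Required noise power for the target SNR: SNR_dB = 10 * log10(Ps / Pn)
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

# Example: one second of a 1 kHz tone at 16 kHz, corrupted at 5 dB SNR
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 1000 * t)
rng = np.random.default_rng(0)
noisy = mix_at_snr(clean, rng.standard_normal(sr), snr_db=5.0)
```

The noisy signal would then be fed to whichever front-end (MFCC, GFCC, RASTA-PLP, or PNCC) is under test.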

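The relative improvement (RI) percentages reported in the abstract follow the standard definition over a baseline word error rate. A sketch of that formula; the WER values below are hypothetical, chosen only to illustrate the arithmetic, not taken from the paper:

```python
def relative_improvement(baseline_wer: float, system_wer: float) -> float:
    """Relative improvement (%) of a system over a baseline:
    RI = (WER_baseline - WER_system) / WER_baseline * 100."""
    return (baseline_wer - system_wer) / baseline_wer * 100.0

# Hypothetical example: reducing WER from 20.0% to 11.96% is ~40.2% RI.
ri = relative_improvement(20.0, 11.96)
```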

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2094/9160864/314b6a9e4bcc/40747_2022_651_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2094/9160864/e64ebb8e2d60/40747_2022_651_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2094/9160864/a2f32edba1b8/40747_2022_651_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2094/9160864/8da6afbdffc7/40747_2022_651_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2094/9160864/0f8f23655459/40747_2022_651_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2094/9160864/006f40468cc4/40747_2022_651_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2094/9160864/237957c2849d/40747_2022_651_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2094/9160864/fcda498ae941/40747_2022_651_Fig7_HTML.jpg

Similar Articles

1. Developing sequentially trained robust Punjabi speech recognition system under matched and mismatched conditions.
   Complex Intell Systems. 2023;9(1):1-23. doi: 10.1007/s40747-022-00651-7. Epub 2022 Jun 2.
2. Ensemble learning with speaker embeddings in multiple speech task stimuli for depression detection.
   Front Neurosci. 2023 Mar 23;17:1141621. doi: 10.3389/fnins.2023.1141621. eCollection 2023.
3. A Two-Level Speaker Identification System via Fusion of Heterogeneous Classifiers and Complementary Feature Cooperation.
   Sensors (Basel). 2021 Jul 28;21(15):5097. doi: 10.3390/s21155097.
4. Unsupervised speaker adaptation for speaker independent acoustic to articulatory speech inversion.
   J Acoust Soc Am. 2019 Jul;146(1):316. doi: 10.1121/1.5116130.
5. Two-Step Joint Optimization with Auxiliary Loss Function for Noise-Robust Speech Recognition.
   Sensors (Basel). 2022 Jul 19;22(14):5381. doi: 10.3390/s22145381.
6. A bio-inspired feature extraction for robust speech recognition.
   Springerplus. 2014 Nov 4;3:651. doi: 10.1186/2193-1801-3-651. eCollection 2014.
7. Stressed Speech Emotion Recognition Using Teager Energy and Spectral Feature Fusion with Feature Optimization.
   Comput Intell Neurosci. 2023 Oct 11;2023:5765760. doi: 10.1155/2023/5765760. eCollection 2023.
8. Audio Augmentation for Non-Native Children's Speech Recognition through Discriminative Learning.
   Entropy (Basel). 2022 Oct 19;24(10):1490. doi: 10.3390/e24101490.
9. Performance enhancement for audio-visual speaker identification using dynamic facial muscle model.
   Med Biol Eng Comput. 2006 Oct;44(10):919-30. doi: 10.1007/s11517-006-0106-5. Epub 2006 Sep 26.
10. Recognizing the message and the messenger: biomimetic spectral analysis for robust speech and speaker recognition.
    Int J Speech Technol. 2013;16(3):313-322. doi: 10.1007/s10772-012-9184-y. Epub 2012 Dec 18.