A Hybrid Speech Enhancement Algorithm for Voice Assistance Application.

Affiliations

Department of Artificial Intelligence and Data Science, KPR Institute of Engineering and Technology, Coimbatore 641407, India.

Department of Computer Science and Engineering, KPR Institute of Engineering and Technology, Coimbatore 641407, India.

Publication information

Sensors (Basel). 2021 Oct 23;21(21):7025. doi: 10.3390/s21217025.

DOI:10.3390/s21217025

PMID:34770332

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8588137/

Abstract

Speech recognition technology has become increasingly common in recent years. Speech quality and intelligibility are critical for the convenience and accuracy of information transmission in speech recognition. Speech processing systems used to communicate or store speech are usually designed for environments without background noise. In real-world environments, however, interference in the form of background noise and channel noise drastically reduces the performance of speech recognition systems, resulting in imprecise information transfer and listener fatigue. Speech enhancement techniques aim to improve the performance of communication systems whose input or output signals are affected by noise. To ensure that the text produced from speech is correct, the external noise in the speech audio must be reduced. This is difficult because the speech may consist of single, continuous, or spontaneous words. Various typical speech enhancement algorithms for automatic speech recognition have gained considerable attention, but they work well only on simple and continuous audio signals. This study therefore proposes a hybridized speech recognition algorithm to improve speech recognition accuracy. Non-linear spectral subtraction, a well-known speech enhancement algorithm, is optimized with the Hidden Markov Model and tested with 6660 medical speech transcription audio files and 1440 Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) audio files. The performance of the proposed model is compared with that of various typical speech enhancement algorithms, such as the iterative signal enhancement algorithm, subspace-based speech enhancement, and non-linear spectral subtraction. The proposed cascaded hybrid algorithm achieved minimum word error rates of 9.5% and 7.6% for medical speech and RAVDESS speech, respectively. Cascading the speech enhancement and speech-to-text conversion architectures results in higher speech recognition accuracy. The evaluation results support incorporating the proposed method into real-time automatic speech recognition for medical applications, where the complexity of the terms involved is high.
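
The results above are reported as word error rate (WER). WER is conventionally defined as (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the recognized text into the reference transcript, and N is the number of words in the reference. The following is a minimal sketch of that conventional definition, not the authors' scoring code. The function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made for illustration; the abstract does not describe how transcripts were normalized before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Conventional WER: word-level edit distance divided by reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Hypothetical example: one substituted word in a nine-word reference.
reference = "the patient reports chest pain and shortness of breath"
hypothesis = "the patient reports chest pain in shortness of breath"
print(f"{word_error_rate(reference, hypothesis):.1%}")  # prints 11.1%
```

Under this definition, a WER of 9.5% corresponds to roughly one word-level error for every ten or eleven reference words. WER can exceed 100% when the hypothesis contains many inserted words.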
