Faculty of Mechanical - Electrical and Computer Engineering, Van Lang University, Ho Chi Minh City, Vietnam.
Faculty of Information Technology, University of Finance-Marketing, Ho Chi Minh City, Vietnam.
PLoS One. 2024 Apr 26;19(4):e0302394. doi: 10.1371/journal.pone.0302394. eCollection 2024.
Digital speech recognition is a challenging problem that requires learning complex signal characteristics such as frequency, pitch, intensity, timbre, and melody, which traditional methods often struggle to recognize. This article introduces three solutions based on convolutional neural networks (CNN): 1D-CNN is designed to learn directly from the digital data, while 2DS-CNN and 2DM-CNN have more complex architectures that convert the raw waveform into images using the Fourier transform in order to learn essential features. Experimental results on four large data sets, each containing 30,000 samples, show that the three proposed models outperform well-known models such as GoogLeNet and AlexNet, with best accuracies of 95.87%, 99.65%, and 99.76%, respectively. With performance 5-10% higher than that of other models, the proposed solutions demonstrate the ability to learn features effectively and to improve recognition accuracy and speed, opening up potential for broad applications in virtual assistants, medical recording, and voice commands.
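The waveform-to-image step mentioned in the abstract can be illustrated with a generic short-time Fourier transform. The abstract does not give the preprocessing used for 2DS-CNN and 2DM-CNN, so this is only a minimal sketch under stated assumptions: the function name `spectrogram_image`, the frame length, the hop size, the Hann window, and the log scaling are all illustrative choices, not the authors' settings.

```python
import numpy as np

def spectrogram_image(waveform, frame_len=256, hop=128):
    """Turn a 1-D waveform into a 2-D log-magnitude spectrogram scaled to [0, 1].

    frame_len and hop are assumed values, not taken from the paper.
    """
    waveform = np.asarray(waveform, dtype=np.float64)
    # Pad very short signals so that at least one full frame exists.
    if waveform.size < frame_len:
        waveform = np.pad(waveform, (0, frame_len - waveform.size))
    n_frames = 1 + (waveform.size - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        waveform[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Fourier transform of each frame; keep magnitudes only.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Log compression, then min-max scaling to image-like values.
    log_mag = np.log1p(magnitude)
    span = log_mag.max() - log_mag.min()
    if span > 0:
        image = (log_mag - log_mag.min()) / span
    else:
        image = np.zeros_like(log_mag)
    # Rows are frequency bins, columns are time frames.
    return image.T

# Synthetic stand-in for a spoken-digit recording: one second of a 1 kHz tone.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
image = spectrogram_image(tone)
```

The resulting 2-D array could be fed to an image-style CNN, whereas a 1D-CNN in the sense of the abstract would consume `tone` directly.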