Data augmentation and deep neural networks for the classification of Pakistani racial speakers recognition.

Authors

Amjad Ammar, Khan Lal, Chang Hsien-Tsung

Affiliations

Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan, Taiwan.

Bachelor Program in Artificial Intelligence, Chang Gung University, Taoyuan, Taiwan.

Publication information

PeerJ Comput Sci. 2022 Aug 3;8:e1053. doi: 10.7717/peerj-cs.1053. eCollection 2022.

DOI:10.7717/peerj-cs.1053

PMID:36091976

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9454772/

Abstract

Speech emotion recognition (SER) systems have become an important means of recognizing a person in several applications, including e-commerce, everyday interactions, law enforcement, and forensics. The efficiency of an SER system depends on the length of the audio samples used for testing and training. Although the different models suggested in this study obtained relatively high accuracy, SER efficiency is not yet optimal because the limited database leads to overfitting and skewed samples. The proposed approach therefore presents a data augmentation method that shifts the pitch, uses multiple window sizes, stretches the time, and adds white noise to the original audio. In addition, a deep model is evaluated to generate a new paradigm for SER. In the proposed system, data augmentation increased the limited amount of data available from the Pakistani racial speaker speech dataset. A seven-layer framework was employed because it provided the best accuracy compared to other multilayer approaches; seven-layer methods have also been used in existing works to achieve very high accuracy. The suggested system achieved 97.32% accuracy with a 0.032% loss at a 75%:25% splitting ratio. In addition, more than 500 augmented data samples were added. These results show that deep neural networks with data augmentation can enhance SER performance on the Pakistani racial speech dataset.
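
The augmentation steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`add_white_noise`, `change_speed`, `augment`), the 20 dB signal-to-noise ratio, and the speed rates are assumptions chosen for illustration. The speed change below is a naive linear-interpolation resampling that alters tempo and pitch together, so it is only a stand-in for the separate pitch shifting and time stretching the abstract describes. The multiple window sizes are not covered.

```python
import math
import random


def add_white_noise(signal, snr_db, rng):
    """Return a copy of `signal` with Gaussian white noise at the given SNR in dB."""
    power = sum(x * x for x in signal) / len(signal)
    noise_power = power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def change_speed(signal, rate):
    """Resample by linear interpolation.

    rate > 1 shortens the signal (faster, higher pitch);
    rate < 1 lengthens it (slower, lower pitch).
    """
    n_out = max(1, int(round(len(signal) / rate)))
    out = []
    for i in range(n_out):
        pos = i * rate
        lo = int(pos)
        if lo >= len(signal) - 1:
            # Past the last interpolable position: hold the final sample.
            out.append(signal[-1])
        else:
            frac = pos - lo
            out.append(signal[lo] * (1 - frac) + signal[lo + 1] * frac)
    return out


def augment(signal, rng, rates=(0.9, 1.1), snr_db=20.0):
    """Return one noisy variant plus one speed-changed variant per rate."""
    variants = [add_white_noise(signal, snr_db, rng)]
    variants += [change_speed(signal, r) for r in rates]
    return variants


# Toy usage: one second of a 220 Hz tone at an assumed 8 kHz sample rate.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]
augmented = augment(tone, random.Random(0))
```

Each original clip here yields three additional training samples. A real pipeline would more likely use an audio library that shifts pitch and stretches time independently.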
