Tao Huawei, Shan Shuai, Hu Ziyi, Zhu Chunhua, Ge Hongyi
Key Laboratory of Food Information Processing and Control, Ministry of Education, Henan University of Technology, Zhengzhou 450001, China.
Henan Engineering Laboratory of Grain IOT Technology, Henan University of Technology, Zhengzhou 450001, China.
Entropy (Basel). 2022 Dec 30;25(1):68. doi: 10.3390/e25010068.
The scarcity of labeled samples limits the development of speech emotion recognition (SER). Data augmentation is an effective way to address this sparsity, yet augmentation algorithms remain under-studied in SER. This paper analyzes the effectiveness of classical acoustic data augmentation methods for SER and, on that basis, proposes a strongly generalized SER model built on effective data augmentation. The model extracts emotional representations with a multi-channel feature extractor composed of multiple sub-networks: each sub-network is fed a kind of augmented data shown to improve SER performance, and the final representation is obtained by weighted fusion of the sub-networks' output feature maps. To make the model robust to unseen speakers, adversarial training is employed to generalize the emotion representations: a discriminator estimates the Wasserstein distance between the feature distributions of different speakers, forcing the feature extractor to learn speaker-invariant emotional representations. Experimental results on the IEMOCAP corpus show that the proposed method outperforms related SER algorithms by 2-9%, demonstrating its effectiveness.
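The abstract's pipeline (classical acoustic augmentation, weighted fusion of sub-network feature maps, and a Wasserstein critic over speaker feature distributions) can be illustrated with a minimal NumPy sketch. The paper provides no code, so everything here is an assumption: the function names (`add_noise`, `time_stretch`, `fuse`, `critic_loss`), the choice of augmentations, and the critic form are hypothetical stand-ins for the components the abstract names, not the authors' implementation.

```python
import numpy as np

def add_noise(signal, snr_db, rng=None):
    # Classical acoustic augmentation: additive Gaussian noise at a target SNR (dB).
    rng = rng or np.random.default_rng(0)
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)

def time_stretch(signal, rate):
    # Crude time stretch by linear-interpolation resampling (rate > 1 shortens).
    n_out = int(len(signal) / rate)
    idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(idx, np.arange(len(signal)), signal)

def fuse(feature_maps, weights):
    # Weighted fusion of the sub-networks' output feature maps
    # into a single emotional representation.
    w = np.asarray(weights, dtype=float) / np.sum(weights)
    return np.tensordot(w, np.stack(feature_maps), axes=1)

def critic_loss(feats_a, feats_b, critic):
    # Empirical Wasserstein-1 estimate in WGAN style: difference of critic
    # means over features from two speakers. The feature extractor would be
    # trained to minimize this gap (speaker-invariant representations).
    return np.mean([critic(f) for f in feats_a]) - np.mean([critic(f) for f in feats_b])
```

A 1-Lipschitz constraint on the critic (weight clipping or a gradient penalty) is required for the critic mean gap to approximate the Wasserstein distance; it is omitted here for brevity.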