Speech emotion recognition based on transfer learning from the FaceNet framework.

Affiliations

Northeast Normal University, Changchun, Jilin Province 130117, China.

College of Computing and Software Engineering, Kennesaw State University, Marietta, Georgia 30060, USA.

Publication information

J Acoust Soc Am. 2021 Feb;149(2):1338. doi: 10.1121/10.0003530.

DOI:10.1121/10.0003530

PMID:33639796

Abstract

Speech plays an important role in human-computer emotional interaction. FaceNet, a model used in face recognition, has achieved great success owing to its excellent feature extraction. In this study, we adopt the FaceNet model and improve it for speech emotion recognition. To apply the model to this task, speech signals are divided into segments at a given time interval, and each segment is transformed into a discrete waveform diagram and a spectrogram. The waveform and spectrogram are then fed separately into FaceNet for end-to-end training. Our empirical study shows that pretraining is effective on the spectrogram for FaceNet. Hence, we pretrain the network on the CASIA dataset and then fine-tune it on the IEMOCAP dataset with waveforms. Because the model reaches high accuracy on CASIA, possibly owing to that dataset's clean signals, pretraining on it is expected to yield the most transferable knowledge. Our preliminary experimental results show accuracies of 68.96% and 90% on the emotion benchmark datasets IEMOCAP and CASIA, respectively. Cross-training is then conducted on the dataset, and comprehensive experiments are performed. The experimental results indicate that the proposed approach outperforms state-of-the-art methods on the IEMOCAP dataset among single-modal approaches.
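The preprocessing step described in the abstract (cutting speech into fixed-interval segments, then deriving a spectrogram from each segment) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names, the 2-second segment length, the 256-sample Hann window, and the 128-sample hop are all assumptions chosen for illustration, since the abstract does not state these values.

```python
import numpy as np


def split_into_segments(signal, sample_rate, segment_seconds=2.0):
    """Cut a 1-D signal into equal, non-overlapping segments.

    segment_seconds is an assumed value; trailing samples that do not
    fill a whole segment are dropped.
    """
    signal = np.asarray(signal, dtype=float)
    seg_len = int(round(segment_seconds * sample_rate))
    if seg_len <= 0:
        raise ValueError("segment length must be positive")
    n_segments = len(signal) // seg_len
    return signal[: n_segments * seg_len].reshape(n_segments, seg_len)


def log_spectrogram(segment, frame_len=256, hop=128):
    """Log-magnitude short-time Fourier transform of one segment.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    frame_len and hop are assumed values.
    """
    segment = np.asarray(segment, dtype=float)
    if len(segment) < frame_len:
        raise ValueError("segment is shorter than one analysis frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(segment) - frame_len) // hop
    # Slice overlapping frames and apply the window to each one.
    frames = np.stack(
        [segment[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # log1p compresses the dynamic range and is defined at zero magnitude.
    return np.log1p(magnitude)


if __name__ == "__main__":
    sample_rate = 16000
    t = np.arange(5 * sample_rate) / sample_rate
    tone = np.sin(2 * np.pi * 1000 * t)  # synthetic stand-in for speech
    segments = split_into_segments(tone, sample_rate)
    spectrogram = log_spectrogram(segments[0])
    print(segments.shape, spectrogram.shape)
```

Each row of `segments` is the raw waveform for one segment, and `log_spectrogram` gives the matching time-frequency image. Either one would still need to be rendered or resized to whatever input shape the network expects, which the abstract does not specify.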

