Song Hongchen, Zhang Long, Gao Meixian, Zhang Hengyuan, Hain Thomas, Shan Linlin
College of Computer and Information Engineering, Tianjin Normal University, Tianjin, 300387, China.
School of Computer Science, The University of Sheffield, Sheffield, UK.
Sci Rep. 2025 Jul 1;15(1):21607. doi: 10.1038/s41598-025-94727-2.
Extracting richer emotional representations from raw speech is one of the key approaches to improving the accuracy of Speech Emotion Recognition (SER). In recent years, there has been a trend toward utilizing self-supervised learning (SSL) for extracting SER features, owing to the exceptional performance of SSL in Automatic Speech Recognition (ASR). However, existing SSL methods are not sufficiently sensitive in capturing emotional information, making them less effective for SER tasks. To overcome this issue, this study proposes MS-EmoBoost, a novel strategy for enhancing self-supervised speech emotion representations. Specifically, MS-EmoBoost uses deep emotional information from Mel-frequency cepstral coefficients (MFCC) and spectrograms as guidance to enhance the emotional representation capabilities of self-supervised features. To evaluate the effectiveness of the proposed approach, we conduct comprehensive experiments on three benchmark speech emotion datasets: IEMOCAP, EMODB, and EMOVO. SER performance is measured by weighted accuracy (WA) and unweighted accuracy (UA). The experimental results show that our method successfully enhances the emotional representation capability of wav2vec 2.0 Base features, achieving competitive performance in SER tasks (IEMOCAP: WA 72.10%, UA 72.91%; EMODB: WA 92.45%, UA 92.62%; EMOVO: WA 86.88%, UA 87.51%), and proves effective for other self-supervised features as well.
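The abstract names three feature streams: MFCC, spectrograms, and wav2vec 2.0 Base representations. The sketch below shows one way these streams could be extracted for a single utterance, assuming librosa for the hand-crafted features and the Hugging Face facebook/wav2vec2-base checkpoint for the SSL features; the coefficient counts (40 MFCCs, 128 mel bins) are illustrative assumptions, and the MS-EmoBoost guidance module itself is not reproduced here.

```python
# Minimal sketch (not the authors' code): extracting the three feature streams
# referenced in the abstract -- MFCC, mel-spectrogram, and wav2vec 2.0 Base
# representations. Hyperparameters and the checkpoint name are assumptions;
# the MS-EmoBoost enhancement/fusion step is omitted.
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

def extract_features(wav_path: str, sr: int = 16000):
    # Load the waveform at 16 kHz, the sampling rate wav2vec 2.0 Base expects.
    y, _ = librosa.load(wav_path, sr=sr)

    # Hand-crafted spectral features, transposed to (frames, dims).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T               # assumed 40 coefficients
    melspec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128).T  # assumed 128 mel bins

    # Self-supervised features from the pretrained wav2vec 2.0 Base encoder.
    fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
    inputs = fe(y, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        ssl_feats = model(inputs.input_values).last_hidden_state       # (1, frames, 768)

    return mfcc, melspec, ssl_feats.squeeze(0).numpy()
```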