College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China.
GLAM - Group on Language, Audio, & Music, Imperial College London, UK.
Neural Netw. 2021 Sep;141:52-60. doi: 10.1016/j.neunet.2021.03.013. Epub 2021 Mar 23.
A challenging issue in the field of automatic speech emotion recognition is the efficient modelling of long temporal contexts; when long-term temporal dependencies between features must be captured, recurrent neural network (RNN) architectures are typically employed by default. In this work, we present an efficient deep neural network architecture incorporating the Connectionist Temporal Classification (CTC) loss for discrete speech emotion recognition (SER). We also demonstrate that further gains in SER performance are possible by exploiting the properties of convolutional neural networks (CNNs) when modelling contextual information. Our proposed model uses parallel convolutional layers (PCN) integrated with a Squeeze-and-Excitation network (SENet), a system herein denoted as PCNSE, to extract relationships from 3D spectrograms across time steps and frequencies; here, we use the log-Mel spectrogram with deltas and delta-deltas as input. In addition, a self-attention Residual Dilated Network (SADRN) with CTC is employed as the classification block for SER. To the best of the authors' knowledge, this is the first time such a hybrid architecture has been employed for discrete SER. We further demonstrate the effectiveness of our proposed approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and FAU-Aibo Emotion corpus (FAU-AEC) datasets. Our experimental results reveal that the proposed method is well suited to the task of discrete SER, achieving a weighted accuracy (WA) of 73.1% and an unweighted accuracy (UA) of 66.3% on IEMOCAP, as well as a UA of 41.1% on FAU-AEC.
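As a minimal NumPy sketch of the channel recalibration step that a Squeeze-and-Excitation block performs (the mechanism the PCNSE module builds on): a feature map is "squeezed" by global average pooling into one scalar per channel, passed through a two-layer bottleneck with a sigmoid, and the resulting per-channel weights rescale the original map. All shapes, the reduction ratio, and the random weights below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation over a (channels, time, freq) feature map.

    Squeeze: global average pooling collapses each channel to a scalar.
    Excitation: a bottleneck (ReLU, then sigmoid) yields per-channel
    weights in (0, 1) that rescale the original feature map.
    """
    z = x.mean(axis=(1, 2))          # squeeze: (C,)
    s = np.maximum(z @ w1, 0.0)      # reduce to (C/r,) with ReLU
    s = sigmoid(s @ w2)              # expand back to (C,)
    return x * s[:, None, None]      # channel-wise recalibration

# Illustrative shapes: 8 feature channels, a 4x4 time-frequency patch,
# reduction ratio r = 4 (all hypothetical values).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((8, 2))
w2 = rng.standard_normal((2, 8))
y = se_block(x, w1, w2)
print(y.shape)  # (8, 4, 4)
```

Because the excitation weights lie strictly in (0, 1), the block can only attenuate channels, never amplify them; learning which channels to suppress is what lets the network emphasise emotion-salient frequency bands.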