Combining a parallel 2D CNN with a self-attention Dilated Residual Network for CTC-based discrete speech emotion recognition.

Author information

College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China.

GLAM - Group on Language, Audio, & Music, Imperial College London, UK.

Publication information

Neural Netw. 2021 Sep;141:52-60. doi: 10.1016/j.neunet.2021.03.013. Epub 2021 Mar 23.

DOI:10.1016/j.neunet.2021.03.013

PMID:33866302

Abstract

A challenging issue in automatic speech emotion recognition is the efficient modelling of long temporal contexts. To capture long-term temporal dependencies between features, recurrent neural network (RNN) architectures are typically employed by default. In this work, we present an efficient deep neural network architecture incorporating the Connectionist Temporal Classification (CTC) loss for discrete speech emotion recognition (SER). We also show that SER performance can be improved further by exploiting the properties of convolutional neural networks (CNNs) when modelling contextual information. Our proposed model uses parallel convolutional layers (PCN) integrated with a Squeeze-and-Excitation Network (SEnet), a system herein denoted as PCNSE, to extract relationships from 3D spectrograms across timesteps and frequencies; the input is the log-Mel spectrogram with its deltas and delta-deltas. In addition, a self-attention Dilated Residual Network (SADRN) with CTC is employed as the classification block for SER. To the best of the authors' knowledge, this is the first time that such a hybrid architecture has been employed for discrete SER. We demonstrate the effectiveness of the proposed approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus and the FAU-Aibo Emotion Corpus (FAU-AEC). Our experimental results show that the proposed method is well suited to discrete SER, achieving a weighted accuracy (WA) of 73.1% and an unweighted accuracy (UA) of 66.3% on IEMOCAP, as well as a UA of 41.1% on FAU-AEC.
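The two reported metrics can differ noticeably on class-imbalanced corpora. In SER work, WA usually denotes overall accuracy (each utterance counts equally) and UA the mean of per-class recalls (each emotion class counts equally). The sketch below assumes the abstract follows that convention; the function names and the toy labels are hypothetical and are not taken from the paper or its data.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall fraction of correctly classified utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class contributes equally."""
    total = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)   # correctly recognised utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / total[c] for c in total) / len(total)


# Made-up, imbalanced toy labels (illustration only, not IEMOCAP or FAU-AEC data).
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral", "neutral"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.2f}, UA = {ua:.2f}")
```

In this toy case the majority class is always recognised, which lifts WA, while the poorly recognised minority classes pull UA down. This is one reason results on imbalanced corpora are often reported with UA.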
