De Lope Javier, Graña Manuel
Department of Artificial Intelligence, Universidad Politécnica de Madrid (UPM), Madrid, Spain.
Computational Intelligence Group, University of the Basque Country (UPV), San Sebastian, Spain.
Int J Neural Syst. 2022 Jun;32(6):2250024. doi: 10.1142/S0129065722500241. Epub 2022 May 12.
In recent years, speech emotion recognition (SER) has emerged as one of the most active human-machine interaction research areas. Innovative electronic devices, services, and applications increasingly aim to assess the user's emotional state, either to issue alerts under predefined conditions or to adapt system responses to the user's emotions. Voice expression is a very rich and noninvasive source of information for emotion assessment. This paper presents a novel SER approach that is a hybrid of a time-distributed convolutional neural network (TD-CNN) and a long short-term memory (LSTM) network. Mel-frequency log-power spectrograms (MFLPSs) extracted from audio recordings are parsed by a sliding window that selects the input for the TD-CNN. The TD-CNN transforms the input image data into a sequence of high-level features that are fed to the LSTM, which carries out the overall signal interpretation. To reduce overfitting, the MFLPS representation allows innovative image data augmentation techniques that have no immediate equivalent on the original audio signal. Validation of the proposed hybrid architecture achieves an average recognition accuracy of 73.98% on the most widely used and most challenging publicly available database for SER benchmarking. A permutation test confirms that this result is significantly different from random classification ([Formula: see text]). The proposed architecture outperforms state-of-the-art deep learning models as well as conventional machine learning techniques evaluated on the same database when identifying the same number of emotions.
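The abstract's pipeline (MFLPS extraction, sliding-window segmentation, per-window TD-CNN features, LSTM integration) can be illustrated with the following minimal Python sketch using librosa and Keras. This is not the authors' implementation: the mel-band count, window length, hop, layer sizes, and the number of emotion classes are all illustrative assumptions.

```python
# A minimal sketch of the TD-CNN + LSTM pipeline described in the abstract,
# using librosa and Keras. All hyperparameters below (mel bands, window
# length, hop, layer sizes, number of classes) are illustrative assumptions,
# not the authors' published configuration.
import numpy as np
import librosa
from tensorflow.keras import layers, models

N_MELS = 64        # mel bands per spectrogram column (assumed)
WIN_FRAMES = 32    # spectrogram frames per sliding window (assumed)
HOP_FRAMES = 16    # hop between consecutive windows (assumed)
N_EMOTIONS = 4     # emotion classes to identify (assumed)

def mflps(path, sr=16000):
    """Mel-frequency log-power spectrogram (MFLPS) of an audio file."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N_MELS, power=2.0)
    return librosa.power_to_db(mel)            # shape: (N_MELS, n_frames)

def sliding_windows(spec):
    """Cut an MFLPS into the fixed-size windows fed to the TD-CNN."""
    wins = [spec[:, i:i + WIN_FRAMES]
            for i in range(0, spec.shape[1] - WIN_FRAMES + 1, HOP_FRAMES)]
    return np.stack(wins)[..., np.newaxis]     # (n_windows, N_MELS, WIN_FRAMES, 1)

def build_model():
    """Per-window CNN feature extractor followed by an LSTM classifier."""
    cnn = models.Sequential([
        layers.Conv2D(16, 3, activation="relu",
                      input_shape=(N_MELS, WIN_FRAMES, 1)),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),       # one feature vector per window
    ])
    model = models.Sequential([
        # The same CNN is applied to every window in the sequence.
        layers.TimeDistributed(cnn, input_shape=(None, N_MELS, WIN_FRAMES, 1)),
        layers.LSTM(64),                       # integrates the window sequence
        layers.Dense(N_EMOTIONS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Each recording thus becomes a variable-length sequence of spectrogram windows; the TimeDistributed wrapper applies the same CNN to every window, and the LSTM aggregates the resulting feature sequence into a single emotion prediction, mirroring the division of labor described in the abstract.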