Pan Jiahui, Fang Weijie, Zhang Zhihang, Chen Bingzhi, Zhang Zheng, Wang Shuihua
School of Software, South China Normal University, Guangzhou 510631, China.
Shenzhen Medical Biometrics Perception and Analysis Engineering Laboratory, Harbin Institute of Technology, Shenzhen 518055, China.
IEEE Open J Eng Med Biol. 2023 Jan 27;5:396-403. doi: 10.1109/OJEMB.2023.3240280. eCollection 2024.
As an essential human-machine interaction task, emotion recognition has become an active research area over the past decades. Although previous attempts to classify emotions have achieved high performance, several challenges remain open: 1) how to effectively recognize emotions using different modalities, and 2) given the increasing amount of computing power required for deep learning, how to provide real-time detection and improve the robustness of deep neural networks. In this paper, we propose a deep learning-based multimodal emotion recognition (MER) framework called Deep-Emotion, which adaptively integrates the most discriminative features from facial expressions, speech, and electroencephalogram (EEG) signals to improve MER performance. Specifically, the proposed Deep-Emotion framework consists of three branches: a facial branch, a speech branch, and an EEG branch. The facial branch uses the improved GhostNet neural network proposed in this paper for feature extraction, which effectively alleviates overfitting during training and improves classification accuracy compared with the original GhostNet. For the speech branch, this paper proposes a lightweight fully convolutional neural network (LFCNN) for the efficient extraction of speech emotion features. For the EEG branch, we propose a tree-like LSTM (tLSTM) model capable of fusing multi-stage features for EEG emotion feature extraction. Finally, we adopt a decision-level fusion strategy to integrate the recognition results of the three modalities, yielding more comprehensive and accurate performance. Extensive experiments on the CK+, EMO-DB, and MAHNOB-HCI datasets demonstrate the effectiveness of the proposed Deep-Emotion method and the feasibility and superiority of the MER approach.
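The abstract does not specify how the decision-level fusion is carried out. The snippet below is a minimal PyTorch sketch of one plausible scheme, a learnable convex combination of the per-branch softmax outputs; the class name `DecisionLevelFusion`, the weighting rule, and the stand-in branch outputs are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecisionLevelFusion(nn.Module):
    """Fuse per-modality class probabilities with learnable weights.

    Each branch (face / speech / EEG) is assumed to output logits over the
    same set of emotion classes; the fused prediction is a convex
    combination of the per-branch softmax distributions.
    """

    def __init__(self, num_modalities: int = 3):
        super().__init__()
        # One learnable weight per modality, normalized with softmax in forward().
        self.weights = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, branch_logits: list) -> torch.Tensor:
        # branch_logits: list of (batch, num_classes) tensors, one per branch.
        probs = torch.stack([F.softmax(l, dim=-1) for l in branch_logits])  # (M, B, C)
        w = F.softmax(self.weights, dim=0).view(-1, 1, 1)                   # (M, 1, 1)
        return (w * probs).sum(dim=0)                                       # (B, C)


if __name__ == "__main__":
    batch, num_classes = 4, 7  # e.g., seven expression categories as in CK+
    # Stand-ins for the facial (GhostNet), speech (LFCNN), and EEG (tLSTM) branch outputs.
    face_logits = torch.randn(batch, num_classes)
    speech_logits = torch.randn(batch, num_classes)
    eeg_logits = torch.randn(batch, num_classes)

    fusion = DecisionLevelFusion(num_modalities=3)
    fused = fusion([face_logits, speech_logits, eeg_logits])
    print(fused.shape, fused.sum(dim=-1))  # fused probabilities sum to 1 per sample
```

Under this reading, the fusion weights can be trained jointly with (or after) the three branches, and a fixed equal weighting recovers simple probability averaging as a special case.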