Du Yiming, Li Penghai, Cheng Longlong, Zhang Xuanwei, Li Mingji, Li Fengzhou
School of Integrated Circuit Science and Engineering, Tianjin University of Technology, Tianjin, China.
China Electronics Cloud Brain (Tianjin) Technology Co., Ltd., Tianjin, China.
Front Neurosci. 2024 Jan 10;17:1330077. doi: 10.3389/fnins.2023.1330077. eCollection 2023.
Multimodal emotion recognition has become a hot topic in human-computer interaction and intelligent healthcare. However, combining information from different human modalities for emotion computation remains challenging.
In this paper, we propose a three-dimensional convolutional recurrent neural network model (referred to as the 3FACRNN network) based on multimodal fusion and an attention mechanism. The 3FACRNN model consists of a visual network and an EEG network. The visual network is a cascaded convolutional neural network-temporal convolutional network (CNN-TCN). In the EEG network, a 3D feature construction module is added to integrate the frequency-band, spatial, and temporal information of the EEG signal, and band-attention and self-attention modules are added to the convolutional recurrent neural network (CRNN). The former explores the effect of different frequency bands on recognition performance, while the latter captures the intrinsic similarity between different EEG samples.
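The abstract does not give implementation details, so the following is only a minimal PyTorch-style sketch of such a two-stream design, not the authors' actual 3FACRNN code: the number of frequency bands (5), the 9 x 9 electrode map, the layer sizes, and the use of an LSTM for the recurrent part are all illustrative assumptions.

    # Illustrative sketch of a visual CNN-TCN branch and an EEG branch with
    # band attention, a CRNN, and self-attention pooling. All sizes assumed.
    import torch
    import torch.nn as nn

    class VisualCNNTCN(nn.Module):
        """Visual branch: a small 2D CNN applied per frame, followed by a
        temporal convolution (TCN-style) over the frame sequence."""
        def __init__(self, feat_dim=128, n_classes=2):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.tcn = nn.Sequential(
                nn.Conv1d(32, feat_dim, kernel_size=3, padding=2, dilation=2),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            self.head = nn.Linear(feat_dim, n_classes)

        def forward(self, frames):                 # frames: (B, T, 3, H, W)
            b, t = frames.shape[:2]
            x = self.cnn(frames.flatten(0, 1)).view(b, t, -1)   # (B, T, 32)
            feat = self.tcn(x.transpose(1, 2)).squeeze(-1)      # (B, feat_dim)
            return self.head(feat), feat

    class EEGBandAttentionCRNN(nn.Module):
        """EEG branch: 3D features (bands x electrode grid x time) weighted by
        a learned band attention, then a CNN encoder, an LSTM over time, and a
        self-attention pooling step."""
        def __init__(self, n_bands=5, feat_dim=128, n_classes=2):
            super().__init__()
            self.band_attn = nn.Parameter(torch.ones(n_bands))
            self.cnn = nn.Sequential(
                nn.Conv2d(n_bands, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.rnn = nn.LSTM(32, feat_dim, batch_first=True)
            self.attn = nn.Linear(feat_dim, 1)     # self-attention over time
            self.head = nn.Linear(feat_dim, n_classes)

        def forward(self, eeg):                    # eeg: (B, T, n_bands, 9, 9)
            b, t = eeg.shape[:2]
            w = torch.softmax(self.band_attn, dim=0).view(1, 1, -1, 1, 1)
            x = eeg * w                                         # band attention
            x = self.cnn(x.flatten(0, 1)).view(b, t, -1)        # (B, T, 32)
            h, _ = self.rnn(x)                                  # (B, T, feat_dim)
            a = torch.softmax(self.attn(h), dim=1)              # (B, T, 1)
            feat = (a * h).sum(dim=1)                           # (B, feat_dim)
            return self.head(feat), feat

Each branch returns both class logits and an intermediate feature vector, so the feature vectors can later be aligned across modalities.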
To investigate the effect of different frequency bands, we computed the average attention mask over all subjects for each frequency band. The distribution of the attention masks across frequency bands suggests that signals more relevant to human emotions may be concentrated in the high-frequency γ band (31-50 Hz). Finally, we use a multi-task loss function Lc to constrain the intermediate feature vectors of the visual and EEG modalities to approximate each other, with the aim of using knowledge from the visual modality to improve the performance of the EEG network. The mean recognition accuracy and standard deviation of the proposed method on the two multimodal emotion datasets were 96.75 ± 1.75 (arousal) and 96.86 ± 1.33 (valence) on DEAP, and 97.55 ± 1.51 (arousal) and 98.37 ± 1.07 (valence) on MAHNOB-HCI, better than those of state-of-the-art multimodal recognition approaches.
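The abstract does not specify the exact form of Lc. The sketch below assumes it combines the per-branch classification losses with a feature-alignment term (mean-squared error here) that pulls the EEG branch's intermediate features toward the visual branch's; lambda_align is an illustrative weighting hyperparameter, not a value from the paper.

    # Hedged sketch of a multi-task loss combining classification and
    # cross-modal feature alignment. Exact form and weights are assumptions.
    import torch
    import torch.nn.functional as F

    def multitask_loss(vis_logits, vis_feat, eeg_logits, eeg_feat, labels,
                       lambda_align=0.5):
        ce_vis = F.cross_entropy(vis_logits, labels)
        ce_eeg = F.cross_entropy(eeg_logits, labels)
        # Align EEG features with detached visual features, so visual
        # knowledge guides the EEG branch without being disturbed by it.
        align = F.mse_loss(eeg_feat, vis_feat.detach())
        return ce_vis + ce_eeg + lambda_align * align

Detaching the visual features makes the alignment term a one-way transfer of visual knowledge into the EEG branch; whether the paper uses a symmetric or one-way constraint is not stated in the abstract.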
The experimental results show that using the subjects' facial video frames and electroencephalogram (EEG) signals together as inputs to the emotion recognition network enhances the stability of the network and improves its recognition accuracy. In future work, we will explore sparse matrix methods and deeper convolutional networks to further improve the performance of multimodal emotion networks.