School of Computing, Henan University of Engineering, Zhengzhou, China.
Comput Intell Neurosci. 2021 Sep 25;2021:5585041. doi: 10.1155/2021/5585041. eCollection 2021.
Context, such as scenes and objects, plays an important role in video emotion recognition, and recognition accuracy can be further improved when context information is incorporated. Although previous research has considered context information, it often ignores the fact that different images may carry different emotional cues. To address the problem of emotion differences across modalities and across images, this paper proposes a hierarchical attention-based multimodal fusion network for video emotion recognition, which consists of a multimodal feature extraction module and a multimodal feature fusion module. The multimodal feature extraction module has three subnetworks that extract features from facial, scene, and global images. Each subnetwork consists of two branches: the first branch extracts the features of the modality, and the other branch generates an emotion score for each image. The features and emotion scores of all images in a modality are then aggregated to produce the emotion feature of that modality. The fusion module takes the multimodal features as input and generates an emotion score for each modality. Finally, the features and emotion scores of the multiple modalities are aggregated to produce the final emotion representation of the video. Experimental results show that the proposed method is effective on an emotion recognition dataset.
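The abstract does not give equations, but the two-level aggregation it describes (per-image scores weighting image features within a modality, then per-modality scores weighting modality features) can be sketched as score-weighted sums. The following is a minimal illustrative sketch, not the paper's implementation: softmax normalization of the scores, random toy features, and all names (`aggregate`, feature dimensions, number of frames) are assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def aggregate(features, scores):
    """Attention-style pooling: softmax-normalize the emotion scores
    and take the weighted sum of the feature vectors, (n, d) -> (d,)."""
    weights = softmax(scores)
    return weights @ features

rng = np.random.default_rng(0)

# Level 1: within each of the three modalities (face, scene, global),
# pool per-image features using per-image emotion scores.
modal_feats = []
for _ in range(3):                       # face, scene, global subnetworks
    img_feats = rng.normal(size=(5, 8))  # 5 sampled frames, 8-dim features
    img_scores = rng.normal(size=5)      # one emotion score per image
    modal_feats.append(aggregate(img_feats, img_scores))

# Level 2: pool the three modality features using per-modality scores.
modal_feats = np.stack(modal_feats)      # shape (3, 8)
modal_scores = rng.normal(size=3)        # one emotion score per modality
video_repr = aggregate(modal_feats, modal_scores)

print(video_repr.shape)  # (8,)
```

In the actual network the scores would be produced by learned branches and the pooling would run inside a differentiable model; the sketch only shows how the hierarchical score-weighted aggregation combines image-level and modality-level information.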