Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University, Jinhua 321004, China.
Int J Environ Res Public Health. 2023 Jan 12;20(2):1400. doi: 10.3390/ijerph20021400.
Human emotion is an important indicator of mental state, e.g., satisfaction or stress, and recognizing or detecting emotion from different media is essential for sequence analysis and for applications such as mental health assessment, job stress estimation, and tourist satisfaction assessment. Emotion recognition based on computer vision techniques, which detects emotion from visual media (e.g., images or videos) of human behavior by exploiting its plentiful emotional cues, has been extensively investigated because of these significant applications. However, most existing models neglect inter-feature interaction and fuse features by simple concatenation, failing to capture the complementary gains between face and context information in video clips that are crucial for resolving emotion confusion and emotion misunderstanding. Accordingly, in this paper, to fully exploit the complementary information between face and context features, we present a novel cross-attention and hybrid feature weighting network for accurate emotion recognition from large-scale video clips. The proposed model consists of a dual-branch encoding (DBE) network, a hierarchical-attention encoding (HAE) network, and a deep fusion (DF) block. Specifically, the face and context encoding blocks in the DBE network generate their respective shallow features. The HAE network then uses a cross-attention (CA) block to capture the complementarity between facial expression features and their contexts via a cross-channel attention operation, and an element recalibration (ER) block revises the feature map of each channel by embedding global information. Moreover, an adaptive-attention (AA) block in the HAE network infers optimal feature fusion weights and obtains adaptive emotion features via a hybrid feature weighting operation.
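The abstract does not give the exact CA/ER formulations, so the following is only a minimal NumPy sketch of the general ideas it names: a cross-channel attention operation in which each face channel attends over context channels, and a squeeze-and-excitation-style per-channel gate as one plausible reading of "embedding global information" (both are assumptions, not the authors' implementation).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_channel_attention(face, context):
    """Cross-channel attention sketch: each face channel (query) attends
    over context channels (keys/values), so facial features are re-weighted
    by complementary scene cues; swap the arguments for the reverse direction.

    face, context: (C, H, W) feature maps from the two DBE branches.
    Returns: (C, H, W) context features aggregated per face channel.
    """
    C, H, W = face.shape
    q = face.reshape(C, -1)        # one query vector per face channel
    k = context.reshape(C, -1)     # one key vector per context channel
    v = context.reshape(C, -1)     # values
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]))  # (C, C) channel affinity
    return (attn @ v).reshape(C, H, W)

def element_recalibration(feat):
    """ER sketch (assumed form): gate each channel by a sigmoid of its
    globally average-pooled statistic, i.e. revise the channel's feature
    map using embedded global information."""
    gate = 1.0 / (1.0 + np.exp(-feat.mean(axis=(1, 2), keepdims=True)))
    return feat * gate
```

Because the gate lies in (0, 1), recalibration shrinks each channel in proportion to its global statistic rather than replacing it, which preserves the spatial layout of the original features.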
Finally, the DF block integrates these adaptive emotion features to predict an individual's emotional state. Extensive experiments on the CAER-S dataset demonstrate the effectiveness of our method and its potential for analyzing tourist reviews with video clips, estimating job stress levels from visual emotional evidence, and assessing mental health with visual media.
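The AA block's fusion weights are learned in the actual model; as an illustrative stand-in only, the sketch below scores each branch by its pooled activation energy and forms a convex combination, showing the hybrid-weighting idea of adapting the face/context balance per sample rather than concatenating.

```python
import numpy as np

def adaptive_fusion(face_feat, context_feat):
    """Hybrid feature weighting sketch (assumed scoring, not the paper's):
    derive per-branch fusion weights from the features themselves, then
    blend the branches with those weights.

    face_feat, context_feat: (D,) pooled emotion features.
    Returns: fused (D,) feature and the (2,) branch weights.
    """
    scores = np.array([np.linalg.norm(face_feat),
                       np.linalg.norm(context_feat)])
    e = np.exp(scores - scores.max())          # stable softmax over 2 branches
    w = e / e.sum()
    fused = w[0] * face_feat + w[1] * context_feat
    return fused, w
```

A sample dominated by a clear facial expression would thus lean on the face branch, while an occluded face would shift weight toward context, which is the complementary behavior simple concatenation cannot express.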