Wang Jingyi
School of Mass-communication and Advertising, Tongmyong University, Busan, Republic of Korea.
Front Neurosci. 2024 Aug 7;18:1449527. doi: 10.3389/fnins.2024.1449527. eCollection 2024.
Facial expression recognition (FER) plays a crucial role in affective computing, enhancing human-computer interaction by enabling machines to understand and respond to human emotions. Despite advances in deep learning, current FER systems often struggle with occlusions, head-pose variations, and motion blur in natural environments, challenges that highlight the need for more robust FER solutions. To address these issues, we propose the Attention-Enhanced Multi-Layer Transformer (AEMT) model, which integrates a dual-branch Convolutional Neural Network (CNN), an Attentional Selective Fusion (ASF) module, and a Multi-Layer Transformer Encoder (MTE) with transfer learning. The dual-branch CNN captures detailed texture and color information by processing RGB and Local Binary Pattern (LBP) features separately. The ASF module selectively enhances relevant features by applying global and local attention mechanisms to the extracted features. The MTE captures long-range dependencies and models the complex relationships between features; together, these components improve feature representation and classification accuracy. Our model was evaluated on the RAF-DB and AffectNet datasets. Experimental results demonstrate that the AEMT model achieved an accuracy of 81.45% on RAF-DB and 71.23% on AffectNet, significantly outperforming existing state-of-the-art methods. These results indicate that our model effectively addresses the challenges of FER in natural environments, advancing the field by improving the robustness and accuracy of emotion recognition in complex real-world scenarios. This work not only enhances the capabilities of affective computing systems but also opens new avenues for future research on improving model efficiency and expanding multimodal data integration.
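To make the described architecture concrete, the following is a minimal PyTorch sketch of the pipeline as the abstract presents it: a dual-branch CNN over RGB and LBP inputs, an ASF module that gates the fused features with global (channel-wise) and local (spatial) attention, and a multi-layer Transformer encoder over the resulting token grid. All layer sizes, the specific gating scheme, the 7x7 token grid, and the 7-class output head are illustrative assumptions rather than the authors' implementation, and the transfer-learning initialization mentioned in the abstract is omitted.

    # Minimal sketch of the AEMT pipeline described in the abstract.
    # All hyperparameters and module internals are illustrative assumptions.
    import torch
    import torch.nn as nn

    class DualBranchCNN(nn.Module):
        """Two small CNN branches: one for the RGB image, one for a
        precomputed single-channel LBP map."""
        def __init__(self, dim=256):
            super().__init__()
            def branch(in_ch):
                return nn.Sequential(
                    nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
                    nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(7),  # 7x7 spatial grid -> 49 tokens
                )
            self.rgb = branch(3)  # color and texture from raw pixels
            self.lbp = branch(1)  # texture from the LBP map
        def forward(self, rgb, lbp):
            return self.rgb(rgb), self.lbp(lbp)  # each: (B, dim, 7, 7)

    class AttentionalSelectiveFusion(nn.Module):
        """Fuses the two branches with global (channel) and local (spatial)
        attention gates; this particular gating scheme is an assumption."""
        def __init__(self, dim=256):
            super().__init__()
            self.global_gate = nn.Sequential(      # channel attention
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(2 * dim, dim, 1), nn.Sigmoid(),
            )
            self.local_gate = nn.Sequential(       # spatial attention
                nn.Conv2d(2 * dim, 1, 7, padding=3), nn.Sigmoid(),
            )
            self.proj = nn.Conv2d(2 * dim, dim, 1)
        def forward(self, a, b):
            x = torch.cat([a, b], dim=1)
            return self.proj(x) * self.global_gate(x) * self.local_gate(x)

    class AEMT(nn.Module):
        """Dual-branch CNN -> ASF fusion -> multi-layer Transformer encoder
        over the token grid -> 7-way expression classifier."""
        def __init__(self, dim=256, depth=4, heads=8, num_classes=7):
            super().__init__()
            self.backbone = DualBranchCNN(dim)
            self.asf = AttentionalSelectiveFusion(dim)
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, batch_first=True)
            self.mte = nn.TransformerEncoder(layer, num_layers=depth)
            self.head = nn.Linear(dim, num_classes)
        def forward(self, rgb, lbp):
            fused = self.asf(*self.backbone(rgb, lbp))  # (B, dim, 7, 7)
            tokens = fused.flatten(2).transpose(1, 2)   # (B, 49, dim)
            tokens = self.mte(tokens)                   # long-range deps
            return self.head(tokens.mean(dim=1))        # class logits

    # Random tensors stand in for an RGB image and its precomputed LBP map.
    logits = AEMT()(torch.randn(2, 3, 112, 112), torch.randn(2, 1, 112, 112))
    print(logits.shape)  # torch.Size([2, 7])

Flattening the fused 7x7 feature map into 49 tokens is one common way to combine the two components: the CNN front-end supplies local texture and color cues, while the Transformer encoder models long-range dependencies among facial regions, as the abstract attributes to the MTE.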