Li Jixiang, Peng Jianxin
IEEE J Biomed Health Inform. 2024 Oct;28(10):6054-6063. doi: 10.1109/JBHI.2024.3430310. Epub 2024 Oct 3.
Emotion is a complex physiological phenomenon, and a single modality may be insufficient for accurately determining human emotional states. This paper proposes an end-to-end multimodal emotion recognition method based on facial expressions and non-contact physiological signals. Facial expression features and remote photoplethysmography (rPPG) signals are extracted from facial video data, and a transformer-based cross-modal attention mechanism (TCMA) is used to learn the correlation between the two modalities. The results show that combining facial expressions with accurately recovered rPPG signals slightly improves emotion recognition accuracy. Performance improves further with TCMA, which achieves binary classification accuracies of 91.11% for valence and 90.00% for arousal. Additionally, in experiments on the whole dataset, using TCMA for modal fusion yields accuracy gains of 7.31% and 4.23% in the binary classification of valence and arousal, respectively, and of 5.36% in the four-class valence-arousal classification, compared with using the facial expression modality alone, which demonstrates the effectiveness and robustness of TCMA. This method makes multimodal emotion recognition from facial expressions and contactless physiological signals feasible in real-world settings.
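The abstract does not detail the TCMA architecture, so the following is only a minimal sketch of one plausible bidirectional cross-modal attention fusion between facial-expression and rPPG feature sequences, assuming PyTorch; all dimensions, module names, and the pooling/classification head are hypothetical illustrations, not the authors' implementation.

```python
# Illustrative sketch only: TCMA details are not given in the abstract.
# Feature dimensions, sequence lengths, and fusion choices are assumptions.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """One direction of cross-modal attention: queries come from one modality,
    keys/values from the other (a standard transformer cross-attention block)."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        x = self.norm1(query_feats + attended)      # residual + norm
        return self.norm2(x + self.ff(x))           # feed-forward sublayer


class TCMAFusionSketch(nn.Module):
    """Bidirectional cross-modal attention between facial-expression and rPPG
    feature sequences, followed by mean pooling and a classification head."""

    def __init__(self, dim: int = 128, num_classes: int = 2):
        super().__init__()
        self.face_to_rppg = CrossModalAttention(dim)
        self.rppg_to_face = CrossModalAttention(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, face_feats: torch.Tensor, rppg_feats: torch.Tensor) -> torch.Tensor:
        # face_feats: (B, T_face, dim) per-frame/clip expression features
        # rppg_feats: (B, T_rppg, dim) features from the recovered rPPG signal
        face_enh = self.face_to_rppg(face_feats, rppg_feats)
        rppg_enh = self.rppg_to_face(rppg_feats, face_feats)
        fused = torch.cat([face_enh.mean(dim=1), rppg_enh.mean(dim=1)], dim=-1)
        return self.classifier(fused)  # e.g., binary valence or arousal logits


# Example usage with random tensors standing in for the two modality encoders' outputs.
face = torch.randn(8, 30, 128)
rppg = torch.randn(8, 60, 128)
logits = TCMAFusionSketch()(face, rppg)
print(logits.shape)  # torch.Size([8, 2])
```

In this kind of fusion, each modality attends to the other so that, for instance, expression features can be re-weighted by physiological evidence before pooling; the actual query/key assignment and head used in the paper may differ.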