
CDGT: Constructing diverse graph transformers for emotion recognition from facial videos.

Affiliations

School of Computer Science and Engineering, South China University of Technology, Guangzhou 510641, China.

School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou 510665, China.

Publication information

Neural Netw. 2024 Nov;179:106573. doi: 10.1016/j.neunet.2024.106573. Epub 2024 Jul 25.

Abstract

Recognizing expressions from dynamic facial videos reveals more natural affective states of humans, but it is a more challenging task in real-world scenes due to pose variations, partial occlusions, and the subtle dynamic changes of emotion sequences. Existing transformer-based methods often rely on self-attention to model global relations among spatial or temporal features, and cannot focus well on the important expression-related locality structures in both the spatial and temporal features of in-the-wild expression videos. To this end, we incorporate diverse graph structures into transformers and propose CDGT, a method that constructs diverse graph transformers for efficient emotion recognition from in-the-wild videos. Specifically, our method contains a spatial dual-graph transformer and a temporal hyperbolic-graph transformer. The former deploys a dual-graph constrained attention to capture latent emotion-related graph geometry structures among local spatial tokens for efficient feature representation, especially for video frames with pose variations and partial occlusions. The latter adopts a hyperbolic-graph constrained self-attention that explores important temporal graph-structure information in hyperbolic space to model more subtle changes of dynamic emotion. Extensive experiments on in-the-wild video-based facial expression databases show that the proposed CDGT outperforms other state-of-the-art methods.
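The core idea shared by both branches, attention restricted by a graph structure so that each token attends only to emotion-relevant neighbours rather than globally, can be illustrated with a minimal sketch. The abstract does not specify CDGT's exact formulation, so the adjacency sets, token dimensions, and function names below are hypothetical; this shows only the generic mechanism of masking scaled dot-product self-attention with a graph:

```python
import math

def softmax(xs):
    """Numerically stable softmax; -inf scores yield zero weight."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def graph_constrained_attention(tokens, adjacency):
    """Self-attention in which token i may attend only to itself and its
    graph neighbours. tokens: list of d-dim vectors; adjacency: list of
    sets of neighbour indices (a hypothetical stand-in for the learned
    graph structures described in the abstract)."""
    d = len(tokens[0])
    out = []
    for i, q in enumerate(tokens):
        allowed = adjacency[i] | {i}
        scores = []
        for j, k in enumerate(tokens):
            if j in allowed:
                # scaled dot-product score for permitted edges
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
            else:
                # non-neighbours are masked out entirely
                scores.append(float("-inf"))
        weights = softmax(scores)
        out.append([sum(w * tokens[j][c] for j, w in enumerate(weights))
                    for c in range(d)])
    return out
```

In CDGT's spatial branch the graph would connect local spatial tokens within a frame, and in the temporal branch it would connect frame-level tokens along the sequence (with distances measured in hyperbolic space); the masking principle is the same in both cases.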

