IEEE Trans Image Process. 2024;33:2514-2529. doi: 10.1109/TIP.2024.3378459. Epub 2024 Apr 1.
Convolutional neural networks (CNNs) have achieved significant improvements on the task of facial expression recognition. However, current training still suffers from inconsistent learning intensities among different layers, i.e., the feature representations in the shallow layers are not learned as sufficiently as those in the deep layers. To this end, this work proposes a contrastive learning framework that aligns the feature semantics of the shallow and deep layers, followed by an attention module that represents the multi-scale features in a weight-adaptive manner. The proposed algorithm has three main merits. First, the learning intensity of the shallow-layer features, defined as the magnitude of the backpropagation gradient, is enhanced by cross-layer contrastive learning. Second, the latent semantics of the shallow-layer and deep-layer features are explored and aligned during contrastive learning, so that the fine-grained characteristics of expressions can be taken into account in feature representation learning. Third, by integrating the multi-scale features from multiple layers with an attention module, our algorithm achieves state-of-the-art accuracies of 92.21%, 89.50%, and 62.82% on three in-the-wild expression databases (RAF-DB, FERPlus, and SFEW, respectively), and the second-best accuracy of 65.29% on AffectNet. Our code will be made publicly available.
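The abstract does not spell out the exact form of the cross-layer contrastive loss. A minimal InfoNCE-style sketch in NumPy, assuming shallow-layer and deep-layer features have already been projected to a common dimension (function name and shapes are illustrative, not the paper's):

```python
import numpy as np

def info_nce_cross_layer(shallow, deep, temperature=0.1):
    """InfoNCE-style loss aligning shallow- and deep-layer features.

    shallow, deep: (N, D) projected features from the same N images.
    The matching row in each array is the positive pair; all other
    rows act as negatives. Minimizing this loss pulls the shallow
    feature toward its deep counterpart, which also injects extra
    gradient (learning intensity) into the shallow layer.
    """
    # L2-normalize so the dot product is cosine similarity
    s = shallow / np.linalg.norm(shallow, axis=1, keepdims=True)
    d = deep / np.linalg.norm(deep, axis=1, keepdims=True)
    logits = s @ d.T / temperature                       # (N, N)
    logits -= logits.max(axis=1, keepdims=True)          # stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (matching pairs) as targets
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
# Perfectly aligned pairs should score a lower loss than random pairs
aligned = info_nce_cross_layer(feats, feats)
random_ = info_nce_cross_layer(feats, rng.normal(size=(8, 16)))
print(aligned < random_)
```

In practice the loss would be computed between intermediate CNN feature maps (after pooling and projection) rather than raw vectors, and backpropagated jointly with the classification loss.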
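The attention module's architecture is likewise not detailed in the abstract. One common way to realize weight-adaptive fusion of multi-scale features is a small scoring network followed by a softmax over scales; the sketch below assumes that design (the scoring MLP and all shapes are hypothetical):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(features, w, v):
    """Weight-adaptive fusion of multi-scale features.

    features: (L, D) -- one D-dim pooled feature per layer (L scales).
    w: (D, H), v: (H,) -- parameters of a small scoring MLP
    (hypothetical; the paper's module may differ).
    Returns the attention-weighted sum of the L features.
    """
    scores = np.tanh(features @ w) @ v   # (L,) one score per scale
    alpha = softmax(scores)              # adaptive weights, sum to 1
    return alpha @ features              # (D,) fused representation

rng = np.random.default_rng(1)
L, D, H = 3, 8, 4                        # 3 scales, toy dimensions
feats = rng.normal(size=(L, D))
fused = attention_fusion(feats, rng.normal(size=(D, H)),
                         rng.normal(size=(H,)))
print(fused.shape)
```

Because the weights are produced from the features themselves, the fusion can emphasize shallow scales for fine-grained expression cues and deep scales for holistic ones on a per-sample basis.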