Peng Cheng, Li Bohao, Zou Kun, Zhang Bowen, Dai Genan, Tsoi Ah Chung
School of Computing, Zhongshan Institute, University of Electronic Science and Technology of China, Zhongshan 528402, China.
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610000, China.
Sensors (Basel). 2025 Jun 18;25(12):3815. doi: 10.3390/s25123815.
This paper addresses the central issue arising from the simultaneous detection and classification (SDAC) of facial expressions: balancing the competing demands of good global features for detection and fine features for accurate expression classification. It does so by replacing the feature extraction part of the "neck" network in the feature pyramid network of the You Only Look Once X (YOLOX) framework with a novel architecture involving three attention mechanisms (batch, channel, and neighborhood), which respectively explore the three input dimensions (batch, channel, and spatial). Correlations across a batch of images in each of the two incoming paths are first extracted by a self-attention mechanism in the batch dimension; the two paths are then fused to consolidate their information and split again into two separate paths. Information along the channel dimension is extracted using a generalized form of channel attention, adaptive graph channel attention, which gives each element of the incoming signal a weight adapted to that signal. The combination of these two paths, together with two skip connections from the input of the batch attention to the output of the adaptive channel attention, then passes into a residual network with neighborhood attention, which extracts fine features in the spatial dimension. This novel dual-path architecture is shown experimentally to achieve a better balance between the competing demands of an SDAC problem than other competing approaches. Ablation studies determine the relative importance of the three attention mechanisms. Competitive results are obtained on two non-aligned facial expression recognition datasets, RAF-DB and SFEW, when compared with other state-of-the-art methods.
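The batch-dimension self-attention mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each image's feature map has already been flattened to a vector, uses identity query/key/value projections instead of learned ones, and the function names `softmax` and `batch_self_attention` are hypothetical. It only shows the general idea of letting each image in a batch attend to every other image in the same batch.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def batch_self_attention(features):
    """Scaled dot-product self-attention across the batch dimension.

    features: list of B flattened feature vectors (one per image), each of
    length D. Assumption: queries, keys, and values are the raw features
    (no learned projections), which a real model would not do.
    Returns a list of B vectors of length D.
    """
    d = len(features[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in features:
        # Similarity of this image to every image in the batch.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in features]
        weights = softmax(scores)
        # Each output is a convex combination of all images' features.
        mixed = [sum(w * v[j] for w, v in zip(weights, features))
                 for j in range(d)]
        out.append(mixed)
    return out
```

Because the attention weights for each image sum to one, every output vector is a weighted average of the batch's feature vectors, which is how correlations across images in a batch can be mixed into each image's representation.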