College of Information Science and Technology, Nanjing Forestry University, Nanjing 100190, China.
Sensors (Basel). 2023 Dec 27;24(1):147. doi: 10.3390/s24010147.
Existing Transformer-based models have achieved impressive success in facial expression recognition (FER) by modeling the long-range relationships among facial muscle movements. However, pure Transformer-based models typically contain millions of parameters, which poses a challenge for deployment. Moreover, the Transformer's lack of inductive bias makes training from scratch on limited FER datasets difficult. To address these problems, we propose an effective and lightweight variant Transformer for FER called VaTFER. In VaTFER, we first construct action unit (AU) tokens from action-unit-based regions and their histogram of oriented gradients (HOG) features. We then present a novel spatial-channel feature relevance Transformer (SCFRT) module, which incorporates multilayer channel reduction self-attention (MLCRSA) and a dynamic learnable information extraction (DLIE) mechanism. MLCRSA models long-range dependencies among all tokens while decreasing the number of parameters, and DLIE alleviates the lack of inductive bias and improves the model's learning ability. Furthermore, we replace the vanilla multilayer perceptron (MLP) with an excitation module for accurate prediction. To further reduce computing and memory resources, we introduce a binary quantization mechanism, yielding a novel lightweight Transformer model called the variant binary Transformer for FER (VaBTFER). Extensive experiments on several commonly used facial expression datasets attest to the effectiveness of our methods.
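To illustrate the channel-reduction idea behind MLCRSA, the following is a minimal PyTorch sketch: queries, keys, and values are projected into a smaller channel space before attention, which shrinks the weight matrices relative to full-width self-attention. The class name, reduction ratio, and token sizes are illustrative assumptions, not the paper's exact MLCRSA configuration.

```python
import torch
import torch.nn as nn

class ChannelReductionSelfAttention(nn.Module):
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.red_dim = dim // reduction
        # Project queries, keys, and values into a reduced channel space,
        # so each projection matrix is 'reduction' times smaller than in full attention.
        self.q = nn.Linear(dim, self.red_dim)
        self.k = nn.Linear(dim, self.red_dim)
        self.v = nn.Linear(dim, self.red_dim)
        self.proj = nn.Linear(self.red_dim, dim)  # map back to the token width

    def forward(self, x):  # x: (batch, num_tokens, dim), e.g. AU tokens built from HOG features
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.red_dim ** 0.5, dim=-1)
        return self.proj(attn @ v)  # (batch, num_tokens, dim)

# Example: 17 hypothetical AU tokens of width 128 per face.
tokens = torch.randn(8, 17, 128)
out = ChannelReductionSelfAttention(dim=128)(tokens)
```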
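For the binary quantization step, a common scheme is sign-based 1-bit weights with a mean-magnitude scale and a straight-through estimator for gradients. The sketch below follows that standard recipe as an assumption; it is not necessarily the exact quantization rule used in VaBTFER, and the names BinarizeSTE and BinaryLinear are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    """Sign-based 1-bit weight quantization with a straight-through estimator."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w) * w.abs().mean()  # binary weights rescaled by a mean-magnitude factor

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()  # pass gradients only where |w| <= 1

class BinaryLinear(nn.Linear):
    """Linear layer whose weights are binarized on the fly during the forward pass."""
    def forward(self, x):
        return F.linear(x, BinarizeSTE.apply(self.weight), self.bias)

# Example: a binarized projection applied to the AU tokens from the previous sketch.
x = torch.randn(8, 17, 128)
y = BinaryLinear(128, 128)(x)
```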