Li Yujie, Chen Jiahui, Ma Jiaxin, Wang Xiwen, Zhang Wei
School of Artificial Intelligence, Guilin University of Electronic Technology, Guilin 541004, China.
Guangxi Colleges and Universities Key Laboratory of AI Algorithm Engineering, Guilin 541004, China.
Sensors (Basel). 2023 Jul 7;23(13):6226. doi: 10.3390/s23136226.
The direction of human gaze is an important indicator of human behavior, reflecting the level of attention and the cognitive state toward various visual stimuli in the environment. Convolutional neural networks have achieved good performance in gaze estimation tasks, but their limited global modeling capability makes it difficult to improve prediction performance further. In recent years, transformer models have been introduced to gaze estimation and have achieved state-of-the-art performance. However, their slicing-and-mapping mechanism for processing local image patches can compromise local spatial information. Moreover, a single down-sampling rate and fixed-size tokens are not suitable for the multiscale feature learning required in gaze estimation. To overcome these limitations, this study introduces the Swin Transformer to gaze estimation and designs two network architectures: a pure Swin Transformer gaze estimation model (SwinT-GE) and a hybrid gaze estimation model that combines convolutional structures with SwinT-GE (Res-Swin-GE). SwinT-GE uses the tiny version of the Swin Transformer for gaze estimation. Res-Swin-GE replaces the slicing-and-mapping mechanism of SwinT-GE with convolutional structures. Experimental results demonstrate that Res-Swin-GE significantly outperforms SwinT-GE, is strongly competitive on the MPIIFaceGaze dataset, and improves on existing state-of-the-art methods by 7.5% on the EyeDiap dataset.
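The abstract's key architectural move, replacing Swin's 4×4 patch slicing-and-mapping with a convolutional stem, can be illustrated with a short sketch. The code below is an assumption-laden illustration, not the published SwinT-GE/Res-Swin-GE implementation: the model name comes from the timm library, and the stem's layer sizes are chosen only to match Swin-Tiny's 56×56 grid of 96-dimensional tokens. It shows a plain Swin-Tiny gaze regressor and a hypothetical conv stem that emits the same token grid while preserving local spatial structure.

```python
# Minimal PyTorch sketch (not the authors' code): a Swin-Tiny backbone
# regressing 2-D gaze angles, plus a hypothetical convolutional stem
# illustrating the idea of replacing 4x4 patch slicing-and-mapping
# with convolutions. Layer choices are assumptions.
import torch
import torch.nn as nn
import timm


class SwinTGaze(nn.Module):
    """Pure Swin-Tiny gaze estimator (SwinT-GE-style)."""

    def __init__(self):
        super().__init__()
        # num_classes=2 swaps the classification head for a 2-D
        # regression head: (pitch, yaw) gaze angles.
        self.backbone = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=False, num_classes=2
        )

    def forward(self, face):        # face: (B, 3, 224, 224)
        return self.backbone(face)  # (B, 2)


class ConvStem(nn.Module):
    """Hypothetical conv stem: produces the same 56x56 grid of 96-dim
    tokens as Swin-Tiny's patch embedding, but with overlapping
    convolutions that preserve local spatial structure."""

    def __init__(self, embed_dim=96):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1),          # 224 -> 112
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, embed_dim, 3, stride=2, padding=1),  # 112 -> 56
            nn.BatchNorm2d(embed_dim),
        )

    def forward(self, x):                    # (B, 3, 224, 224)
        x = self.stem(x)                     # (B, 96, 56, 56)
        return x.flatten(2).transpose(1, 2)  # (B, 3136, 96) token sequence


if __name__ == "__main__":
    gaze = SwinTGaze()(torch.randn(1, 3, 224, 224))
    tokens = ConvStem()(torch.randn(1, 3, 224, 224))
    print(gaze.shape, tokens.shape)  # (1, 2) and (1, 3136, 96)
```

In a hybrid model of the kind the abstract describes, the token sequence from such a conv stem would feed the Swin Transformer stages in place of the output of the standard patch-embedding layer; the paper's actual Res-Swin-GE design may wire this differently.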