
Gaze Estimation Network Based on Multi-Head Attention, Fusion, and Interaction

Authors

Li Changli, Li Fangfang, Zhang Kao, Chen Nenglun, Pan Zhigeng

Affiliation

School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing 210044, China.

Publication

Sensors (Basel). 2025 Mar 18;25(6):1893. doi: 10.3390/s25061893.

DOI: 10.3390/s25061893
PMID: 40293029
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11945386/
Abstract

Gaze is an externally observable indicator of human visual attention, and thus, recording the gaze position can help to solve many problems. Existing gaze estimation models typically utilize separate neural network branches to process data streams from both eyes and the face, failing to fully exploit their feature correlations. This study presents a gaze estimation network that integrates multi-head attention mechanisms, fusion, and interaction strategies to fuse facial features with eye features, as well as features from both eyes, separately. Specifically, multi-head attention and channel attention are used to fuse features from both eyes, and a face and eye interaction module is designed to highlight the most important facial features guided by the eye features; in addition, the channel attention in the Convolutional Block Attention Module (CBAM) is replaced with minimum pooling instead of maximum pooling, and a shortcut connection is added to enhance the network's attention to eye region details. Comparative experiments on three public datasets-Gaze360, MPIIFaceGaze, and EYEDIAP-validate the superiority of the proposed method.
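The abstract's modification of the CBAM channel-attention step (min pooling replacing max pooling, with an added shortcut connection) can be sketched roughly as follows. This is an illustrative NumPy re-implementation under stated assumptions, not the authors' code: the reduction ratio, layer sizes, and random stand-in weights are all hypothetical.

```python
import numpy as np

def channel_attention_min(x):
    """CBAM-style channel attention, but with min pooling in place of
    max pooling and an identity shortcut, as the abstract describes.
    This is a hypothetical sketch; weights are random stand-ins for
    learned parameters.

    x: feature map of shape (C, H, W).
    Returns the channel-reweighted map plus the shortcut connection.
    """
    C = x.shape[0]
    # Global average and global MIN pooling over the spatial dims.
    avg_pool = x.mean(axis=(1, 2))   # shape (C,)
    min_pool = x.min(axis=(1, 2))    # shape (C,)

    # Shared two-layer MLP with a reduction ratio (assumed 4 here).
    rng = np.random.default_rng(0)
    r = max(C // 4, 1)
    W1 = rng.standard_normal((r, C)) * 0.1
    W2 = rng.standard_normal((C, r)) * 0.1
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)  # ReLU hidden layer

    # Sigmoid over the summed pooled descriptors -> per-channel weights.
    w = 1.0 / (1.0 + np.exp(-(mlp(avg_pool) + mlp(min_pool))))

    # Reweight channels, then add the shortcut (identity) connection.
    return x * w[:, None, None] + x

feat = np.ones((8, 4, 4))
out = channel_attention_min(feat)
print(out.shape)  # (8, 4, 4)
```

The intuition for the swap, per the abstract, is that min pooling together with the shortcut sharpens the network's attention to fine eye-region detail; the rest of the block (shared MLP, sigmoid gating) follows the standard CBAM channel-attention pattern.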


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/7f9dd7a55c68/sensors-25-01893-g012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/c110bc69e01a/sensors-25-01893-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/77d00e9a5717/sensors-25-01893-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/a0f4813de4d5/sensors-25-01893-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/577e194123cb/sensors-25-01893-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/8ef509513027/sensors-25-01893-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/8c4abe17718b/sensors-25-01893-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/ff8878b766ae/sensors-25-01893-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/6e9f6db2f5fc/sensors-25-01893-g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/2dbbfdb0fc6b/sensors-25-01893-g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/6a46195c0ad2/sensors-25-01893-g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/c8185f287e27/sensors-25-01893-g011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/7f9dd7a55c68/sensors-25-01893-g012.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/c110bc69e01a/sensors-25-01893-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/77d00e9a5717/sensors-25-01893-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/a0f4813de4d5/sensors-25-01893-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/577e194123cb/sensors-25-01893-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/8ef509513027/sensors-25-01893-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/8c4abe17718b/sensors-25-01893-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/ff8878b766ae/sensors-25-01893-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/6e9f6db2f5fc/sensors-25-01893-g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/2dbbfdb0fc6b/sensors-25-01893-g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/6a46195c0ad2/sensors-25-01893-g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/c8185f287e27/sensors-25-01893-g011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ba97/11945386/7f9dd7a55c68/sensors-25-01893-g012.jpg

Similar Articles

1
Gaze Estimation Network Based on Multi-Head Attention, Fusion, and Interaction.
Sensors (Basel). 2025 Mar 18;25(6):1893. doi: 10.3390/s25061893.
2
FreeGaze: A Framework for 3D Gaze Estimation Using Appearance Cues from a Facial Video.
Sensors (Basel). 2023 Dec 4;23(23):9604. doi: 10.3390/s23239604.
3
Complementary effects of gaze direction and early saliency in guiding fixations during free viewing.
J Vis. 2014 Nov 4;14(13):3. doi: 10.1167/14.13.3.
4
Multiview Multitask Gaze Estimation With Deep Convolutional Neural Networks.
IEEE Trans Neural Netw Learn Syst. 2019 Oct;30(10):3010-3023. doi: 10.1109/TNNLS.2018.2865525. Epub 2018 Sep 3.
5
Gaze Estimation Approach Using Deep Differential Residual Network.
Sensors (Basel). 2022 Jul 21;22(14):5462. doi: 10.3390/s22145462.
6
Eyes always attract attention but gaze orienting is task-dependent: evidence from eye movement monitoring.
Neuropsychologia. 2007 Mar 14;45(5):1019-28. doi: 10.1016/j.neuropsychologia.2006.09.004. Epub 2006 Oct 24.
7
Gaze-in-wild: A dataset for studying eye and head coordination in everyday activities.
Sci Rep. 2020 Feb 13;10(1):2539. doi: 10.1038/s41598-020-59251-5.
8
A Gaze Estimation Method Based on Spatial and Channel Reconstructed ResNet Combined with Multi-Clue Fusion.
J Imaging. 2025 Mar 27;11(4):99. doi: 10.3390/jimaging11040099.
9
An integrated neural network model for eye-tracking during human-computer interaction.
Math Biosci Eng. 2023 Jun 21;20(8):13974-13988. doi: 10.3934/mbe.2023622.
10
Head-eye interactions during vertical gaze shifts made by rhesus monkeys.
Exp Brain Res. 2005 Dec;167(4):557-70. doi: 10.1007/s00221-005-0051-9. Epub 2005 Aug 13.

Cited By

1
Dual Focus-3D: A Hybrid Deep Learning Approach for Robust 3D Gaze Estimation.
Sensors (Basel). 2025 Jun 30;25(13):4086. doi: 10.3390/s25134086.

References

1
Spatio-Temporal Attention and Gaussian Processes for Personalized Video Gaze Estimation.
Conf Comput Vis Pattern Recognit Workshops. 2024 Jun;2024:604-614. doi: 10.1109/cvprw63382.2024.00065. Epub 2024 Sep 27.
2
Appearance-Based Gaze Estimation With Deep Learning: A Review and Benchmark.
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):7509-7528. doi: 10.1109/TPAMI.2024.3393571. Epub 2024 Nov 6.
3
Automatic Gaze Analysis: A Survey of Deep Learning Based Approaches.
IEEE Trans Pattern Anal Mach Intell. 2024 Jan;46(1):61-84. doi: 10.1109/TPAMI.2023.3321337. Epub 2023 Dec 5.
4
Appearance-Based Gaze Estimation for ASD Diagnosis.
IEEE Trans Cybern. 2022 Jul;52(7):6504-6517. doi: 10.1109/TCYB.2022.3165063. Epub 2022 Jul 4.
5
Towards High Performance Low Complexity Calibration in Appearance Based Gaze Estimation.
IEEE Trans Pattern Anal Mach Intell. 2023 Jan;45(1):1174-1188. doi: 10.1109/TPAMI.2022.3148386. Epub 2022 Dec 5.
6
MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation.
IEEE Trans Pattern Anal Mach Intell. 2019 Jan;41(1):162-175. doi: 10.1109/TPAMI.2017.2778103. Epub 2017 Nov 28.