Improved Feature-Based Gaze Estimation Using Self-Attention Module and Synthetic Eye Images.

Author information

Department of Electronic Engineering, Kwangwoon University, Seoul 01897, Korea.

Graduate School of Smart Convergence, Kwangwoon University, Seoul 01897, Korea.

Publication information

Sensors (Basel). 2022 May 26;22(11):4026. doi: 10.3390/s22114026.

DOI:10.3390/s22114026

PMID:35684647

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC9183137/

Abstract

Gaze is a useful indicator because it can express a person's interest or intention and the condition of an object. Recent deep-learning methods are mainly appearance-based methods that estimate gaze by simple regression from entire face and eye images. However, these methods sometimes give unsatisfactory gaze estimates for low-resolution and noisy images obtained in unconstrained real-world settings (e.g., places with severe lighting changes). In this study, we propose a method that estimates gaze by detecting eye region landmarks from a single eye image, and we show that this approach is competitive with recent appearance-based methods. Our approach acquires rich information by extracting more landmarks and including the iris and eye edges, similar to existing feature-based methods. To acquire strong features even at low resolutions, we used the HRNet backbone network to learn representations of images at various resolutions. Furthermore, we used the self-attention module CBAM to obtain a refined feature map with better spatial information, which enhanced robustness to noisy inputs and yielded a landmark localization error of 3.18%, a 4% improvement over the existing error. The large number of landmarks acquired was used as input to a lightweight neural network that estimates the gaze. We conducted a within-dataset evaluation on MPIIGaze, which was collected in a natural environment, and achieved state-of-the-art performance of 4.32 degrees, a 6% improvement over the existing performance.
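
The 4.32 degrees reported above is an angular error. The abstract does not spell out how it is computed; a common convention, assumed here, is the angle between the predicted and ground-truth 3D gaze direction vectors, averaged over all samples. The function names below are hypothetical and are not taken from the paper. A minimal sketch under that assumption:

```python
import math


def angular_error_deg(predicted, target):
    """Angle in degrees between two 3D gaze direction vectors.

    Assumption: gaze is represented as a 3D direction vector; the
    paper's exact representation is not stated in the abstract.
    """
    dot = sum(p * t for p, t in zip(predicted, target))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_t = math.sqrt(sum(t * t for t in target))
    if norm_p == 0.0 or norm_t == 0.0:
        raise ValueError("gaze vectors must be non-zero")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cosine))


def mean_angular_error_deg(predictions, targets):
    """Average angular error over paired predictions and targets."""
    errors = [angular_error_deg(p, t) for p, t in zip(predictions, targets)]
    return sum(errors) / len(errors)
```

Both vectors are normalized inside the function, so callers do not need to pass unit vectors.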
