Person-Specific Gaze Estimation from Low-Quality Webcam Images.

Affiliations

Department of Applied Informatics, Silesian University of Technology, 44-100 Gliwice, Poland.

Faculty of Computer and Information Science, University of Ljubljana, Večna Pot 113, SI-1000 Ljubljana, Slovenia.

Publication information

Sensors (Basel). 2023 Apr 20;23(8):4138. doi: 10.3390/s23084138.

DOI:10.3390/s23084138

PMID:37112478

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC10147084/

Abstract

Gaze estimation is an established research problem in computer vision. It has many real-world applications, from human-computer interaction to health care and virtual reality, which makes it an attractive topic for the research community. Following the significant success of deep learning techniques in other computer vision tasks (for example, image classification, object detection, object segmentation, and object tracking), deep learning-based gaze estimation has also received more attention in recent years. This paper uses a convolutional neural network (CNN) for person-specific gaze estimation. Person-specific gaze estimation uses a single model trained for one individual user, in contrast to the commonly used generalized models trained on data from multiple people. We used only low-quality images collected directly from a standard desktop webcam, so our method can be applied to any computer system equipped with such a camera, with no additional hardware required. First, we used the webcam to collect a dataset of face and eye images. Then, we tested different combinations of CNN hyperparameters, including the learning rate and the dropout rate. Our findings show that a person-specific eye-tracking model with well-chosen hyperparameters produces better results than universal models trained on data from multiple users. In particular, our best results, measured as mean absolute error (MAE) in pixels, were 38.20 for the left eye, 36.01 for the right eye, 51.18 for both eyes combined, and 30.09 for the whole face, equivalent to approximately 1.45 degrees for the left eye, 1.37 degrees for the right eye, 1.98 degrees for both eyes combined, and 1.14 degrees for full-face images.
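To make the relationship between a pixel error and an angular error concrete, the following is a minimal sketch, not the authors' evaluation code. The abstract does not state how the MAE was aggregated or which screen geometry was used for the conversion, so the averaging over x and y coordinates, the function names, the sample points, the pixel pitch, and the viewing distance below are all assumptions chosen for illustration.

```python
import math


def mean_absolute_error(predicted, actual):
    """Mean absolute error over all x and y coordinates, in pixels.

    Assumption: the error is averaged over both coordinates of every
    sample; the paper may aggregate differently.
    """
    if not predicted or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for (px, py), (ax, ay) in zip(predicted, actual):
        total += abs(px - ax) + abs(py - ay)
    return total / (2 * len(predicted))


def pixels_to_degrees(error_px, pixel_pitch_mm, viewing_distance_mm):
    """Visual angle of an on-screen error centred on the line of sight."""
    error_mm = error_px * pixel_pitch_mm
    return math.degrees(2 * math.atan(error_mm / (2 * viewing_distance_mm)))


# Hypothetical gaze points (screen pixels); not data from the paper.
predicted = [(100.0, 200.0), (640.0, 360.0), (900.0, 50.0)]
actual = [(110.0, 190.0), (600.0, 400.0), (930.0, 80.0)]

mae_px = mean_absolute_error(predicted, actual)

# Hypothetical screen geometry: 0.25 mm per pixel, viewed from 600 mm.
mae_deg = pixels_to_degrees(mae_px, pixel_pitch_mm=0.25, viewing_distance_mm=600.0)

print(f"MAE: {mae_px:.2f} px, about {mae_deg:.2f} degrees")
```

The angular value depends strongly on the assumed pixel pitch and viewing distance, so reproducing the degrees reported in the abstract would require the actual monitor size, resolution, and user distance from the full paper.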
