Mercier Julien, Ertz Olivier, Bocher Erwan
MEI, School of Engineering and Management Vaud, HES-SO, Switzerland.
Lab-STICC, UMR 6285, CNRS, Université Bretagne Sud, Vannes, France.
J Eye Mov Res. 2024 Apr 29;17(3). doi: 10.16910/jemr.17.3.3. eCollection 2024.
Mobile eye tracking captures egocentric vision and is well-suited for naturalistic studies. However, its data are noisy, especially when acquired outdoors with multiple participants over several sessions. Area-of-interest analysis on moving targets is difficult because (a) the camera and objects move nonlinearly and may disappear from and reappear in the scene, and (b) off-the-shelf analysis tools are limited to linearly moving objects. As a result, researchers resort to time-consuming manual annotation, which limits the use of mobile eye tracking in naturalistic studies. We introduce a method based on a fine-tuned Vision Transformer (ViT) model for classifying frames with overlaid gaze markers. After fine-tuning a model for three epochs on a manually labelled training set comprising 1.98% (7,845 frames) of our entire data, our model reached 99.34% accuracy as evaluated on hold-out data. We used the method to quantify participants' dwell time on a tablet during the outdoor user test of a mobile augmented reality application for biodiversity education. We discuss the benefits and limitations of our approach and its potential to be applied in other contexts.
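The pipeline the abstract describes can be sketched as a binary image-classification fine-tuning loop. The sketch below is a minimal illustration, not the authors' implementation: it assumes the HuggingFace `transformers` library, uses a deliberately shrunken randomly initialized ViT (the paper fine-tunes a full pretrained model, e.g. via `ViTForImageClassification.from_pretrained`), and stands in random tensors for the labelled scene-camera frames.

```python
# Hypothetical sketch of the frame-classification setup: fine-tune a Vision
# Transformer to label scene-camera frames as "gaze on the area of interest"
# vs. "not". Model size, hyperparameters, and data here are illustrative
# assumptions, not the authors' exact configuration.
import torch
from transformers import ViTConfig, ViTForImageClassification

# Tiny ViT so the sketch runs quickly; replace with a pretrained checkpoint
# for real use.
config = ViTConfig(
    image_size=224,
    patch_size=32,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=128,
    num_labels=2,  # binary: gaze marker on the tablet or not
)
model = ViTForImageClassification(config)

# Stand-in batch: two video frames with overlaid gaze markers (random here;
# in the paper these are manually labelled frames from the recordings).
pixel_values = torch.randn(2, 3, 224, 224)
labels = torch.tensor([0, 1])

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):  # the paper reports fine-tuning for three epochs
    outputs = model(pixel_values=pixel_values, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: one class score pair per frame; dwell time then follows from
# counting consecutive positive frames at the known video frame rate.
model.eval()
with torch.no_grad():
    logits = model(pixel_values=pixel_values).logits
print(logits.shape)  # one (2-class) logit vector per input frame
```

In this framing, per-frame predictions are aggregated over the video timeline, so a participant's dwell time on the tablet is simply the number of positively classified frames divided by the frame rate.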