

In the Eye of the Beholder: Gaze and Actions in First Person Video.

Author Information

Li Yin, Liu Miao, Rehg James M

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):6731-6747. doi: 10.1109/TPAMI.2021.3051319. Epub 2023 May 8.

Abstract

We address the task of jointly determining what a person is doing and where they are looking based on the analysis of video captured by a head-worn camera. To facilitate our research, we first introduce the EGTEA Gaze+ dataset. Our dataset comes with videos, gaze tracking data, hand masks and action annotations, thereby providing the most comprehensive benchmark for First Person Vision (FPV). Moving beyond the dataset, we propose a novel deep model for joint gaze estimation and action recognition in FPV. Our method describes the participant's gaze as a probabilistic variable and models its distribution using stochastic units in a deep network. We further sample from these stochastic units, generating an attention map to guide the aggregation of visual features for action recognition. Our method is evaluated on our EGTEA Gaze+ dataset and achieves a performance level that exceeds the state of the art by a significant margin. More importantly, we demonstrate that our model can be applied to the larger-scale FPV dataset EPIC-Kitchens, even without using gaze, offering new state-of-the-art results on FPV action recognition.
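The model described in the abstract lends itself to a compact sketch: treat gaze as a distribution over spatial locations, sample from it with a differentiable stochastic unit, and use the sampled map as soft attention when pooling visual features for action classification. The PyTorch snippet below is a minimal illustration under assumed names and sizes (the GazeAttentionActionHead module, a 1024-channel backbone feature map, 106 action classes, and Gumbel-Softmax as the reparameterized sampler); it is a sketch of the idea, not the authors' released implementation.

```python
# Minimal sketch: gaze modeled as a probabilistic spatial variable, sampled via a
# stochastic unit, then used as an attention map to aggregate features for action
# recognition. Module names, feature sizes, and the Gumbel-Softmax sampler are
# illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GazeAttentionActionHead(nn.Module):
    def __init__(self, feat_channels: int = 1024, num_actions: int = 106):
        super().__init__()
        # 1x1 conv predicts per-location logits for the gaze distribution.
        self.gaze_logits = nn.Conv2d(feat_channels, 1, kernel_size=1)
        self.classifier = nn.Linear(feat_channels, num_actions)

    def forward(self, feats: torch.Tensor, tau: float = 1.0):
        # feats: (B, C, H, W) visual features from a video backbone.
        b, c, h, w = feats.shape
        logits = self.gaze_logits(feats).view(b, h * w)           # (B, H*W)

        if self.training:
            # Stochastic unit: sample a soft attention map with Gumbel-Softmax
            # so the sampling step stays differentiable (reparameterization).
            attn = F.gumbel_softmax(logits, tau=tau, hard=False)  # (B, H*W)
        else:
            # At test time, use the expected (softmax) gaze distribution.
            attn = F.softmax(logits, dim=-1)

        gaze_map = attn.view(b, 1, h, w)                          # predicted gaze map
        # Attention-weighted aggregation of features for action recognition.
        pooled = (feats * gaze_map).sum(dim=(2, 3))               # (B, C)
        action_logits = self.classifier(pooled)
        return action_logits, gaze_map


if __name__ == "__main__":
    head = GazeAttentionActionHead()
    dummy = torch.randn(2, 1024, 7, 7)     # stand-in for backbone features
    logits, gaze = head(dummy)
    print(logits.shape, gaze.shape)        # (2, 106) and (2, 1, 7, 7)
```

In this sketch the sampled map can double as the gaze prediction (supervised against recorded gaze points on EGTEA Gaze+), while the action branch is trained from the attention-pooled features, which mirrors the joint gaze estimation and action recognition setup the abstract describes.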
