

In the Eye of the Beholder: Gaze and Actions in First Person Video.

Author Information

Li Yin, Liu Miao, Rehg James M

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):6731-6747. doi: 10.1109/TPAMI.2021.3051319. Epub 2023 May 8.

Abstract

We address the task of jointly determining what a person is doing and where they are looking based on the analysis of video captured by a head-worn camera. To facilitate our research, we first introduce the EGTEA Gaze+ dataset. Our dataset comes with videos, gaze tracking data, hand masks and action annotations, thereby providing the most comprehensive benchmark for First Person Vision (FPV). Moving beyond the dataset, we propose a novel deep model for joint gaze estimation and action recognition in FPV. Our method describes the participant's gaze as a probabilistic variable and models its distribution using stochastic units in a deep network. We further sample from these stochastic units, generating an attention map to guide the aggregation of visual features for action recognition. Our method is evaluated on our EGTEA Gaze+ dataset and achieves a performance level that exceeds the state of the art by a significant margin. More importantly, we demonstrate that our model can be applied to the larger-scale FPV dataset EPIC-Kitchens, even without using gaze, offering new state-of-the-art results on FPV action recognition.
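The model described in the abstract lends itself to a compact sketch: treat gaze as a distribution over spatial locations, sample from it with a differentiable stochastic unit, and use the sampled map as soft attention when pooling visual features for action classification. The PyTorch snippet below is a minimal illustration under assumed names and sizes (the GazeAttentionActionHead module, a 1024-channel backbone feature map, 106 action classes, and Gumbel-Softmax as the reparameterized sampler); it is a sketch of the idea, not the authors' released implementation.

```python
# Minimal sketch: gaze modeled as a probabilistic spatial variable, sampled via a
# stochastic unit, then used as an attention map to aggregate features for action
# recognition. Module names, feature sizes, and the Gumbel-Softmax sampler are
# illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GazeAttentionActionHead(nn.Module):
    def __init__(self, feat_channels: int = 1024, num_actions: int = 106):
        super().__init__()
        # 1x1 conv predicts per-location logits for the gaze distribution.
        self.gaze_logits = nn.Conv2d(feat_channels, 1, kernel_size=1)
        self.classifier = nn.Linear(feat_channels, num_actions)

    def forward(self, feats: torch.Tensor, tau: float = 1.0):
        # feats: (B, C, H, W) visual features from a video backbone.
        b, c, h, w = feats.shape
        logits = self.gaze_logits(feats).view(b, h * w)           # (B, H*W)

        if self.training:
            # Stochastic unit: sample a soft attention map with Gumbel-Softmax
            # so the sampling step stays differentiable (reparameterization).
            attn = F.gumbel_softmax(logits, tau=tau, hard=False)  # (B, H*W)
        else:
            # At test time, use the expected (softmax) gaze distribution.
            attn = F.softmax(logits, dim=-1)

        gaze_map = attn.view(b, 1, h, w)                          # predicted gaze map
        # Attention-weighted aggregation of features for action recognition.
        pooled = (feats * gaze_map).sum(dim=(2, 3))               # (B, C)
        action_logits = self.classifier(pooled)
        return action_logits, gaze_map


if __name__ == "__main__":
    head = GazeAttentionActionHead()
    dummy = torch.randn(2, 1024, 7, 7)     # stand-in for backbone features
    logits, gaze = head(dummy)
    print(logits.shape, gaze.shape)        # (2, 106) and (2, 1, 7, 7)
```

In this sketch the sampled map can double as the gaze prediction (supervised against recorded gaze points on EGTEA Gaze+), while the action branch is trained from the attention-pooled features, which mirrors the joint gaze estimation and action recognition setup the abstract describes.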
