Bolin Lai, Miao Liu, Fiona Ryan, James M. Rehg
Georgia Institute of Technology, Atlanta, GA 30308 USA.
Meta AI, Menlo Park, CA 94025 USA.
Int J Comput Vis. 2024;132(3):854-871. doi: 10.1007/s11263-023-01879-7. Epub 2023 Oct 18.
Predicting human gaze from egocentric videos plays a critical role in understanding human intention during daily activities. In this paper, we present the first transformer-based model to address the challenging problem of egocentric gaze estimation. We observe that the connection between the global scene context and local visual information is vital for localizing the gaze fixation in egocentric video frames. To this end, we design the transformer encoder to embed the global context as one additional visual token and further propose a novel global-local correlation module to explicitly model the correlation between the global token and each local token. We validate our model on two egocentric video datasets - EGTEA Gaze+ and Ego4D. Our detailed ablation studies demonstrate the benefits of our method, and our approach exceeds the previous state-of-the-art model by a large margin. We also apply our model to a novel gaze saccade/fixation prediction task and to the traditional action recognition problem. The consistent gains suggest the strong generalization capability of our model. We further provide visualizations to support our claim that the global-local correlation serves as a key representation for predicting gaze fixation from egocentric videos. More details can be found on our website (https://bolinlai.github.io/GLC-EgoGazeEst).