IEEE Trans Pattern Anal Mach Intell. 2021 Nov;43(11):4125-4141. doi: 10.1109/TPAMI.2020.2991965. Epub 2021 Oct 1.
Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people's interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions. Our videos depict non-scripted daily activities, as recording started every time a participant entered their kitchen. Recording took place in four countries by participants belonging to ten different nationalities, resulting in highly diverse kitchen habits and cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these narrations. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits: seen and unseen kitchens. We introduce new baselines that highlight the multimodal nature of the dataset and the importance of explicit temporal modelling to discriminate fine-grained actions (e.g., 'closing a tap' from 'opening' it).