Holistic-Guided Disentangled Learning With Cross-Video Semantics Mining for Concurrent First-Person and Third-Person Activity Recognition.

Publication information

IEEE Trans Neural Netw Learn Syst. 2024 Apr;35(4):5211-5225. doi: 10.1109/TNNLS.2022.3202835. Epub 2024 Apr 4.

Abstract

The popularity of wearable devices has increased the demand for research on first-person activity recognition. However, most current first-person activity datasets are built on the assumption that only human-object interaction (HOI) activities performed by the camera wearer are captured in the field of view. Since people live in complex environments, third-person activities performed by others are also likely to appear in addition to the first-person activities. Analyzing and recognizing these two types of activities occurring simultaneously in a scene is important for the camera wearer to understand the surrounding environment. To facilitate research on concurrent first- and third-person activity recognition (CFT-AR), we first created a new activity dataset, PolyU concurrent first- and third-person (CFT) Daily, which exhibits distinct properties and challenges compared with previous activity datasets. Since temporal asynchrony and an appearance gap usually exist between first- and third-person activities, it is crucial to learn robust representations from all activity-related spatio-temporal positions. We therefore explore both holistic scene-level and local instance-level (person-level) features to provide comprehensive and discriminative patterns for recognizing both first- and third-person activities. On the one hand, the holistic scene-level features are extracted by a 3-D convolutional neural network that is trained to mine shared and sample-unique semantics between video pairs via two well-designed attention-based modules and a self-knowledge distillation (SKD) strategy. On the other hand, we further leverage the extracted holistic features to guide the learning of instance-level features in a disentangled fashion, aiming to discover both spatially conspicuous patterns and temporally varied yet critical cues. Experimental results on the PolyU CFT Daily dataset validate that our method achieves state-of-the-art performance.
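
The abstract only outlines the dual-branch design, so the following is a minimal PyTorch-style sketch of that idea under stated assumptions: a small 3-D CNN stands in for the holistic scene-level backbone, person-level features are assumed to be pre-extracted (e.g., ROI features of detected people), and the holistic feature gates the person features and splits them into two disentangled parts. The attention-based cross-video semantics-mining modules and the SKD objective are training-time components and are not sketched here; all class names, shapes, and the fusion scheme are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the dual-branch design described in the abstract.
# Module names, shapes, and the fusion scheme are illustrative assumptions;
# this is NOT the authors' released implementation.
import torch
import torch.nn as nn


class HolisticBranch(nn.Module):
    """Scene-level branch: a tiny 3-D CNN standing in for the paper's backbone."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, clip):                  # clip: (B, 3, T, H, W)
        x = self.backbone(clip).flatten(1)    # (B, 64)
        return self.fc(x)                     # holistic scene feature: (B, out_dim)


class HolisticGuidedInstanceBranch(nn.Module):
    """Instance-level branch: person features re-weighted by the holistic feature,
    then split into two disentangled parts (spatial vs. temporal cues)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.gate = nn.Linear(feat_dim, feat_dim)
        self.spatial_head = nn.Linear(feat_dim, feat_dim // 2)
        self.temporal_head = nn.Linear(feat_dim, feat_dim // 2)

    def forward(self, person_feats, holistic):
        # person_feats: (B, N, feat_dim) features of N detected persons per clip
        guide = torch.sigmoid(self.gate(holistic)).unsqueeze(1)  # (B, 1, feat_dim)
        guided = person_feats * guide        # holistic-guided re-weighting
        pooled = guided.mean(dim=1)          # aggregate over persons
        return self.spatial_head(pooled), self.temporal_head(pooled)


class CFTRecognizer(nn.Module):
    """Joint classifier over first-person and third-person activity labels."""
    def __init__(self, num_fp_classes, num_tp_classes, feat_dim=256):
        super().__init__()
        self.holistic = HolisticBranch(feat_dim)
        self.instance = HolisticGuidedInstanceBranch(feat_dim)
        self.fp_cls = nn.Linear(2 * feat_dim, num_fp_classes)
        self.tp_cls = nn.Linear(2 * feat_dim, num_tp_classes)

    def forward(self, clip, person_feats):
        h = self.holistic(clip)
        s, t = self.instance(person_feats, h)
        fused = torch.cat([h, s, t], dim=1)  # (B, 2 * feat_dim)
        return self.fp_cls(fused), self.tp_cls(fused)


if __name__ == "__main__":
    # Dummy forward pass: 2 clips of 16 RGB frames, 4 detected persons per clip.
    model = CFTRecognizer(num_fp_classes=10, num_tp_classes=8)
    fp_logits, tp_logits = model(torch.randn(2, 3, 16, 112, 112),
                                 torch.randn(2, 4, 256))
    print(fp_logits.shape, tp_logits.shape)  # torch.Size([2, 10]) torch.Size([2, 8])
```

The separate first-person and third-person classification heads reflect the concurrent-recognition setting described in the abstract, where both activity types must be predicted from the same clip; in practice the two heads could also share parameters or be merged into a multi-label head.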
