IEEE Trans Pattern Anal Mach Intell. 2023 Mar;45(3):3848-3861. doi: 10.1109/TPAMI.2022.3183586. Epub 2023 Feb 3.
An integral part of video analysis and surveillance is temporal activity detection, i.e., simultaneously recognizing and localizing activities in long untrimmed videos. The most effective current methods are based on deep learning and typically perform very well when large-scale annotated videos are available for training. However, these methods are limited in real applications because videos of certain activity classes may be unavailable and data annotation is time-consuming. To address this challenging problem, we propose a novel task setting called zero-shot temporal activity detection (ZSTAD), in which activities never seen during training must still be detected. We design an end-to-end deep transferable network, TN-ZSTAD, as the architecture for this solution. On the one hand, the network uses an activity graph transformer to directly predict the set of activity instances that appear in a video, rather than producing many activity proposals in advance. On the other hand, it captures the common semantics of seen and unseen activities from their label embeddings, and it is optimized with an innovative loss function that jointly considers the classification property on seen activities and the transfer property on unseen activities. Experiments on the THUMOS'14, Charades, and ActivityNet datasets show promising performance in detecting unseen activities.
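The core zero-shot idea described above — scoring a video segment against label embeddings shared by seen and unseen classes — can be illustrated with a minimal sketch. This is a hypothetical toy example, not the paper's method: TN-ZSTAD learns the visual-semantic alignment end-to-end, whereas here we simply take the nearest label embedding by cosine similarity, and the class names, vectors, and helper functions are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(segment_feature, label_embeddings):
    """Assign a segment to the activity whose label embedding is closest
    in cosine similarity. Because unseen classes also have label
    embeddings, they can be predicted without any training videos."""
    scores = {name: cosine(segment_feature, emb)
              for name, emb in label_embeddings.items()}
    return max(scores, key=scores.get)

# Toy 3-d embeddings (illustrative only; real label embeddings would come
# from a pretrained word-embedding model, giving semantically related
# activities nearby vectors).
label_embeddings = {
    "long_jump":  [1.0, 0.1, 0.0],  # seen during training
    "pole_vault": [0.1, 1.0, 0.0],  # unseen during training
}
segment = [0.2, 0.9, 0.1]  # hypothetical visual feature of one segment
print(zero_shot_classify(segment, label_embeddings))  # → pole_vault
```

The segment's feature is most similar to the embedding of "pole_vault", so that unseen class is predicted even though no pole-vault videos were used in training; this shared embedding space is what the transfer property of the paper's loss function exploits.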