IEEE Trans Neural Netw Learn Syst. 2023 Apr;34(4):1852-1863. doi: 10.1109/TNNLS.2019.2962815. Epub 2023 Apr 4.
The point process is a well-established framework for modeling sequential data, such as videos, by exploring the relevance among the underlying events. Weakly supervised action recognition and localization in untrimmed videos is a challenging problem in high-level video understanding and has attracted intensive research attention. Transferring knowledge from publicly available trimmed videos, used as external guidance, is a promising way to compensate for the coarse-grained video-level annotation and to improve generalization performance. However, unconstrained knowledge transfer may introduce irrelevant noise and degrade the learned model. This article proposes a novel adaptability decomposing encoder-decoder network that transfers reliable knowledge between trimmed and untrimmed videos for action recognition and localization through bidirectional point process modeling, given only video-level annotations. By decomposing the original features into domain-adaptable and domain-specific components according to their adaptability, trimmed-untrimmed knowledge transfer can be safely confined to a more coherent subspace. An encoder-decoder-based structure is designed and jointly optimized to support effective action classification and temporal localization. Extensive experiments are conducted on two benchmark data sets (i.e., THUMOS14 and ActivityNet1.3), and the experimental results corroborate the efficacy of the method.
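The abstract does not say how the features are split into domain-adaptable and domain-specific parts. As a purely illustrative sketch, and not the authors' architecture, one simple way to realize such a split is a per-dimension soft gate: each feature dimension receives an adaptability score in (0, 1), the gated part is treated as adaptable, and the remainder as domain-specific. The function name `decompose_features`, the `gate_logits` parameter, and the toy shapes are all assumptions introduced here for illustration.

```python
import numpy as np


def decompose_features(features, gate_logits):
    """Split features into adaptable and specific parts with a soft gate.

    Hypothetical illustration only: `gate_logits` stands in for learned
    per-dimension adaptability scores, which the abstract does not define.
    """
    # Sigmoid maps each logit to an adaptability weight in (0, 1).
    gate = 1.0 / (1.0 + np.exp(-gate_logits))
    adaptable = gate * features          # part considered safe to transfer
    specific = (1.0 - gate) * features   # part kept within its own domain
    return adaptable, specific


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8))    # toy input: 4 video snippets, 8-dim features
gate_logits = rng.normal(size=(8,))   # assumed per-dimension adaptability scores

adaptable, specific = decompose_features(features, gate_logits)

# The two parts are complementary: together they reconstruct the input.
assert np.allclose(adaptable + specific, features)
```

Under this assumption, knowledge transfer between trimmed and untrimmed videos would operate only on the `adaptable` part, while the `specific` part is left untouched; in a trained model the gate would be learned jointly with the rest of the network rather than sampled at random.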