Zhang Xiao-Yu, Shi Haichao, Li Changsheng, Shi Xinchu
IEEE Trans Image Process. 2022;31:4447-4457. doi: 10.1109/TIP.2022.3185485. Epub 2022 Jul 1.
Weakly supervised action localization is a challenging task with extensive applications: it aims to identify actions and their temporal intervals when only video-level annotations are available. This paper analyzes the order-sensitive and location-insensitive properties of actions and incorporates them into a self-augmented learning framework to improve weakly supervised action localization performance. Specifically, we propose a novel two-branch network architecture with intra/inter-action shuffling, referred to as ActShufNet. The intra-action shuffling branch sets up a self-supervised order prediction task to augment the video representation with inner-video relevance, whereas the inter-action shuffling branch applies a reorganizing strategy to the existing action contents to augment the training set without resorting to any external resources. Furthermore, global-local adversarial training is introduced to enhance the model's robustness to irrelevant noise. Extensive experiments on three benchmark datasets demonstrate the efficacy of the proposed method.
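The two shuffling ideas can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `intra_action_shuffle` and `inter_action_shuffle`, the chunk count, and the choice to label each shuffle by its permutation index are all assumptions made for illustration, and the segments are placeholder strings standing in for per-segment features. The first function reflects the order-sensitive property (reordering the parts of one action yields a pretext label to predict), the second the location-insensitive property (action clips can be rearranged into a new training video while each clip keeps its internal order).

```python
import random
from itertools import permutations


def intra_action_shuffle(segments, num_chunks=3, rng=None):
    """Permute the chunks of one action; return (shuffled, permutation_label).

    Hypothetical helper: the label is the index of the applied permutation,
    which a self-supervised order-prediction head could be trained to recover.
    """
    rng = rng or random.Random()
    size = len(segments) // num_chunks
    if size == 0:
        raise ValueError("need at least num_chunks segments")
    # The last chunk absorbs any remainder so no segment is dropped.
    chunks = [segments[i * size:(i + 1) * size] for i in range(num_chunks - 1)]
    chunks.append(segments[(num_chunks - 1) * size:])
    orders = list(permutations(range(num_chunks)))
    label = rng.randrange(len(orders))
    shuffled = [seg for idx in orders[label] for seg in chunks[idx]]
    return shuffled, label


def inter_action_shuffle(clips, rng=None):
    """Concatenate action clips in random order to form one synthetic video.

    Hypothetical helper: each clip is (segments, class_label). Segments inside
    a clip keep their order; only the clip positions change. The video-level
    label is the set of classes present, returned as a sorted list.
    """
    rng = rng or random.Random()
    reordered = list(clips)
    rng.shuffle(reordered)
    segments = [seg for clip_segments, _ in reordered for seg in clip_segments]
    labels = sorted({label for _, label in reordered})
    return segments, labels
```

Calling `intra_action_shuffle(["a0", "a1", "a2", "a3", "a4", "a5"], rng=random.Random(0))` returns the six segments in a permuted chunk order together with an integer label in the range 0 to 5. Calling `inter_action_shuffle` on a list of labeled clips returns a recombined segment list and the video-level labels, using only content that is already in the training set.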