Xu Wanru, Miao Zhenjiang, Yu Jian, Ji Qiang
IEEE Trans Image Process. 2019 Sep 26. doi: 10.1109/TIP.2019.2942814.
Human activity localization aims to recognize the content and detect the locations of activities in video sequences. As the volume of untrimmed video data grows, traditional activity localization methods suffer from two major limitations. First, most existing methods require detailed annotations, i.e., bounding boxes in every frame, which are expensive and time-consuming to produce. Second, the search space for 3D activity localization is very large, which requires generating a large number of proposals. In this paper, we propose a unified deep Q-network with weak reward and weak loss (DWRLQN) to address both problems. Weak knowledge and weak constraints on the temporal dynamics of human activity are incorporated into a deep reinforcement learning framework under sparse spatial supervision, where only a portion of the frames in each video sequence is assumed to be annotated. Experiments on UCF-Sports, UCF-101 and sub-JHMDB demonstrate that the proposed model achieves promising performance while using only a very small number of proposals. More importantly, DWRLQN trained with partial annotations and weak information even outperforms fully supervised methods.
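The abstract does not spell out how the reward is computed, so the following is only an illustrative sketch of what "sparse spatial supervision" could look like in a reinforcement learning localizer. It assumes a sign-of-IoU-change reward, a common choice in RL-based localization and not necessarily the one used by DWRLQN. The names `iou`, `sparse_iou_reward`, and the `annotations` dictionary are hypothetical. For frames without a bounding-box annotation the sketch returns `None`; that is where a weak, temporally motivated reward of the kind the abstract mentions would have to take over.

```python
from typing import Dict, Optional, Tuple

# Axis-aligned box as (x1, y1, x2, y2); this layout is an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sparse_iou_reward(
    frame: int,
    old_box: Box,
    new_box: Box,
    annotations: Dict[int, Box],
) -> Optional[float]:
    """Hypothetical reward for one box-adjusting action.

    Returns +1.0 if the action increased IoU with the ground truth,
    -1.0 otherwise, and None when the frame has no annotation
    (no spatial supervision is available for it).
    """
    gt = annotations.get(frame)
    if gt is None:
        return None
    delta = iou(new_box, gt) - iou(old_box, gt)
    return 1.0 if delta > 0 else -1.0


# Only frame 0 is annotated in this toy example.
annotations = {0: (0.0, 0.0, 10.0, 10.0)}
reward_annotated = sparse_iou_reward(
    0, (5.0, 5.0, 15.0, 15.0), (1.0, 1.0, 11.0, 11.0), annotations
)
reward_unannotated = sparse_iou_reward(
    1, (5.0, 5.0, 15.0, 15.0), (1.0, 1.0, 11.0, 11.0), annotations
)
```

In this toy setup the action on frame 0 moves the box closer to the annotated ground truth and is rewarded, while frame 1 yields no supervised signal at all.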