Han Mingfei, Wang Yali, Li Mingjie, Chang Xiaojun, Yang Yi, Qiao Yu
IEEE Trans Image Process. 2024;33:1560-1573. doi: 10.1109/TIP.2024.3364536. Epub 2024 Feb 27.
In this paper, we focus on weakly supervised video object detection, where each training video is tagged only with object labels and has no bounding-box annotations. To train object detectors effectively from such weakly annotated videos, we propose a Progressive Frame-Proposal Mining (PFPM) framework that exploits discriminative proposals in a coarse-to-fine manner. First, we design a flexible Multi-Level Selection (MLS) scheme that is explicitly guided by video tags. By selecting object-relevant frames and mining important proposals from these frames, MLS reduces frame redundancy and improves proposal effectiveness, which boosts weakly supervised detectors. Moreover, we develop a novel Holistic-View Refinement (HVR) scheme, which evaluates the importance of proposals globally across frames and can thus correctly refine the pseudo ground-truth boxes used to train video detectors in a self-supervised manner. Finally, we evaluate PFPM on ImageNet VID, a large-scale benchmark for video object detection, under the weak-annotation setting. The experimental results demonstrate that PFPM significantly outperforms state-of-the-art weakly supervised detectors.
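The coarse-to-fine flow described in the abstract (select object-relevant frames, mine proposals within them, then rank proposals across frames to pick pseudo ground truth) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how frames or proposals are scored, so the per-proposal scores, the top-k selection rules, and every function and parameter name below are assumptions made purely for illustration.

```python
from typing import List, Tuple

# Hypothetical input layout: scores[f][p] is the confidence that proposal p
# in frame f shows the class named by the video tag. How such scores are
# produced is not stated in the abstract; here they are simply given.
Scores = List[List[float]]
Proposal = Tuple[int, int]  # (frame index, proposal index)


def select_frames(scores: Scores, num_frames: int) -> List[int]:
    """Coarse step (assumed rule): keep frames whose best proposal scores highest."""
    ranked = sorted(
        range(len(scores)),
        key=lambda f: max(scores[f], default=0.0),
        reverse=True,
    )
    return sorted(ranked[:num_frames])


def mine_proposals(scores: Scores, frames: List[int], per_frame: int) -> List[Proposal]:
    """Fine step (assumed rule): keep the top-scoring proposals in each selected frame."""
    kept: List[Proposal] = []
    for f in frames:
        order = sorted(range(len(scores[f])), key=lambda p: scores[f][p], reverse=True)
        kept.extend((f, p) for p in order[:per_frame])
    return kept


def refine_globally(scores: Scores, candidates: List[Proposal], num_pseudo: int) -> List[Proposal]:
    """Holistic step (assumed rule): rank mined proposals across all frames together."""
    ranked = sorted(candidates, key=lambda fp: scores[fp[0]][fp[1]], reverse=True)
    return ranked[:num_pseudo]


# Toy video: 4 frames with 3 proposals each.
scores = [
    [0.10, 0.20, 0.05],
    [0.90, 0.30, 0.70],
    [0.40, 0.80, 0.60],
    [0.20, 0.10, 0.15],
]
frames = select_frames(scores, num_frames=2)
candidates = mine_proposals(scores, frames, per_frame=2)
pseudo_gt = refine_globally(scores, candidates, num_pseudo=2)
```

In this toy run the two low-scoring frames are discarded first, and the final ranking compares the surviving proposals across frames rather than frame by frame. A real system would re-estimate the scores during training instead of treating them as fixed.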