IEEE Trans Pattern Anal Mach Intell. 2017 Oct;39(10):2074-2088. doi: 10.1109/TPAMI.2016.2612187. Epub 2016 Oct 26.
We present a spatio-temporal energy minimization formulation for simultaneous video object discovery and co-segmentation across multiple videos containing irrelevant frames. Our approach overcomes a key limitation of most existing video co-segmentation methods: they perform poorly on practical videos in which the target objects are absent from many frames. Our formulation incorporates a spatio-temporal auto-context model, which is combined with appearance modeling for superpixel labeling. The superpixel-level labels are propagated to the frame level through a multiple instance boosting algorithm with spatial reasoning, which identifies the frames containing the target object. Our method only needs to be bootstrapped with frame-level labels for a few video frames (usually 1 to 3) indicating whether they contain the target objects. Extensive experiments on four datasets validate the efficacy of our proposed method: 1) object segmentation from a single video on the SegTrack dataset, 2) object co-segmentation from multiple videos on a video co-segmentation dataset, and 3) joint object discovery and co-segmentation from multiple videos containing irrelevant frames on the MOViCS dataset and XJTU-Stevens, a new dataset that we introduce in this paper. The proposed method compares favorably with the state-of-the-art in all of these experiments.
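The propagation of superpixel-level labels to frame-level labels rests on the standard multiple-instance assumption: a frame (the "bag") contains the target object if at least one of its superpixels (the "instances") is confidently foreground. A minimal conceptual sketch of that rule follows; the function name, threshold, and scores are illustrative and not taken from the paper, which uses a boosting algorithm with spatial reasoning rather than a fixed threshold.

```python
# Conceptual sketch (not the paper's implementation): the multiple-instance
# assumption for propagating superpixel-level scores to frame-level labels.
# Each frame is a "bag" of superpixel foreground scores; the frame is labeled
# positive (object present) iff at least one superpixel score is confident.

def frame_contains_object(superpixel_scores, threshold=0.5):
    """Multiple-instance rule: a bag (frame) is positive iff any
    instance (superpixel) exceeds the foreground threshold."""
    return any(s > threshold for s in superpixel_scores)

# Hypothetical per-superpixel foreground scores for three frames:
frames = [
    [0.1, 0.2, 0.05],   # no confident foreground superpixel -> irrelevant frame
    [0.1, 0.9, 0.3],    # one superpixel strongly indicates the object
    [0.6, 0.7, 0.8],    # object spans several superpixels
]
labels = [frame_contains_object(f) for f in frames]
# labels -> [False, True, True]
```

In the paper this hard thresholding is replaced by multiple instance boosting, and spatial reasoning suppresses isolated high-scoring superpixels that lack coherent neighbors.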