Yang Le, Han Junwei, Zhang Dingwen, Liu Nian, Zhang Dong
IEEE Trans Image Process. 2018 May 16. doi: 10.1109/TIP.2018.2834221.
Weakly supervised video object segmentation (WSVOS) focuses on generating pixel-level object masks for videos tagged only with class labels, which is an essential yet challenging task. In WSVOS, the algorithm is aware only of coarse category information rather than concrete object size and location cues; moreover, it lacks reliable annotated exemplars from which to learn the temporal evolution of the investigated videos. Three challenging factors influence the performance of WSVOS: foreground object discovery in each frame, coarse object-level semantic consistency within each video, and fine-grained segmentation smoothness across neighboring frames. In this paper, we establish a semantic ranking and optical warping network (SROWN) to address all three challenges simultaneously in a unified framework. For the first challenge, we apply a still-image saliency detection method and discover the foreground object in each frame via a segmentation network. Owing to the large discrepancy between image saliency detection and video object segmentation, we go a step further and propose two subnetworks to address the remaining two challenges. For the second, we propose an attentive semantic ranking subnetwork that mines video-level tags, learns discriminative features for semantic ranking, and yields semantically consistent segmentation masks. For the third, we propose an optical flow warping subnetwork that constrains fine-grained segmentation smoothness across neighboring frames, suppressing large deformations and thus producing smooth object boundaries in adjacent frames. Experiments on two benchmark datasets, i.e., the DAVIS dataset and the YouTube-Objects dataset, demonstrate the effectiveness of the proposed approach for segmenting video objects under weak supervision.
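The optical flow warping idea mentioned in the abstract can be illustrated with a minimal sketch: warp the mask from frame t into frame t+1 using a backward flow field, then compare it with the predicted mask at t+1 as a simple smoothness check. The function names and the nearest-neighbor sampling scheme below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def warp_mask(mask, flow):
    """Warp a binary segmentation mask from frame t toward frame t+1
    using a backward optical flow field (each pixel of frame t+1 points
    back to its source location in frame t).

    mask: (H, W) binary array for frame t.
    flow: (H, W, 2) array of (dx, dy) offsets for frame t+1.
    Nearest-neighbor sampling keeps the warped mask binary.
    (Illustrative sketch; not the paper's implementation.)
    """
    H, W = mask.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Source coordinates in frame t, rounded and clipped to the image.
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, H - 1)
    return mask[src_y, src_x]

def temporal_consistency(mask_next, mask_warped):
    """Fraction of pixels on which the predicted mask at t+1 agrees
    with the mask warped from t -- a simple smoothness score."""
    return float((mask_next == mask_warped).mean())
```

A smoothness loss in this spirit would penalize disagreement between the warped and predicted masks, discouraging abrupt boundary changes between adjacent frames.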