Hong Lingyi, Zhang Wei, Chen Liangyu, Zhang Wenqiang, Fan Jianping
IEEE Trans Image Process. 2022;31:1057-1071. doi: 10.1109/TIP.2021.3137660. Epub 2022 Jan 19.
Video object segmentation is a challenging task in computer vision because the appearance of a target object can change drastically over the course of a video. To address this, space-time memory (STM) networks exploit information from all intermediate frames between the first frame and the current frame. However, using every memory frame can make STM impractical for long videos. To overcome this issue, this paper develops a novel method that selects reference frames adaptively. First, an adaptive selection criterion chooses reference frames with similar appearance and precise mask estimation, which efficiently captures rich information about the target object and mitigates appearance changes, occlusion, and model drift. Second, bi-matching (bi-scale and bi-direction) is conducted to obtain more robust correlations for objects of various scales and to prevent multiple similar objects in the current frame from being mismatched with the same target object in the reference frame. Third, a novel edge refinement technique uses an edge detection network to obtain smooth edges from the output edge confidence maps, where the edge confidence is quantized into ten sub-intervals to generate smooth edges step by step. Experimental results on the challenging benchmark datasets DAVIS-2016, DAVIS-2017, YouTube-VOS, and a Long-Video dataset demonstrate the effectiveness of the proposed approach to video object segmentation.
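The abstract does not give the exact form of the adaptive selection criterion, so the following is only a minimal sketch of the general idea: rank memory frames by how similar they look to the current frame and by how reliable their estimated masks are, then keep a small subset instead of the whole memory. The function names, the cosine-similarity measure, the multiplicative score, and the parameters `k` and `min_confidence` are all assumptions made for illustration, not the authors' formulation.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_reference_frames(
    query_feat: Sequence[float],
    memory_feats: Sequence[Sequence[float]],
    mask_confidences: Sequence[float],
    k: int = 3,
    min_confidence: float = 0.5,
) -> List[int]:
    """Return indices (in temporal order) of up to k memory frames to use as references.

    Hypothetical criterion: score = appearance similarity * mask confidence.
    Frames whose mask confidence is below min_confidence are skipped entirely,
    which is one simple way to limit drift from poorly segmented frames.
    """
    scored = []
    for idx, (feat, conf) in enumerate(zip(memory_feats, mask_confidences)):
        if conf < min_confidence:
            continue  # unreliable mask: do not use this frame as a reference
        score = cosine_similarity(query_feat, feat) * conf
        scored.append((score, idx))

    # Highest score first; ties are broken in favour of the earlier frame.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return sorted(idx for _, idx in scored[:k])


if __name__ == "__main__":
    # Toy 2-D "appearance features" for the current frame and four memory frames.
    query = [1.0, 0.0]
    memory = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [1.0, 0.0]]
    confidences = [1.0, 0.9, 0.8, 0.2]
    print(select_reference_frames(query, memory, confidences, k=2))
```

In this toy example the last memory frame matches the query exactly but is skipped because its mask confidence is low, while the second frame is kept as a candidate but ranked last because its appearance differs. The point of the sketch is only that the number of reference frames stays bounded by `k` regardless of video length.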