Yin Yingjie, Xu De, Wang Xingang, Zhang Lei
IEEE Trans Neural Netw Learn Syst. 2022 Aug;33(8):3884-3894. doi: 10.1109/TNNLS.2021.3054769. Epub 2022 Aug 3.
Most recent semisupervised video object segmentation (VOS) methods rely on fine-tuning deep convolutional neural networks online using the given mask of the first frame or the predicted masks of subsequent frames. However, online fine-tuning is usually time-consuming, limiting the practical use of such methods. We propose a directional deep embedding and appearance learning (DDEAL) method, which is free of online fine-tuning, for fast VOS. First, a global directional matching module (GDMM), which can be efficiently implemented by parallel convolutional operations, is proposed to learn a semantic pixel-wise embedding as an internal guidance. Second, an effective statistics-based directional appearance model is proposed to represent the target and background in a spherical embedding space for VOS. Equipped with the GDMM and the directional appearance model learning module, DDEAL learns static cues from the labeled first frame and dynamically updates cues from subsequent frames for object segmentation. Our method achieves state-of-the-art VOS performance without online fine-tuning. Specifically, it achieves a J & F mean score of 74.8% on the DAVIS 2017 data set and an overall score G of 71.3% on the large-scale YouTube-VOS data set, while retaining a speed of 25 fps on a single NVIDIA TITAN Xp GPU. Furthermore, our faster version runs at 31 fps with only a small loss in accuracy.
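The global matching step described above can be pictured as comparing each current-frame pixel embedding against the first-frame foreground and background embeddings; with L2-normalized embeddings, this cosine-similarity matching reduces to a matrix product (equivalently, a 1x1 convolution), which is why it parallelizes well. The sketch below is an illustrative assumption of this idea in NumPy, not the authors' implementation; the function name `global_matching` and the max-pooling over reference pixels are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit sphere (spherical embedding space)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def global_matching(ref_emb, ref_mask, cur_emb):
    """Illustrative sketch (not the paper's exact GDMM): match every
    current-frame pixel embedding against first-frame foreground and
    background embeddings by cosine similarity.

    ref_emb:  (H, W, D) first-frame embeddings
    ref_mask: (H, W) boolean foreground mask of the first frame
    cur_emb:  (h, w, D) current-frame embeddings
    Returns per-pixel maximum similarity to foreground and to background.
    """
    D = ref_emb.shape[-1]
    ref = l2_normalize(ref_emb.reshape(-1, D))      # (HW, D)
    cur = l2_normalize(cur_emb.reshape(-1, D))      # (hw, D)
    fg_sel = ref_mask.reshape(-1).astype(bool)
    # Cosine similarities; a matrix product here plays the role of
    # the parallel convolutional matching mentioned in the abstract.
    sim = cur @ ref.T                               # (hw, HW)
    fg = sim[:, fg_sel].max(axis=1).reshape(cur_emb.shape[:2])
    bg = sim[:, ~fg_sel].max(axis=1).reshape(cur_emb.shape[:2])
    return fg, bg
```

The two similarity maps can then serve as the "internal guidance" cues that a segmentation head consumes; the dynamic-update part of DDEAL would additionally refresh the reference embeddings from predicted masks of later frames.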