Lin Weihao, Chen Tao, Yu Chong
IEEE Trans Image Process. 2023;32:5977-5991. doi: 10.1109/TIP.2023.3327588. Epub 2023 Nov 7.
Semi-supervised video object segmentation (Semi-VOS), which requires annotating only the first frame of a video in order to segment the subsequent frames, has received increasing attention recently. Among existing Semi-VOS pipelines, the memory-matching-based approach is becoming the main line of research, as it can fully exploit temporal sequence information to obtain high-quality segmentation results. Although this type of method has achieved promising performance, the overall framework still suffers from heavy computational overhead, mainly caused by the per-frame dense convolution operations between high-resolution feature maps and each kernel filter. We therefore propose a sparse baseline for VOS, named SpVOS, which introduces a novel triple sparse convolution to reduce the computational cost of the overall VOS framework. The proposed triple gate takes into account both the spatial and the temporal redundancy between adjacent video frames and adaptively makes a three-way decision on how to apply the sparse convolution at each pixel. This controls the computational overhead of each layer while maintaining enough discriminative capability to distinguish similar objects and avoid error accumulation. A mixed sparse training strategy, coupled with an objective that incorporates a sparsity constraint, is also developed to balance VOS segmentation performance against computational cost. Experiments are conducted on two mainstream VOS datasets, DAVIS and Youtube-VOS. The results show that SpVOS outperforms other state-of-the-art sparse methods and remains comparable to a typical non-sparse VOS baseline: it reaches an overall score of 83.04% on the DAVIS-2017 validation set and 79.29% on the Youtube-VOS validation set, versus 82.88% and 80.36% for the baseline, while saving up to 42% of FLOPs, which shows its potential for resource-constrained scenarios.
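To make the idea of a per-pixel gated sparse convolution more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not define the three choices the gate makes, so the labels `SKIP`, `REUSE` and `COMPUTE`, the function names, and the restriction to a 1x1 convolution are all assumptions made purely for illustration. In the paper the gate is presumably learned, whereas here it is simply given as an input.

```python
import numpy as np

# Hypothetical gate labels: the abstract does not define the three choices.
SKIP, REUSE, COMPUTE = 0, 1, 2


def gated_pointwise_conv(x, prev_out, weight, gate):
    """Apply a 1x1 convolution only where gate == COMPUTE.

    x:        (H, W, C_in) features of the current frame
    prev_out: (H, W, C_out) output of the same layer on the previous frame
    weight:   (C_in, C_out) 1x1 convolution kernel
    gate:     (H, W) integer map holding SKIP, REUSE or COMPUTE per pixel
    """
    out = np.zeros(prev_out.shape, dtype=np.result_type(x, weight))
    reuse = gate == REUSE
    compute = gate == COMPUTE
    out[reuse] = prev_out[reuse]        # copy cached features, no multiply-adds
    out[compute] = x[compute] @ weight  # dense math on the selected pixels only
    return out                          # SKIP pixels stay zero


def flops_saved(gate, c_in, c_out):
    """Fraction of multiply-accumulates avoided relative to a dense 1x1 conv."""
    dense = gate.size * c_in * c_out
    sparse = int((gate == COMPUTE).sum()) * c_in * c_out
    return 1.0 - sparse / dense


# Toy usage with random features and a random gate map.
rng = np.random.default_rng(0)
H, W, C_in, C_out = 4, 4, 3, 2
x = rng.normal(size=(H, W, C_in))
prev_out = rng.normal(size=(H, W, C_out))
weight = rng.normal(size=(C_in, C_out))
gate = rng.integers(0, 3, size=(H, W))

out = gated_pointwise_conv(x, prev_out, weight, gate)
print(out.shape, flops_saved(gate, C_in, C_out))
```

In this sketch the saving comes only from the pixels that are not gated as `COMPUTE`, so the fraction of such pixels directly sets the fraction of multiply-accumulates avoided in the layer. A sparsity-constrained training objective, as mentioned in the abstract, would be one way to push that fraction toward a target budget.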