Zheng Yixiao, Xie Jiyang, Sain Aneeshan, Song Yi-Zhe, Ma Zhanyu
IEEE Trans Image Process. 2023;32:4595-4609. doi: 10.1109/TIP.2023.3302521. Epub 2023 Aug 16.
Sketch is by now a well-researched topic in the vision community. Sketch semantic segmentation in particular serves as a fundamental step towards finer-level sketch interpretation. Recent works extract discriminative features from sketches in various ways and have achieved considerable improvements in segmentation accuracy. Common approaches attend to the sketch image as a whole, to its stroke-level representation, or to the sequence information embedded in it; however, most focus on only part of this multi-faceted information. In this paper, we demonstrate for the first time that complementary information exists across all three facets of sketch data, and that segmentation performance benefits from exploiting such sketch-specific information. Specifically, we propose Sketch-Segformer, a transformer-based framework for sketch semantic segmentation that inherently treats sketches as stroke sequences rather than pixel maps. In particular, Sketch-Segformer introduces two types of structurally similar self-attention modules that operate over different receptive fields (i.e., the whole sketch or an individual stroke). The order embedding is then further synergized with spatial embeddings learned from the entire sketch as well as from localized stroke-level information. Extensive experiments show that our sketch-specific design not only obtains state-of-the-art performance on traditional figurative sketches (the SPG and SketchSeg-150K datasets) but, thanks to our use of multi-faceted sketch information, also performs well on creative sketches that do not conform to conventional object semantics (the CreativeSketch dataset). Ablation studies, visualizations, and invariance tests further justify our design choices and the effectiveness of Sketch-Segformer. Code is available at https://github.com/PRIS-CV/Sketch-SF.
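The abstract describes two structurally similar self-attention modules that differ only in their receptive field (whole sketch vs. individual stroke). A minimal NumPy sketch of that general idea, not the authors' implementation, realizes both variants with one scaled dot-product attention function plus an optional mask that restricts each point to its own stroke; the function and variable names here are illustrative assumptions:

```python
import numpy as np

def self_attention(x, mask=None):
    """Single-head scaled dot-product self-attention over point features x: (n, d).
    If mask is given, positions where mask is False are excluded from attention."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block attention outside the receptive field
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)          # row-wise softmax
    return w @ x

# Toy sketch: 5 points in 4-d feature space, grouped into 2 strokes.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
stroke_id = np.array([0, 0, 0, 1, 1])

# Whole-sketch receptive field: every point attends to every point.
global_out = self_attention(x)

# Stroke-level receptive field: points attend only within their own stroke.
stroke_mask = stroke_id[:, None] == stroke_id[None, :]
stroke_out = self_attention(x, stroke_mask)

print(global_out.shape, stroke_out.shape)  # (5, 4) (5, 4)
```

Because the mask drives out-of-stroke attention weights to zero, the stroke-level output for each stroke matches running unmasked attention on that stroke's points alone, which is the sense in which the two modules share a structure but differ in receptive field.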