Li Jing, Fan Junsong, Wang Yuxi, Yang Yuran, Zhang Zhaoxiang
IEEE Trans Image Process. 2023;32:5808-5822. doi: 10.1109/TIP.2023.3322564. Epub 2023 Oct 26.
Interactive object segmentation aims to produce object masks from user interactions such as clicks, bounding boxes, and scribbles. Clicks are the most popular interactive cue because of their efficiency, and click-based deep learning methods have attracted considerable interest in recent years. Most works encode click points as Gaussian maps and concatenate them with the image as the model's input. However, the spatial and semantic information in the Gaussian maps is degraded as it passes through multiple convolution layers and is not fully exploited by the top layers for mask prediction. To pass click information to the top layers accurately and efficiently, we propose a coarse mask guided model (CMG), which predicts coarse masks with a coarse module to guide the object mask prediction. Specifically, the coarse module encodes user clicks as query features and enriches their semantic information with backbone features through transformer layers. Coarse masks are then generated from the enriched query features and fed into CMG's decoder. Benefiting from the efficiency of the transformer, CMG's coarse module and decoder module are lightweight and computationally efficient, making the interaction process smoother. Experiments on several segmentation benchmarks demonstrate the effectiveness of our method, which achieves new state-of-the-art results compared with previous works.
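The baseline input encoding that the abstract describes (clicks rendered as Gaussian maps and concatenated with the image) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the `sigma` value, the channel-first layout, and the split into separate positive and negative click maps are all assumptions made for the example. It uses only the standard library so that it runs anywhere; a real pipeline would use an array library.

```python
import math


def click_gaussian_map(height, width, clicks, sigma=10.0):
    """Return a height x width map with a Gaussian bump at each (row, col) click.

    `sigma` is an assumed spread in pixels; the abstract does not specify one.
    """
    heat = [[0.0] * width for _ in range(height)]
    for cy, cx in clicks:
        for y in range(height):
            for x in range(width):
                d2 = (y - cy) ** 2 + (x - cx) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                # Where bumps from several clicks overlap, keep the strongest.
                if value > heat[y][x]:
                    heat[y][x] = value
    return heat


def build_model_input(image, positive_clicks, negative_clicks, sigma=10.0):
    """Append click maps to a channel-first image (list of 2D channels).

    Separate positive/negative maps are an assumption of this sketch.
    """
    height, width = len(image[0]), len(image[0][0])
    pos = click_gaussian_map(height, width, positive_clicks, sigma)
    neg = click_gaussian_map(height, width, negative_clicks, sigma)
    return list(image) + [pos, neg]


# Usage: a dummy 3 x 48 x 64 image, one positive and one negative click.
image = [[[0.5] * 64 for _ in range(48)] for _ in range(3)]
model_input = build_model_input(image, [(10, 20)], [(40, 5)], sigma=5.0)
```

The result has five channels: the three image channels followed by the two click maps. Because the click maps enter only at the input, their signal has to survive every subsequent convolution, which is the limitation the coarse-mask guidance is meant to address.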