Zhang Qian, Chen Lu, Shao Mingwen, Liang Hong, Ren Jie
College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China.
Sensors (Basel). 2023 Jul 16;23(14):6446. doi: 10.3390/s23146446.
Instance segmentation is a challenging task in computer vision, as it requires both distinguishing individual objects and predicting dense pixel regions. Segmentation models built on complex designs with large parameter counts have achieved remarkable accuracy, but from a practical standpoint a balance between accuracy and speed is more desirable. To address this need, this paper presents ESAMask, a real-time segmentation model fused with efficient sparse attention that adheres to the principles of lightweight design and efficiency. Our key contributions are as follows. First, we introduce the Related Semantic Perceived Attention mechanism (RSPA), a dynamic, sparse attention mechanism for adaptively perceiving the distinct semantic information of different targets during feature extraction. RSPA uses an adjacency matrix to locate regions with high semantic correlation within the same target, which reduces computational cost. Second, we design the GSInvSAM structure, which reduces redundant computation on concatenated features while enhancing cross-channel interaction when merging feature layers of different scales. Finally, we introduce the Mixed Receptive Field Context Perception Module (MRFCPM) in the prototype branch, enabling targets of different scales to capture feature representations of their corresponding regions during mask generation. MRFCPM fuses information from three branches, namely global content awareness, large-kernel region awareness, and convolutional channel attention, to explicitly model features at different scales. In extensive experiments on the COCO dataset, ESAMask achieves a mask AP of 45.4 at a frame rate of 45.2 FPS, surpassing current instance segmentation methods in the accuracy-speed trade-off.
In addition, the high-quality segmentation results of our method on objects of various classes and scales can be observed directly in the visualized segmentation outputs.
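The core idea behind adjacency-restricted sparse attention can be illustrated with a minimal sketch. This is not the paper's RSPA implementation; the function name, shapes, and the binary adjacency mask below are illustrative assumptions only. Attention scores are computed for all token pairs, but positions not marked as related in the adjacency matrix are masked out before the softmax, so each token aggregates information only from its semantically correlated region.

```python
import numpy as np

def sparse_masked_attention(q, k, v, adjacency):
    """Conceptual sketch (assumed API, not the paper's RSPA):
    attention restricted to positions marked as related in a
    binary adjacency matrix of shape (N, N)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                               # (N, N) similarity scores
    scores = np.where(adjacency.astype(bool), scores, -np.inf)  # drop unrelated pairs
    # Numerically stable softmax over the allowed positions only.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy example: 4 tokens, each attending only to itself and the next token.
rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((4, 8))
adjacency = np.eye(4) + np.eye(4, k=1)  # self + immediate successor
out = sparse_masked_attention(q, k, v, adjacency)
```

Because masked positions contribute zero weight, the per-token cost scales with the number of related positions rather than with all pairwise interactions, which is the source of the computational savings the abstract describes.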