

CCNet: Criss-Cross Attention for Semantic Segmentation.

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):6896-6908. doi: 10.1109/TPAMI.2020.3007032. Epub 2023 May 5.

Abstract

Contextual information is vital in visual understanding problems such as semantic segmentation and object detection. We propose a criss-cross network (CCNet) for obtaining full-image contextual information in a very effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. With a further recurrent operation, each pixel can finally capture full-image dependencies. In addition, a category consistent loss is proposed to enforce the criss-cross attention module to produce more discriminative features. Overall, CCNet has the following merits: 1) GPU memory friendly. Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11× less GPU memory. 2) High computational efficiency. The recurrent criss-cross attention reduces FLOPs by about 85 percent compared with the non-local block. 3) State-of-the-art performance. We conduct extensive experiments on the semantic segmentation benchmarks Cityscapes and ADE20K, the human parsing benchmark LIP, the instance segmentation benchmark COCO, and the video segmentation benchmark CamVid. In particular, our CCNet achieves mIoU scores of 81.9, 45.76, and 55.47 percent on the Cityscapes test set, the ADE20K validation set, and the LIP validation set respectively, which are new state-of-the-art results. The source code is available at https://github.com/speedinghzl/CCNet.
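
The abstract describes how criss-cross attention works: each position attends only to the H + W - 1 positions in its own row and column, and a second (recurrent) pass lets context propagate to the whole image. The PyTorch sketch below is meant only to make that mechanism concrete. It is not the authors' implementation (see the linked repository for that); the class and parameter names, the channel reduction factor of 8, and the omission of the centre-pixel mask used in the paper are simplifying assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrissCrossAttention(nn.Module):
    """Minimal sketch of criss-cross attention.

    Each spatial position attends to the positions in its own row and
    column, instead of the full H*W map used by a non-local block.
    """

    def __init__(self, in_channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(in_channels, in_channels // reduction, 1)
        self.key = nn.Conv2d(in_channels, in_channels // reduction, 1)
        self.value = nn.Conv2d(in_channels, in_channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual scale

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)

        # Column (vertical) affinities: within each column, every row
        # position is compared against every other row position.
        q_col = q.permute(0, 3, 2, 1).reshape(b * w, h, -1)   # (b*w, h, c')
        k_col = k.permute(0, 3, 1, 2).reshape(b * w, -1, h)   # (b*w, c', h)
        energy_col = torch.bmm(q_col, k_col)                  # (b*w, h, h)

        # Row (horizontal) affinities: within each row, every column
        # position is compared against every other column position.
        q_row = q.permute(0, 2, 3, 1).reshape(b * h, w, -1)   # (b*h, w, c')
        k_row = k.permute(0, 2, 1, 3).reshape(b * h, -1, w)   # (b*h, c', w)
        energy_row = torch.bmm(q_row, k_row)                  # (b*h, w, w)

        # Joint softmax over the criss-cross path (H + W affinities here;
        # the paper additionally masks the duplicated centre pixel).
        energy_col = energy_col.view(b, w, h, h).permute(0, 2, 1, 3)  # (b, h, w, h)
        energy_row = energy_row.view(b, h, w, w)                      # (b, h, w, w)
        attn = F.softmax(torch.cat([energy_col, energy_row], dim=-1), dim=-1)
        attn_col = attn[..., :h].permute(0, 2, 1, 3).reshape(b * w, h, h)
        attn_row = attn[..., h:].reshape(b * h, w, w)

        # Aggregate values along the column and row of each position.
        v_col = v.permute(0, 3, 1, 2).reshape(b * w, c, h)
        out_col = torch.bmm(v_col, attn_col.transpose(1, 2)).view(b, w, c, h)
        out_col = out_col.permute(0, 2, 3, 1)                         # (b, c, h, w)

        v_row = v.permute(0, 2, 1, 3).reshape(b * h, c, w)
        out_row = torch.bmm(v_row, attn_row.transpose(1, 2)).view(b, h, c, w)
        out_row = out_row.permute(0, 2, 1, 3)                         # (b, c, h, w)

        return self.gamma * (out_col + out_row) + x

Running the module twice corresponds to the recurrent operation mentioned in the abstract: context that first travels along a row can then travel along a column, so every output position indirectly depends on the whole image while the per-pass cost stays far below a non-local block.

    x = torch.randn(2, 64, 33, 33)   # (batch, channels, H, W)
    cca = CrissCrossAttention(64)
    y = cca(cca(x))                  # two recurrent passes approximate full-image context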

