IEEE Trans Neural Netw Learn Syst. 2021 Jun;32(6):2547-2560. doi: 10.1109/TNNLS.2020.3006524. Epub 2021 Jun 2.
In this article, we propose a Dual Relation-aware Attention Network (DRANet) to handle the task of scene segmentation. Efficiently exploiting context is essential for pixel-level recognition. To address this issue, we adaptively capture contextual information based on a relation-aware attention mechanism. Specifically, we append two types of attention modules on top of the dilated fully convolutional network (FCN), which model the contextual dependencies in the spatial and channel dimensions, respectively. In the attention modules, we adopt a self-attention mechanism to model semantic associations between any two pixels or channels. Each pixel or channel can adaptively aggregate context from all pixels or channels according to their correlations. To reduce the high computation and memory cost caused by this pairwise association computation, we further design two types of compact attention modules. In the compact attention modules, each pixel or channel is associated with only a small number of gathering centers and aggregates context over these centers. Meanwhile, we add a cross-level gating decoder to selectively enhance spatial details, which boosts the performance of the network. We conduct extensive experiments to validate the effectiveness of our network and achieve new state-of-the-art segmentation performance on four challenging scene segmentation data sets: Cityscapes, ADE20K, PASCAL Context, and COCO Stuff. In particular, a mean IoU score of 82.9% on the Cityscapes test set is achieved without using extra coarse annotated data.
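The full self-attention described above can be sketched as follows. This is a minimal, hedged illustration in NumPy rather than the authors' implementation: the learned query/key/value projections of the actual modules are omitted, and the function name and shapes are assumptions for demonstration only. It shows the core operation, in which every pixel aggregates context from all pixels, weighted by a softmax over pairwise feature correlations.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(x):
    """Simplified spatial self-attention (illustrative only).

    x: feature map of shape (C, H, W). Each of the N = H*W pixels
    aggregates context from all pixels according to an (N, N)
    pairwise affinity matrix -- the quadratic cost the compact
    modules are designed to avoid.
    """
    c, h, w = x.shape
    feats = x.reshape(c, h * w).T            # (N, C) pixel features
    affinity = softmax(feats @ feats.T)      # (N, N) pairwise correlations
    out = affinity @ feats                   # context aggregation per pixel
    return out.T.reshape(c, h, w)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))           # toy feature map: C=8, H=W=4
y = spatial_self_attention(x)
print(y.shape)                               # (8, 4, 4)
```

The channel attention module is the transpose of this idea: the affinity is computed between the C channel maps instead of the N pixel positions, yielding a (C, C) matrix.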
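The compact attention modules replace the full (N, N) affinity with an (N, M) one, where M is a small number of gathering centers. The sketch below is an assumption-laden illustration, not the paper's method: here the gathering centers are simply obtained by average pooling of the feature map, whereas the actual modules learn them; the function name and the pooling choice are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compact_spatial_attention(x, pool=2):
    """Compact attention sketch (illustrative only).

    Each of the N = H*W pixels attends to only M gathering centers
    (here: average-pooled features), so the affinity matrix is
    (N, M) with M << N, instead of the quadratic (N, N).
    """
    c, h, w = x.shape
    feats = x.reshape(c, h * w).T                        # (N, C)
    # gathering centers via average pooling (one illustrative choice)
    centers = x.reshape(c, h // pool, pool, w // pool, pool).mean(axis=(2, 4))
    centers = centers.reshape(c, -1).T                   # (M, C)
    affinity = softmax(feats @ centers.T)                # (N, M) associations
    out = affinity @ centers                             # aggregate over centers
    return out.T.reshape(c, h, w)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 4, 4))
y = compact_spatial_attention(x)
print(y.shape)                                           # (8, 4, 4)
```

With pool=2 on a 4x4 map, each pixel builds associations with M=4 centers rather than N=16 pixels, which is the source of the computation and memory savings the abstract describes.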