Key Laboratory of Jiangxi Province for Image Processing and Pattern Recognition, Nanchang Hangkong University, Nanchang 330063, China.
School of Software, Nanchang Hangkong University, Nanchang 330063, China.
Sensors (Basel). 2020 Feb 13;20(4):1010. doi: 10.3390/s20041010.
With the rapid development of flexible vision sensors and visual sensor networks, computer vision tasks such as object detection and tracking are entering a new phase. Accordingly, more challenging comprehensive tasks, such as instance segmentation, are also developing rapidly. Most state-of-the-art network frameworks for instance segmentation are based on Mask R-CNN (mask region-convolutional neural network). However, experimental results confirm that Mask R-CNN does not always predict instance details successfully. The scale-invariant fully convolutional network structure of Mask R-CNN ignores the difference in spatial information between receptive fields of different sizes: a large-scale receptive field focuses more on detailed information, whereas a small-scale receptive field focuses more on semantic information. As a result, the network cannot model the relationships between pixels at object edges, and these pixels are misclassified. To overcome this problem, Mask-Refined R-CNN (MR R-CNN) is proposed, in which the stride of ROIAlign (region of interest align) is adjusted. In addition, the original fully convolutional layer is replaced with a new semantic segmentation layer that realizes feature fusion by constructing a feature pyramid network and summing the forward and backward transmissions of feature maps of the same resolution. Combining the feature layers that focus on global and detailed information substantially improves segmentation accuracy. Experimental results on the COCO (Common Objects in Context) and Cityscapes datasets demonstrate that the segmentation accuracy of MR R-CNN is about 2% higher than that of Mask R-CNN with the same backbone. The average precision on large instances reaches 56.6%, exceeding all state-of-the-art methods. In addition, the proposed method incurs low time cost and is easy to implement. The experiments on the Cityscapes dataset also show that the proposed method generalizes well.
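To make the two ideas in the abstract concrete (adjusting the ROIAlign stride, and fusing forward and backward feature maps of the same resolution by summation), below is a minimal PyTorch sketch. It is not the authors' released code: the head name MaskFusionHead, the 256-channel width, the three pyramid levels, the class count, and the 1/16 spatial_scale are all illustrative assumptions; the abstract specifies only the summation-based fusion and the stride adjustment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class MaskFusionHead(nn.Module):
    """Toy mask head: build a small feature pyramid over the ROI feature
    and fuse same-resolution maps from the forward (bottom-up) and
    backward (top-down) paths by element-wise summation."""

    def __init__(self, channels=256, levels=3, num_classes=80):
        super().__init__()
        # Forward path: stride-2 convolutions halve the resolution per level.
        self.down = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, stride=2, padding=1)
             for _ in range(levels)]
        )
        # 3x3 convolutions smooth each fused map on the backward path.
        self.smooth = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1)
             for _ in range(levels)]
        )
        self.predict = nn.Conv2d(channels, num_classes, 1)  # per-class mask logits

    def forward(self, x):
        # Forward (bottom-up) transmission: keep every intermediate map.
        feats = [x]
        for conv in self.down:
            feats.append(F.relu(conv(feats[-1])))
        # Backward (top-down) transmission: upsample, then sum with the
        # forward map of the same resolution before smoothing.
        top = feats[-1]
        for i in range(len(feats) - 2, -1, -1):
            top = F.interpolate(top, size=feats[i].shape[-2:], mode="nearest")
            top = F.relu(self.smooth[i](top + feats[i]))
        return self.predict(top)

# Pooling ROIs at a finer output size (28x28 instead of Mask R-CNN's
# default 14x14) is one way to realize a smaller effective ROIAlign stride.
fmap = torch.randn(1, 256, 50, 50)                      # a backbone feature map
rois = torch.tensor([[0.0, 10.0, 10.0, 120.0, 150.0]])  # (batch_idx, x1, y1, x2, y2)
roi_feats = roi_align(fmap, rois, output_size=(28, 28), spatial_scale=1 / 16)
mask_logits = MaskFusionHead()(roi_feats)               # -> (1, 80, 28, 28)
```

Summation, rather than concatenation, keeps the channel count fixed at every level, mirroring the lateral connections of a feature pyramid network while adding no extra parameters to the fusion step itself.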