Detsikas Nikolaos, Mitianoudis Nikolaos, Pratikakis Ioannis
Electrical and Computer Engineering Department, Democritus University of Thrace, University Campus Xanthi-Kimmeria, 67100 Xanthi, Greece.
J Imaging. 2024 May 21;10(6):125. doi: 10.3390/jimaging10060125.
A fundamental task in computer vision is the differentiation and identification of distinct objects or entities in a visual scene by means of semantic segmentation. The advancement of transformer networks has surpassed traditional convolutional neural network (CNN) architectures in terms of segmentation performance. However, the continuous pursuit of optimal performance on popular evaluation metrics has led to very large architectures that require substantial computational power to operate, making them prohibitive for real-time applications, including autonomous driving. In this paper, we propose a model that leverages a visual transformer encoder with a parallel twin decoder, consisting of a visual transformer decoder and a CNN decoder with multi-resolution connections working in parallel. The two decoders are merged with the aid of two trainable CNN blocks: the fuser, which combines the information from the two decoders, and the scaler, which scales the contribution of each decoder. The proposed model achieves state-of-the-art performance on the Cityscapes and ADE20K datasets while maintaining a low-complexity network that can be used in real-time applications.
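The abstract does not specify the internal design of the fuser and scaler blocks; the following is a minimal PyTorch sketch of one plausible reading, in which a hypothetical scaler predicts per-pixel weights for the two decoder branches and a hypothetical fuser merges the weighted features. The class name `FuserScaler` and all layer choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FuserScaler(nn.Module):
    """Hypothetical sketch: merge a transformer-decoder feature map and a
    CNN-decoder feature map via a trainable scaler and fuser (assumed design)."""
    def __init__(self, channels: int):
        super().__init__()
        # Scaler (assumed): predicts per-pixel weights for the two branches.
        self.scaler = nn.Sequential(
            nn.Conv2d(2 * channels, 2, kernel_size=1),
            nn.Softmax(dim=1),  # the two branch weights sum to 1 per pixel
        )
        # Fuser (assumed): combines the weighted branches into one feature map.
        self.fuser = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_transformer: torch.Tensor, f_cnn: torch.Tensor) -> torch.Tensor:
        stacked = torch.cat([f_transformer, f_cnn], dim=1)
        w = self.scaler(stacked)  # shape (B, 2, H, W)
        weighted = torch.cat(
            [w[:, 0:1] * f_transformer, w[:, 1:2] * f_cnn], dim=1
        )
        return self.fuser(weighted)

# Usage: fuse two (B, C, H, W) decoder outputs into a single feature map.
merge = FuserScaler(channels=256)
out = merge(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
```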