School of Physics and Electronic-Electrical Engineering, ABA Teachers University, Wenchuan, Aba Tibetan and Qiang Autonomous Prefecture, Sichuan, China.
PLoS One. 2024 Oct 21;19(10):e0309286. doi: 10.1371/journal.pone.0309286. eCollection 2024.
With the continuing advance of deep learning, research on scene text detection has progressed significantly. However, complex backgrounds and varied text forms make it difficult to detect text in images. A convolutional neural network (CNN) is a deep learning model that automatically extracts features through convolution operations. In scene text detection, it can capture local text features in images, but it lacks global context. Transformers, which have been widely applied in computer vision in recent years, can capture the global information of an image and describe it intuitively. Therefore, this paper proposes a scene text detection method based on a dual-perspective CNN-transformer. The proposed channel enhanced self-attention module (CESAM) and spatial enhanced self-attention module (SESAM) are integrated into a conventional ResNet backbone network. This integration facilitates the learning of global contextual information and of the positional relationships of text, thereby alleviating the difficulty of detecting small text targets. Furthermore, this paper introduces a feature decoder designed to refine the useful text information within the feature map and to enhance the perception of fine detail. Experiments show that the proposed method significantly improves the robustness of the model across different types of text. Compared to the baseline, it achieves improvements of 2.51 percentage points (83.81 vs. 81.3) on the Total-Text dataset, 1.87 percentage points (86.07 vs. 84.2) on the ICDAR 2015 dataset, and 3.63 percentage points (86.72 vs. 83.09) on the MSRA-TD500 dataset, while also producing better visual results.
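The abstract does not specify the internals of CESAM or SESAM, so the following is only a minimal sketch of the general idea of applying self-attention to a CNN feature map from two perspectives: across channels and across spatial positions. Everything here is an assumption made for illustration: the function names, the use of identity query/key/value projections (no learned weights), the scaling factors, the residual connections, and the choice to apply the channel step before the spatial step. It is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def channel_self_attention(feat):
    """Hypothetical channel-wise self-attention on a (C, H, W) feature map.

    Each channel is treated as one token of length H*W, so the attention
    matrix has shape (C, C) and mixes information between channels.
    """
    c, h, w = feat.shape
    chans = feat.reshape(c, h * w)                        # (C, HW)
    attn = softmax(chans @ chans.T / np.sqrt(h * w))      # (C, C)
    out = attn @ chans                                    # (C, HW)
    return feat + out.reshape(c, h, w)                    # residual (assumed)


def spatial_self_attention(feat):
    """Hypothetical spatial self-attention on a (C, H, W) feature map.

    Each spatial position is treated as one token of length C, so the
    attention matrix has shape (HW, HW) and every position can attend to
    every other position, giving a global receptive field.
    """
    c, h, w = feat.shape
    tokens = feat.reshape(c, h * w).T                     # (HW, C)
    attn = softmax(tokens @ tokens.T / np.sqrt(c))        # (HW, HW)
    out = attn @ tokens                                   # (HW, C)
    return feat + out.T.reshape(c, h, w)                  # residual (assumed)


def dual_perspective_block(feat):
    """Apply the channel step, then the spatial step (ordering is assumed)."""
    return spatial_self_attention(channel_self_attention(feat))


if __name__ == "__main__":
    # Stand-in for a ResNet stage output: 8 channels on a 4x6 grid.
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((8, 4, 6))
    print(dual_perspective_block(feat).shape)
```

Both steps preserve the shape of the feature map, which is what would allow blocks of this kind to be inserted between stages of a convolutional backbone without changing the layers around them.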