Zhang Hui, Luo Guiyang, Kang Jian, Huang Shan, Wang Xiao, Wang Fei-Yue
IEEE Trans Neural Netw Learn Syst. 2024 Jul;35(7):10145-10158. doi: 10.1109/TNNLS.2023.3239696. Epub 2024 Jul 8.
Recent years have witnessed the growing popularity of connectionist temporal classification (CTC) and the attention mechanism in scene text recognition (STR). CTC-based methods are faster and computationally cheaper, but less effective than attention-based methods. To retain both computational efficiency and effectiveness, we propose the global-local attention-augmented light Transformer (GLaLT), which adopts a Transformer-based encoder-decoder structure to orchestrate the CTC and attention mechanisms. The encoder combines a self-attention module, which captures long-range global dependencies, with a convolution module, which focuses on local context modeling. The decoder consists of two parallel modules: a Transformer-decoder-based attention module and a CTC module. During training, the attention module guides the CTC module to extract robust features; it is removed in the testing phase. Extensive experiments on standard benchmarks demonstrate that GLaLT achieves state-of-the-art performance for both regular and irregular STR. In terms of tradeoffs, GLaLT lies at or near the frontier that simultaneously maximizes speed, accuracy, and computational efficiency.
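To make the described architecture concrete, below is a minimal sketch of the idea, assuming PyTorch. This is not the authors' implementation: all module names, layer counts, dimensions, and the loss weighting are illustrative assumptions. It shows the two elements the abstract emphasizes: an encoder layer pairing self-attention (global dependencies) with convolution (local context), and a decoder with parallel CTC and attention branches, where the attention branch participates only in training.

    # Hypothetical sketch of a GLaLT-style model; not the paper's code.
    import torch
    import torch.nn as nn

    class GlobalLocalEncoderLayer(nn.Module):
        """Self-attention for global dependencies + convolution for local context."""

        def __init__(self, d_model: int = 256, n_heads: int = 4, kernel: int = 3):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.conv = nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2)
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):  # x: (batch, seq, d_model)
            a, _ = self.attn(x, x, x)                          # long-range global modeling
            x = self.norm1(x + a)
            c = self.conv(x.transpose(1, 2)).transpose(1, 2)   # local context modeling
            return self.norm2(x + c)

    class GLaLTSketch(nn.Module):
        """Encoder plus two parallel decoder branches: CTC and attention."""

        def __init__(self, n_classes: int, d_model: int = 256):
            super().__init__()
            self.encoder = nn.Sequential(
                *[GlobalLocalEncoderLayer(d_model) for _ in range(3)]
            )
            self.ctc_head = nn.Linear(d_model, n_classes)      # kept at test time
            dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
            self.attn_decoder = nn.TransformerDecoder(dec_layer, num_layers=1)
            self.attn_head = nn.Linear(d_model, n_classes)     # training-only branch
            self.embed = nn.Embedding(n_classes, d_model)

        def forward(self, feats, targets=None):
            mem = self.encoder(feats)                          # feats: (batch, seq, d_model)
            ctc_logits = self.ctc_head(mem)
            if self.training and targets is not None:
                # Causal masking omitted for brevity in this sketch.
                tgt = self.embed(targets)
                attn_logits = self.attn_head(self.attn_decoder(tgt, mem))
                return ctc_logits, attn_logits                 # joint loss guides the CTC branch
            return ctc_logits                                  # attention branch removed at inference

Under these assumptions, training would combine the two branches with something like loss = ctc_loss + lambda_attn * cross_entropy_loss (weights and loss forms are guesses, not from the paper), while inference runs only the lightweight CTC path, which is how the design retains CTC's speed while borrowing the attention branch's supervision.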