IEEE Trans Image Process. 2022;31:5585-5598. doi: 10.1109/TIP.2022.3197981. Epub 2022 Aug 30.
The exploration of linguistic information has promoted the development of scene text recognition. Benefiting from parallel reasoning and global relationship capture, transformer-based language models (TLMs) have recently achieved dominant performance. Since the TLM is a structure decoupled from the recognition process, we argue that its capability is limited by the low-quality visual predictions it receives as input. Specifically: 1) a visual prediction with low character-wise accuracy increases the correction burden of the TLM; 2) an inconsistent word length between the visual prediction and the original image misguides language modeling in the TLM. In this paper, we propose a Progressive scEne Text Recognizer (PETR) that improves the capability of the transformer-based language model by addressing these two problems. First, a Destruction Learning Module (DLM) is proposed to exploit linguistic information in the visual context. DLM introduces the recognition of destructed images with disordered patches during training: by guiding the vision model to restore the patch order and make word-level predictions on the destructed images, it explores the inner relationships among local visual patches and yields visual predictions with high character-wise accuracy. Second, a new Language Rectification Module (LRM) is proposed to optimize the word length and thereby rectify the language modeling guidance. By applying LRM progressively across different language modeling steps, a novel progressive rectification network is constructed to handle extremely challenging cases (e.g., distortion and occlusion). With DLM and LRM, PETR enhances the capability of the transformer-based language model from a more general perspective, namely by reducing the correction burden and rectifying the language modeling guidance. Compared with parallel transformer-based methods, PETR obtains improvements of 1.0% on regular and 0.8% on irregular datasets while introducing only 1.7M additional parameters. Extensive experiments on both English and Chinese benchmarks demonstrate that PETR achieves state-of-the-art results.
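The destruction learning idea can be pictured with a minimal sketch: split a text image into fixed-width vertical strips, shuffle them, and train the vision model with both a recognition loss and an order-restoration loss. The PyTorch snippet below is only an illustration under these assumptions; the strip width, the unweighted loss sum, and the helper names (destruct, dlm_loss) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def destruct(img: torch.Tensor, patch_w: int = 8):
    """Shuffle fixed-width vertical strips of a text image.

    img: (C, H, W) tensor with W divisible by patch_w.
    Returns the destructed image and, for each shuffled slot,
    the original strip index (the order-restoration target).
    """
    C, H, W = img.shape
    n = W // patch_w
    strips = img.reshape(C, H, n, patch_w)        # split the width into n strips
    perm = torch.randperm(n)                      # a random reading order
    destructed = strips[:, :, perm, :].reshape(C, H, W)
    return destructed, perm

def dlm_loss(char_logits, char_targets, order_logits, perm):
    """Joint objective: recognize the word AND recover the strip order.

    char_logits:  (T, num_classes) per-position character logits.
    char_targets: (T,) ground-truth character indices.
    order_logits: (n, n) per-slot logits over original strip positions.
    perm:         (n,) original index of the strip placed in each slot.
    """
    recognition = F.cross_entropy(char_logits, char_targets)
    restoration = F.cross_entropy(order_logits, perm)
    return recognition + restoration
```

In the abstract's framing, the restoration term forces the vision model to exploit inter-patch context, which is what raises the character-wise accuracy of the visual prediction handed to the TLM.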
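The abstract does not specify how LRM estimates the word length, so the following is only a schematic sketch of the progressive rectification loop it describes: at each language modeling step, a hypothetical length head re-estimates the word length from the current prediction before the TLM refines it again. All names and signatures here (progressive_rectify, length_head, the tlm call) are assumptions for illustration, not the paper's API.

```python
import torch.nn as nn

def progressive_rectify(visual_feats, visual_pred,
                        tlm: nn.Module, length_head: nn.Module,
                        steps: int = 3):
    """Schematic progressive rectification loop (illustrative only).

    visual_feats: features from the vision model.
    visual_pred:  the initial, possibly length-inconsistent prediction.
    At every step the word length is re-estimated (the LRM role here),
    then the TLM re-runs language modeling under that rectified length.
    """
    pred = visual_pred
    for _ in range(steps):
        length = length_head(visual_feats, pred)   # rectify the word length
        pred = tlm(pred, length)                   # refine under the new length
    return pred
```

The point of iterating is that each pass hands the TLM a prediction whose length better matches the original image, so later passes can concentrate on the hard residual errors (e.g., characters lost to distortion or occlusion).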