IEEE Trans Image Process. 2021;30:1687-1701. doi: 10.1109/TIP.2020.3045602. Epub 2021 Jan 14.
Scene text recognition, the final step of a scene text reading system, has made impressive progress with deep neural networks. However, existing recognition methods focus on geometrically regular or irregular scene text and remain limited when handling semantically arbitrary-orientation scene text. Moreover, previous scene text recognizers usually learn single-scale feature representations for characters of various scales, and therefore cannot model effective contexts for different characters. In this paper, we propose a novel scale-adaptive orientation attention network for arbitrary-orientation scene text recognition, which consists of a dynamic log-polar transformer and a sequence recognition network. Specifically, the dynamic log-polar transformer learns the log-polar origin in order to adaptively convert arbitrary rotations and scales of scene text into shifts in log-polar space, which helps generate rotation-aware and scale-aware visual representations. The sequence recognition network is an encoder-decoder model that incorporates a novel character-level receptive field attention module to encode more valid contexts for characters of various scales. The whole architecture can be trained end-to-end, requiring only the word image and its corresponding ground-truth text. Extensive experiments on several public datasets demonstrate the effectiveness and superiority of the proposed method.
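The property the dynamic log-polar transformer relies on, namely that rotation and scaling about the origin become shifts in log-polar space, can be checked numerically. The sketch below is only an illustration of that coordinate identity, not the proposed network: it uses a fixed, hand-picked origin, whereas the paper learns the origin, and the helper names `to_log_polar` and `rotate_and_scale` are hypothetical.

```python
import numpy as np

def to_log_polar(points, origin):
    """Map Cartesian points of shape (N, 2) to (log-radius, angle) about `origin`."""
    d = np.asarray(points, dtype=float) - np.asarray(origin, dtype=float)
    log_r = np.log(np.hypot(d[:, 0], d[:, 1]))
    theta = np.arctan2(d[:, 1], d[:, 0])
    return np.stack([log_r, theta], axis=1)

def rotate_and_scale(points, origin, angle, scale):
    """Rotate counterclockwise by `angle` (radians) and scale by `scale` about `origin`."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    o = np.asarray(origin, dtype=float)
    return (np.asarray(points, dtype=float) - o) @ rot.T * scale + o

# Assumption: a fixed origin chosen by hand; the paper's transformer learns it.
origin = (2.0, 1.0)
points = np.array([[3.0, 1.5], [4.0, 2.0], [2.5, 3.0]])
angle, scale = np.deg2rad(30.0), 1.5

before = to_log_polar(points, origin)
after = to_log_polar(rotate_and_scale(points, origin, angle, scale), origin)

shift = after - before
# Wrap angle differences into [-pi, pi) so they are comparable.
shift[:, 1] = (shift[:, 1] + np.pi) % (2 * np.pi) - np.pi
print(shift)
```

Every point is shifted by the same amount: the first column equals log(1.5) and the second equals π/6, regardless of where the point sits. This is why a representation sampled on a log-polar grid can expose rotation and scale as translations, which convolutional features handle naturally.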