IEEE Trans Pattern Anal Mach Intell. 2019 Sep;41(9):2035-2048. doi: 10.1109/TPAMI.2018.2848939. Epub 2018 Jun 25.
A challenging aspect of scene text recognition is handling text with distortions or irregular layouts. In particular, perspective text and curved text are common in natural scenes and are difficult to recognize. In this work, we introduce ASTER, an end-to-end neural network model that comprises a rectification network and a recognition network. The rectification network adaptively transforms an input image into a new one, rectifying the text in it. It is powered by a flexible Thin-Plate Spline (TPS) transformation, which handles a variety of text irregularities and is trained without human annotations. The recognition network is an attentional sequence-to-sequence model that predicts a character sequence directly from the rectified image. The whole model is trained end to end, requiring only images and their ground-truth text. Through extensive experiments, we verify the effectiveness of the rectification and demonstrate the state-of-the-art recognition performance of ASTER. Furthermore, we demonstrate that ASTER is a powerful component in end-to-end recognition systems because of its ability to enhance the detector.
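The Thin-Plate Spline transformation mentioned above can be illustrated with a minimal NumPy sketch. This is not the ASTER implementation: the function names (`edge_control_points`, `fit_tps`, `apply_tps`) are hypothetical, the kernel form U(r) = r^2 log(r^2) is one common TPS convention, and placing control points along the top and bottom edges of the rectified image is an assumption made for the example. In the model described here the input-image control points would be predicted by the rectification network; in this sketch they are supplied by hand.

```python
import numpy as np


def _tps_kernel(a, b):
    """Radial basis U(r) = r^2 * log(r^2) between point sets a (N, 2) and b (M, 2)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # U(0) is defined as 0; the inner np.where keeps log() away from zero.
    return np.where(d2 > 0.0, d2 * np.log(np.where(d2 > 0.0, d2, 1.0)), 0.0)


def edge_control_points(per_edge):
    """Assumed layout: evenly spaced points on the top and bottom edges of a unit square."""
    x = np.linspace(0.0, 1.0, per_edge)
    top = np.stack([x, np.zeros_like(x)], axis=1)
    bottom = np.stack([x, np.ones_like(x)], axis=1)
    return np.concatenate([top, bottom], axis=0)


def fit_tps(rectified_pts, input_pts):
    """Solve for TPS parameters that map rectified-image points to input-image points."""
    k = rectified_pts.shape[0]
    P = np.hstack([np.ones((k, 1)), rectified_pts])  # affine basis [1, x, y]
    L = np.zeros((k + 3, k + 3))
    L[:k, :k] = _tps_kernel(rectified_pts, rectified_pts)
    L[:k, k:] = P
    L[k:, :k] = P.T
    rhs = np.zeros((k + 3, 2))
    rhs[:k] = input_pts
    # Rows 0..k-1 are the radial weights, the last 3 rows are the affine part.
    return np.linalg.solve(L, rhs)


def apply_tps(params, rectified_pts, query_pts):
    """Map query points in the rectified image to sampling locations in the input image."""
    k = rectified_pts.shape[0]
    U = _tps_kernel(query_pts, rectified_pts)
    P = np.hstack([np.ones((query_pts.shape[0], 1)), query_pts])
    return U @ params[:k] + P @ params[k:]


# Example: control points on the input image follow a curved text line.
dst = edge_control_points(3)
src = dst.copy()
src[:, 1] += 0.2 * np.sin(np.pi * dst[:, 0])
params = fit_tps(dst, src)
```

Applying `apply_tps` to a dense grid of rectified-image coordinates yields the input-image locations to sample from, which is how a rectified image would then be produced (for example by bilinear interpolation, not shown here).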