Attention Guided Feature Encoding for Scene Text Recognition

Author information

Hassan Ehtesham, V L Lekshmi

Affiliation

Department of Computer Science and Engineering, Kuwait College of Science and Technology, Doha District, Block 4, Kuwait City 35004, Kuwait.

Publication information

J Imaging. 2022 Oct 8;8(10):276. doi: 10.3390/jimaging8100276.

DOI:10.3390/jimaging8100276

PMID:36286370

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9604773/

Abstract

Real-life scene images exhibit a wide range of variation in text appearance, including complex shapes, varying sizes, and decorative fonts. Consequently, text recognition from scene images remains a challenging problem in computer vision research. We present a scene text recognition methodology built on a novel feature-enhanced convolutional recurrent neural network architecture. Our work addresses scene text recognition as well as sequence-to-sequence modeling, for which a novel deep encoder-decoder network is proposed. The encoder in the proposed network is designed around a hierarchy of convolutional blocks equipped with spatial attention blocks, followed by bidirectional long short-term memory layers. In contrast to existing methods for scene text recognition, which incorporate temporal attention on the decoder side of the architecture, our convolutional architecture incorporates a novel spatial attention design to guide feature extraction toward textual details in scene text images. The experiments and analysis demonstrate that our approach learns robust text-specific feature sequences for input images, as the convolutional architecture designed for feature extraction is tuned to capture a broader spatial text context. With extensive experiments on the ICDAR2013, ICDAR2015, IIIT5K and SVT datasets, the paper demonstrates an improvement over many important state-of-the-art methods.
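
To make the encoder idea concrete, the following is a minimal sketch of spatial attention followed by conversion to a feature sequence. It is not the authors' implementation: the abstract does not specify the attention block, so the mask here is assumed to be a plain sigmoid over the channel-wise mean, and the names `spatial_attention` and `to_sequence` are hypothetical. In the architecture described above, a learned attention block would sit inside the convolutional hierarchy, and the resulting column-wise sequence would be passed to the bidirectional LSTM layers.

```python
import math


def spatial_attention(features):
    """Re-weight a C x H x W feature map with one attention weight per location.

    Assumption: the mask is a sigmoid over the channel-wise mean, a simplified
    stand-in for a learned spatial attention block.
    """
    channels = len(features)
    height = len(features[0])
    width = len(features[0][0])
    # One attention weight in (0, 1) for each spatial location (h, w).
    mask = [
        [
            1.0 / (1.0 + math.exp(
                -sum(features[c][h][w] for c in range(channels)) / channels
            ))
            for w in range(width)
        ]
        for h in range(height)
    ]
    # Apply the same mask to every channel.
    attended = [
        [[features[c][h][w] * mask[h][w] for w in range(width)]
         for h in range(height)]
        for c in range(channels)
    ]
    return attended, mask


def to_sequence(features):
    """Average over height so each image column becomes one C-dimensional vector."""
    channels = len(features)
    height = len(features[0])
    width = len(features[0][0])
    return [
        [sum(features[c][h][w] for h in range(height)) / height
         for c in range(channels)]
        for w in range(width)
    ]


# Toy 2-channel, 2 x 3 feature map: only the middle column responds strongly.
toy = [
    [[0.0, 4.0, 0.0], [0.0, 4.0, 0.0]],
    [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]],
]
attended, mask = spatial_attention(toy)
sequence = to_sequence(attended)  # one feature vector per column, left to right
```

In this toy input, locations with strong responses receive a mask value close to 1, while empty locations receive 0.5, so the attended map emphasizes the column that carries signal before it is read out as a left-to-right sequence.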
