Bi Hengyue, Xu Canhui, Shi Cao, Liu Guozhu, Zhang Honghong, Li Yuteng, Dong Junyu
IEEE Trans Image Process. 2023;32:4142-4155. doi: 10.1109/TIP.2023.3294822. Epub 2023 Jul 20.
As a prerequisite step of scene text reading, scene text detection is known as a challenging task due to the diversity and variability of natural scene text. Most existing methods either adopt bottom-up sub-text component extraction or focus on top-down text contour regression. From a hybrid perspective, we explore hierarchical instance-level and component-level text representations for arbitrarily-shaped scene text detection. In this work, we propose a novel Hierarchical Graph Reasoning Network (HGR-Net), which consists of a Text Feature Extraction Network (TFEN) and a Text Relation Learner Network (TRLN). TFEN adaptively learns multi-grained text candidates from shared convolutional feature maps, including instance-level text contours and component-level quadrangles. In TRLN, an inter-text graph is constructed to capture position-aware global contextual information between text instances, and an intra-text graph is designed to estimate geometric attributes for establishing component-level linkages. We then bridge the instance level and the component level through cross-feed interaction, achieving hierarchical relational reasoning by learning complementary graph embeddings across levels. Experiments on three publicly available benchmarks, SCUT-CTW1500, Total-Text, and ICDAR15, demonstrate that HGR-Net achieves state-of-the-art performance on arbitrary-orientation and arbitrary-shape scene text detection.
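The inter-text graph described above can be illustrated with a minimal sketch: nodes are text-instance embeddings, edges connect spatially nearby instances (a hypothetical position-awareness rule; the paper's actual graph construction and learner are not specified in this abstract), and a single mean-aggregation graph convolution propagates contextual information between connected instances. All names (`build_position_adjacency`, `graph_conv`), the distance-threshold rule, and the toy dimensions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def build_position_adjacency(centers, radius):
    """Hypothetical position-aware rule: connect text candidates whose
    bounding-box centers lie within `radius` pixels; add self-loops."""
    n = len(centers)
    adj = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(centers[i] - centers[j]) <= radius:
                adj[i, j] = adj[j, i] = 1.0
    return adj

def graph_conv(features, adjacency, weight):
    """One generic graph-convolution layer: relu((D^-1 A) H W),
    i.e. mean-aggregate neighbor features, then a learned projection."""
    deg = adjacency.sum(axis=1, keepdims=True)  # node degrees (incl. self-loop)
    h = (adjacency @ features) / deg            # mean aggregation over neighbors
    return np.maximum(h @ weight, 0.0)          # ReLU activation

rng = np.random.default_rng(0)
centers = rng.uniform(0, 100, size=(5, 2))  # toy text-instance centers
feats = rng.standard_normal((5, 8))         # toy instance embeddings
w = rng.standard_normal((8, 8))             # stand-in for learned weights
adj = build_position_adjacency(centers, radius=40.0)
out = graph_conv(feats, adj, w)
print(out.shape)  # (5, 8): one contextualized embedding per instance
```

An intra-text graph could be sketched the same way, with component-level quadrangles as nodes and edges predicted from estimated geometric attributes rather than a fixed distance threshold.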