Department of Artificial Intelligence Convergence, Chonnam National University, 77 Yongbong-ro, Gwangju 500-757, Republic of Korea.
Sensors (Basel). 2023 Jun 25;23(13):5889. doi: 10.3390/s23135889.
Detecting dense text in scene images is a challenging task due to the high variability, complexity, and overlap of text regions. To adequately distinguish high-density text instances in scenes, we propose an efficient approach called DenseTextPVT. We first generate high-resolution features at multiple levels to enable accurate dense text detection, which is essential for dense prediction tasks. Additionally, to enhance the feature representation, we design the Deep Multi-scale Feature Refinement Network (DMFRN), which effectively detects texts of varying sizes, shapes, and fonts, including small-scale texts. DenseTextPVT then draws on the Pixel Aggregation (PA) similarity-vector algorithm to cluster text pixels into the correct text kernels in the post-processing step. In this way, our proposed method improves the precision of text detection and effectively reduces overlap between adjacent dense text regions in natural images. Comprehensive experiments on the TotalText, CTW1500, and ICDAR-2015 benchmark datasets demonstrate the effectiveness of our method in comparison with existing methods.
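As a rough illustration of the Pixel Aggregation idea used in the post-processing step: each text pixel is assigned to the text kernel whose embedding is most similar, and left unassigned if no kernel is close enough. This is a minimal sketch, not the paper's implementation; the function name, Euclidean distance metric, and threshold are illustrative assumptions.

```python
import numpy as np

def aggregate_pixels(pixel_embeddings, kernel_embeddings, threshold=1.0):
    """Assign each text pixel to the kernel with the nearest mean
    embedding, provided the distance is below `threshold`.

    A simplified sketch of Pixel-Aggregation-style clustering;
    the distance metric and threshold value are assumptions.
    Returns one kernel index per pixel, or -1 if unassigned.
    """
    labels = []
    for emb in pixel_embeddings:
        # Distance from this pixel's embedding to every kernel embedding
        dists = np.linalg.norm(kernel_embeddings - emb, axis=1)
        k = int(np.argmin(dists))
        labels.append(k if dists[k] < threshold else -1)
    return labels
```

Clustering pixels this way lets adjacent text instances be separated even when their regions overlap, since each pixel is pulled toward exactly one kernel's embedding.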