
Explainable Connectionist-Temporal-Classification-Based Scene Text Recognition

Authors

Buoy Rina, Iwamura Masakazu, Srun Sovila, Kise Koichi

Affiliations

Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University, Osaka 599-8531, Japan.

Department of Information Technology Engineering, Faculty of Engineering, Royal University of Phnom Penh, Phnom Penh 12156, Cambodia.

Publication

J Imaging. 2023 Nov 15;9(11):248. doi: 10.3390/jimaging9110248.

DOI: 10.3390/jimaging9110248
PMID: 37998095
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10672533/
Abstract

Connectionist temporal classification (CTC) is a favored decoder in scene text recognition (STR) for its simplicity and efficiency. However, most CTC-based methods operate on one-dimensional (1D) vector sequences, usually derived from a recurrent neural network (RNN) encoder. As a result, there is no explainable 2D spatial relationship between the predicted characters and the corresponding image regions, which is essential for model explainability. 2D attention-based methods, on the other hand, improve recognition accuracy and provide character location information via cross-attention mechanisms that link predictions to image regions, but they are more computationally intensive than 1D CTC-based methods. To achieve both low latency and model explainability via character localization with a 1D CTC decoder, we propose a marginalization-based method that processes 2D feature maps and predicts a sequence of 2D joint probability distributions over the height and class dimensions. Based on this method, we introduce an association map that aids character localization and explains model predictions, paralleling the role of a cross-attention map in computationally intensive attention-based architectures. With the proposed method, we consider a ViT-CTC STR architecture that uses a 1D CTC decoder and a pretrained vision Transformer (ViT) as a 2D feature extractor. Our ViT-CTC models were trained on synthetic data and fine-tuned on real labeled sets, and they outperform recent state-of-the-art (SOTA) CTC-based methods on benchmarks in terms of recognition accuracy. Compared with baseline Transformer-decoder-based models, our ViT-CTC models offer a speed boost of up to 12 times regardless of the backbone, with at most a 3.1% reduction in total word recognition accuracy. In addition, both qualitative and quantitative assessments show that character locations estimated from the association map align closely with those from the cross-attention map and with ground-truth character-level bounding boxes.
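To make the marginalization step concrete, below is a minimal PyTorch sketch, assuming the ViT patch tokens have been reshaped into a (B, H, W, D) feature map. The module name MarginalizedCTCHead and the single linear projection are illustrative assumptions, not the authors' implementation. Each image column is treated as one CTC time step: a joint softmax is taken over the height and class dimensions, and summing out the height yields the 1D class sequence the CTC decoder consumes, while the joint distribution itself plays the role of the association map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MarginalizedCTCHead(nn.Module):
    """Joint (height, class) prediction per column, marginalized for 1D CTC.

    A sketch under assumed tensor layouts, not the paper's code.
    """

    def __init__(self, d_model: int, num_classes: int):
        super().__init__()
        # per-position class scores; num_classes includes the CTC blank
        self.proj = nn.Linear(d_model, num_classes)

    def forward(self, feat_2d: torch.Tensor):
        # feat_2d: (B, H, W, D), e.g. ViT patch tokens reshaped onto the image grid
        B, H, W, D = feat_2d.shape
        logits = self.proj(feat_2d).permute(0, 2, 1, 3)  # (B, W, H, C)
        C = logits.shape[-1]
        # one joint softmax per column (time step) over height x class
        joint = F.softmax(logits.reshape(B, W, H * C), dim=-1).reshape(B, W, H, C)
        # marginalize out the height dimension -> standard 1D CTC input
        p_class = joint.sum(dim=2)                       # (B, W, C)
        log_probs = torch.log(p_class.clamp_min(1e-9))
        # `joint` doubles as the association map linking predictions to image rows
        return log_probs, joint
```

During training, log_probs can be permuted to (W, B, C) and passed to torch.nn.functional.ctc_loss; at inference, the only cost over a plain 1D CTC projection in this sketch is the reshaped joint softmax, which is consistent with the low-latency claim above.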

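The abstract also states that the association map localizes characters much as a cross-attention map does. Below is a hedged sketch of one way to read locations out of the joint tensor from the previous snippet: the column index of each collapsed, non-blank CTC prediction gives the horizontal position, and the expectation over the height distribution for the predicted class gives the vertical position. The exact localization rule used in the paper may differ.

```python
import torch


def localize_characters(log_probs: torch.Tensor, joint: torch.Tensor, blank: int = 0):
    """Greedy CTC decode with per-character vertical localization.

    log_probs: (B, W, C) column-wise class log-probabilities
    joint:     (B, W, H, C) association map from MarginalizedCTCHead
    """
    pred = log_probs.argmax(dim=-1)[0]            # (W,) best class per column
    chars, prev = [], blank
    for t, c in enumerate(pred.tolist()):
        if c != blank and c != prev:              # collapse repeats, drop blanks
            h_dist = joint[0, t, :, c]
            h_dist = h_dist / h_dist.sum()        # height distribution for class c
            rows = torch.arange(h_dist.numel(), dtype=h_dist.dtype)
            row = float((h_dist * rows).sum())    # expected row index (vertical pos)
            chars.append({"class": c, "col": t, "row": row})
        prev = c
    return chars
```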

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f406/10672533/27c1479e720e/jimaging-09-00248-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f406/10672533/f97681b1f1c3/jimaging-09-00248-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f406/10672533/a2f4cf597e7e/jimaging-09-00248-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f406/10672533/cbedf353775e/jimaging-09-00248-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f406/10672533/f912a967c0eb/jimaging-09-00248-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f406/10672533/8014d35e22c2/jimaging-09-00248-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f406/10672533/3d8277861b90/jimaging-09-00248-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f406/10672533/e19384d9cfc5/jimaging-09-00248-g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f406/10672533/779c2df46298/jimaging-09-00248-g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f406/10672533/a2824b93e9b3/jimaging-09-00248-g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f406/10672533/c11f64bede9d/jimaging-09-00248-g011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f406/10672533/be4fad8620fd/jimaging-09-00248-g012.jpg

Similar Articles

1. Explainable Connectionist-Temporal-Classification-Based Scene Text Recognition. J Imaging. 2023 Nov 15;9(11):248. doi: 10.3390/jimaging9110248.
2. ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition. J Imaging. 2023 Dec 13;9(12):276. doi: 10.3390/jimaging9120276.
3. GLaLT: Global-Local Attention-Augmented Light Transformer for Scene Text Recognition. IEEE Trans Neural Netw Learn Syst. 2024 Jul;35(7):10145-10158. doi: 10.1109/TNNLS.2023.3239696. Epub 2024 Jul 8.
4. RT-ViT: Real-Time Monocular Depth Estimation Using Lightweight Vision Transformers. Sensors (Basel). 2022 May 19;22(10):3849. doi: 10.3390/s22103849.
5. Lightweight Scene Text Recognition Based on Transformer. Sensors (Basel). 2023 May 5;23(9):4490. doi: 10.3390/s23094490.
6. SLOAN: Scale-Adaptive Orientation Attention Network for Scene Text Recognition. IEEE Trans Image Process. 2021;30:1687-1701. doi: 10.1109/TIP.2020.3045602. Epub 2021 Jan 14.
7. Enhancement of handwritten text recognition using AI-based hybrid approach. MethodsX. 2024 Mar 10;12:102654. doi: 10.1016/j.mex.2024.102654. eCollection 2024 Jun.
8. Multiple attention-based encoder-decoder networks for gas meter character recognition. Sci Rep. 2022 Jun 20;12(1):10371. doi: 10.1038/s41598-022-14434-0.
9. Attention Guided Feature Encoding for Scene Text Recognition. J Imaging. 2022 Oct 8;8(10):276. doi: 10.3390/jimaging8100276.
10. Image-to-Character-to-Word Transformers for Accurate Scene Text Recognition. IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12908-12921. doi: 10.1109/TPAMI.2022.3230962. Epub 2023 Oct 3.
