PETR: Rethinking the Capability of Transformer-Based Language Model in Scene Text Recognition.

Publication Info

IEEE Trans Image Process. 2022;31:5585-5598. doi: 10.1109/TIP.2022.3197981. Epub 2022 Aug 30.

DOI: 10.1109/TIP.2022.3197981
PMID: 35998166
Abstract

The exploration of linguistic information promotes the development of scene text recognition task. Benefiting from the significance in parallel reasoning and global relationship capture, transformer-based language model (TLM) has achieved dominant performance recently. As a decoupled structure from the recognition process, we argue that TLM's capability is limited by the input low-quality visual prediction. To be specific: 1) The visual prediction with low character-wise accuracy increases the correction burden of TLM. 2) The inconsistent word length between visual prediction and original image provides a wrong language modeling guidance in TLM. In this paper, we propose a Progressive scEne Text Recognizer (PETR) to improve the capability of transformer-based language model by handling above two problems. Firstly, a Destruction Learning Module (DLM) is proposed to consider the linguistic information in the visual context. DLM introduces the recognition of destructed images with disordered patches in the training stage. Through guiding the vision model to restore patch orders and make word-level prediction on the destructed images, visual prediction with high character-wise accuracy is obtained by exploring inner relationship between the local visual patches. Secondly, a new Language Rectification Module (LRM) is proposed to optimize the word length for language guidance rectification. Through progressively implementing LRM in different language modeling steps, a novel progressive rectification network is constructed to handle some extremely challenging cases (e.g. distortion, occlusion, etc.). By utilizing DLM and LRM, PETR enhances the capability of transformer-based language model from a more general aspect, that is, focusing on the reduction of correction burden and rectification of language modeling guidance. Compared with parallel transformer-based methods, PETR obtains 1.0% and 0.8% improvement on regular and irregular datasets respectively while introducing only 1.7M additional parameters. The extensive experiments on both English and Chinese benchmarks demonstrate that PETR achieves the state-of-the-art results.
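The Destruction Learning Module trains the vision model on images whose patches have been shuffled, with the shuffle permutation serving as the order-restoration target. The following is a minimal NumPy sketch of that patch-destruction step only; it is an illustration of the idea, not the authors' code, and the function name and patch layout are assumptions.

```python
import numpy as np

def destruct_image(image: np.ndarray, patch: int, rng: np.random.Generator):
    """Split an image into non-overlapping patches, shuffle them, and
    reassemble. Returns the destructed image and the permutation used,
    which a DLM-style model would be trained to restore."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    # Cut the image into a (gh*gw, patch, patch, c) stack of patches.
    patches = (image.reshape(gh, patch, gw, patch, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(gh * gw, patch, patch, c))
    order = rng.permutation(gh * gw)      # restoration target
    shuffled = patches[order]
    # Reassemble the shuffled patches into a full-size image.
    destructed = (shuffled.reshape(gh, gw, patch, patch, c)
                          .transpose(0, 2, 1, 3, 4)
                          .reshape(h, w, c))
    return destructed, order

# Usage: a 32x128 text-line image cut into 8x8 patches.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 128, 3), dtype=np.uint8)
destructed, order = destruct_image(img, 8, rng)
```

During training, the model would receive `destructed` and be supervised both to predict the word and to recover `order`, which is what drives it to learn relationships between local patches.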


Similar Articles

1. PETR: Rethinking the Capability of Transformer-Based Language Model in Scene Text Recognition.
   IEEE Trans Image Process. 2022;31:5585-5598. doi: 10.1109/TIP.2022.3197981. Epub 2022 Aug 30.
2. ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting.
   IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):7123-7141. doi: 10.1109/TPAMI.2022.3223908. Epub 2023 May 5.
3. Display-Semantic Transformer for Scene Text Recognition.
   Sensors (Basel). 2023 Sep 28;23(19):8159. doi: 10.3390/s23198159.
4. Image-to-Character-to-Word Transformers for Accurate Scene Text Recognition.
   IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):12908-12921. doi: 10.1109/TPAMI.2022.3230962. Epub 2023 Oct 3.
5. Scene Uyghur Recognition Based on Visual Prediction Enhancement.
   Sensors (Basel). 2023 Oct 20;23(20):8610. doi: 10.3390/s23208610.
6. ASTER: An Attentional Scene Text Recognizer with Flexible Rectification.
   IEEE Trans Pattern Anal Mach Intell. 2019 Sep;41(9):2035-2048. doi: 10.1109/TPAMI.2018.2848939. Epub 2018 Jun 25.
7. GLaLT: Global-Local Attention-Augmented Light Transformer for Scene Text Recognition.
   IEEE Trans Neural Netw Learn Syst. 2024 Jul;35(7):10145-10158. doi: 10.1109/TNNLS.2023.3239696. Epub 2024 Jul 8.
8. SLOAN: Scale-Adaptive Orientation Attention Network for Scene Text Recognition.
   IEEE Trans Image Process. 2021;30:1687-1701. doi: 10.1109/TIP.2020.3045602. Epub 2021 Jan 14.
9. ET-Network: A novel efficient transformer deep learning model for automated Urdu handwritten text recognition.
   PLoS One. 2024 May 17;19(5):e0302590. doi: 10.1371/journal.pone.0302590. eCollection 2024.
10. Application of the transformer model algorithm in chinese word sense disambiguation: a case study in chinese language.
   Sci Rep. 2024 Mar 15;14(1):6320. doi: 10.1038/s41598-024-56976-5.

Cited By

1. Vision transformer architecture and applications in digital health: a tutorial and survey.
   Vis Comput Ind Biomed Art. 2023 Jul 10;6(1):14. doi: 10.1186/s42492-023-00140-9.