

PF-ViT: Parallel and Fast Vision Transformer for Offline Handwritten Chinese Character Recognition.

Affiliations

School of Electronic Information, Zhongyuan University of Technology, Zhengzhou 450007, Henan, China.

Publication Information

Comput Intell Neurosci. 2022 Sep 28;2022:8255763. doi: 10.1155/2022/8255763. eCollection 2022.

Abstract

Recently, the Vision Transformer (ViT) has been widely used in the field of image recognition. Unfortunately, the ViT model repeatedly stacks 12 encoder layers, resulting in heavy computation, a large number of parameters, and slow training, making it difficult to deploy on mobile devices. In order to reduce the computational complexity of the model and improve training speed, a parallel and fast Vision Transformer method for offline handwritten Chinese character recognition is proposed. The method adds parallel branches of the encoder module to the structure of the Vision Transformer model; the parallel modes include two-way, four-way, and seven-way parallelism. The original image is fed to the encoder module after patch flattening and linear embedding. The core step in the encoder is the multihead self-attention mechanism, which learns the interdependence between image sequence blocks. In addition, data augmentation strategies are used to increase the diversity of the data. In the two-way parallel experiment, when the model reaches 98.1% accuracy on the dataset, the number of parameters and the number of FLOPs are 43.11 million and 4.32 G, respectively. Compared with the ViT model, whose parameters and FLOPs are 86 million and 16.8 G, respectively, the two-way parallel model has a 50.1% decrease in parameters and a 34.6% decrease in FLOPs. This method is shown to effectively reduce the computational complexity of the model while indirectly improving image recognition speed.
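To make the parallel-branch idea concrete, below is a minimal, hypothetical PyTorch sketch of a two-way parallel ViT-style classifier: patches are flattened and linearly embedded, each branch is a shallow stack of multihead self-attention encoder layers, and the branch outputs are fused by averaging. The branch depth, the averaging fusion, the 3755-class output (GB2312 level-1 characters), and all hyperparameters are illustrative assumptions and do not reproduce the authors' implementation.

```python
# Hypothetical sketch of a two-way parallel ViT-style encoder for handwritten
# Chinese character recognition; assumes 224x224 grayscale inputs, 16x16 patches,
# and PyTorch. Not the PF-ViT authors' code.
import torch
import torch.nn as nn

class ParallelViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_chans=1,
                 embed_dim=768, depth_per_branch=6, num_branches=2,
                 num_heads=12, num_classes=3755):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Flatten each patch and project it with a linear embedding
        # (implemented here as a strided convolution).
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        # Each parallel branch is a shallow stack of Transformer encoder layers
        # with multihead self-attention; all branches see the same token sequence.
        self.branches = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True),
                num_layers=depth_per_branch)
            for _ in range(num_branches)
        ])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        b = x.shape[0]
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        # Fuse the parallel branches by averaging (one possible fusion choice).
        x = torch.stack([branch(x) for branch in self.branches]).mean(dim=0)
        x = self.norm(x)
        return self.head(x[:, 0])                             # classify on the CLS token

model = ParallelViT()
logits = model(torch.randn(2, 1, 224, 224))
print(logits.shape)  # torch.Size([2, 3755])
```

Splitting the 12 sequential encoder layers into shallower branches that run side by side is one plausible reading of the "two-way parallel" mode described in the abstract; the exact fusion strategy and per-branch depth should be taken from the paper itself.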


