ASM Based Synthesis of Handwritten Arabic Text Pages.

Author information

Dinges Laslo, Al-Hamadi Ayoub, Elzobi Moftah, El-Etriby Sherif, Ghoneim Ahmed

Affiliations

Institute for Information Technology and Communications (IIKT), Otto-von-Guericke-University Magdeburg, 39016 Magdeburg, Germany.

Umm Al-Qura University, Makkah 21421, Saudi Arabia; Faculty of Computers and Information, Menoufia University MUFIC, Menofia 32721, Egypt.

Publication information

ScientificWorldJournal. 2015;2015:323575. doi: 10.1155/2015/323575. Epub 2015 Jul 30.

DOI:10.1155/2015/323575

PMID:26295059

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4534626/

Abstract

Document analysis tasks such as text recognition, word spotting, and segmentation depend heavily on comprehensive and suitable databases for training and validation. However, generating such databases is expensive in terms of labor and time. In fact, such databases are scarce, which complicates research and development. This is especially true for Arabic handwriting recognition, which involves different preprocessing, segmentation, and recognition methods, each with its own demands on samples and ground truth. To bypass this problem, we present an efficient system that automatically turns Arabic Unicode text into synthetic images of handwritten documents with detailed ground truth. Active Shape Models (ASMs) based on 28046 online samples were used for character synthesis, and statistical properties were extracted from the IESK-arDB database to simulate baselines and word slant or skew. In the synthesis step, ASM-based representations are composed into words and text pages, smoothed by B-spline interpolation, and rendered with writing speed and pen characteristics taken into account. Finally, we use the synthetic data to validate a segmentation method. An experimental comparison with the IESK-arDB database encourages training and testing document analysis methods on synthetic samples whenever sufficient natural ground-truthed data is not available.
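The two core steps named in the abstract, sampling a character shape from a statistical shape model and smoothing the resulting pen trajectory, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a point distribution model of the usual ASM form x = mean + Σ b_k p_k, with the mean shape, modes, and eigenvalues already computed elsewhere (the toy values below are invented for illustration). The function names `sample_shape` and `bspline_smooth` are hypothetical. The smoothing step uses a uniform cubic B-spline that treats the landmarks as control points, so it approximates rather than strictly interpolates the interior landmarks.

```python
import random


def sample_shape(mean_shape, modes, eigenvalues, rng, limit=3.0):
    """Draw one shape x = mean + sum_k b_k * p_k from a point distribution model.

    mean_shape:  flat list [x0, y0, x1, y1, ...] (assumed precomputed)
    modes:       list of eigenvectors, each as long as mean_shape
    eigenvalues: variance of each mode
    """
    shape = list(mean_shape)
    for mode, var in zip(modes, eigenvalues):
        sd = var ** 0.5
        # Clamp each coefficient to +/- limit standard deviations so the
        # sampled shape stays within the variation seen in training.
        b = max(-limit * sd, min(limit * sd, rng.gauss(0.0, sd)))
        for i, component in enumerate(mode):
            shape[i] += b * component
    # Regroup the flat vector into (x, y) landmark pairs.
    return list(zip(shape[0::2], shape[1::2]))


def bspline_smooth(points, samples_per_segment=8):
    """Evaluate a uniform cubic B-spline using `points` as control points."""
    if len(points) < 2:
        return list(points)
    # Repeat the end points so the curve starts and ends at the pen endpoints.
    ctrl = [points[0]] * 2 + list(points) + [points[-1]] * 2
    curve = []
    for i in range(len(ctrl) - 3):
        p0, p1, p2, p3 = ctrl[i:i + 4]
        for s in range(samples_per_segment):
            t = s / samples_per_segment
            # Uniform cubic B-spline basis functions (they sum to 1).
            b0 = (1 - t) ** 3 / 6
            b1 = (3 * t**3 - 6 * t**2 + 4) / 6
            b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6
            b3 = t**3 / 6
            curve.append((
                b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0],
                b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1],
            ))
    curve.append(points[-1])
    return curve


# Toy model (invented values): three landmarks, one mode that moves the
# middle landmark vertically.
rng = random.Random(0)
mean_shape = [0.0, 0.0, 1.0, 1.0, 2.0, 0.0]
modes = [[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]]
eigenvalues = [0.04]

landmarks = sample_shape(mean_shape, modes, eigenvalues, rng)
stroke = bspline_smooth(landmarks)
```

Each call to `sample_shape` with a fresh random draw yields a slightly different character variant, which is the property that makes such models useful for generating varied synthetic handwriting. A real pipeline would additionally need the composition, baseline, slant, and rendering stages the abstract mentions.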

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7084/4534626/738ff5bbfdca/TSWJ2015-323575.001.jpg
