ASM Based Synthesis of Handwritten Arabic Text Pages.

Authors

Dinges Laslo, Al-Hamadi Ayoub, Elzobi Moftah, El-Etriby Sherif, Ghoneim Ahmed

Affiliations

Institute for Information Technology and Communications (IIKT), Otto-von-Guericke-University Magdeburg, 39016 Magdeburg, Germany.

Umm Al-Qura University, Makkah 21421, Saudi Arabia ; Faculty of Computers and Information, Menoufia University MUFIC, Menofia 32721, Egypt.

Publication information

ScientificWorldJournal. 2015;2015:323575. doi: 10.1155/2015/323575. Epub 2015 Jul 30.

Abstract

Document analysis tasks such as text recognition, word spotting, and segmentation depend heavily on comprehensive, suitable databases for training and validation. However, generating such databases is expensive in terms of labor and time. In fact, such databases are scarce, which complicates research and development. This is especially true for Arabic handwriting recognition, which involves different preprocessing, segmentation, and recognition methods, each with its own demands on samples and ground truth. To bypass this problem, we present an efficient system that automatically turns Arabic Unicode text into synthetic images of handwritten documents with detailed ground truth. Active Shape Models (ASMs) based on 28046 online samples were used for character synthesis, and statistical properties were extracted from the IESK-arDB database to simulate baselines and word slant or skew. In the synthesis step, ASM-based representations are composed into words and text pages, smoothed by B-spline interpolation, and rendered taking writing speed and pen characteristics into account. Finally, we use the synthetic data to validate a segmentation method. An experimental comparison with the IESK-arDB database supports training and testing document analysis methods on synthetic samples whenever sufficient natural ground-truthed data is not available.
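The two core steps named in the abstract, drawing a character shape from a statistical shape model and smoothing it with a B-spline, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the shape part of an ASM is a point distribution model of the form x = mean + sum of b_k times mode_k, and it uses a uniform cubic B-spline with the sampled points as control points. The names `sample_shape`, `cubic_bspline`, `MEAN`, `MODES`, and `VARIANCES`, as well as the toy numbers, are hypothetical; in practice the modes and variances would typically be estimated from aligned training trajectories rather than written by hand.

```python
import random

# Hypothetical toy model: 4 pen points of one stroke, flattened as [x0, y0, x1, y1, ...].
MEAN = [0.0, 0.0, 1.0, 2.0, 2.0, 2.0, 3.0, 0.0]
MODES = [[0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]]  # raises or lowers the arch
VARIANCES = [0.04]  # assumed variance of the single mode


def sample_shape(mean_shape, modes, variances, rng, limit=3.0):
    """Draw one shape x = mean + sum_k b_k * mode_k from a point distribution model.

    Each coefficient b_k is drawn from a zero-mean Gaussian with the mode's
    variance and clamped to +/- `limit` standard deviations.
    """
    shape = list(mean_shape)
    for mode, var in zip(modes, variances):
        sd = var ** 0.5
        b = max(-limit * sd, min(limit * sd, rng.gauss(0.0, sd)))
        shape = [s + b * m for s, m in zip(shape, mode)]
    # Regroup the flat vector into (x, y) points.
    return list(zip(shape[0::2], shape[1::2]))


def cubic_bspline(points, samples_per_span=8):
    """Evaluate a uniform cubic B-spline that uses `points` as control points."""
    # Tripling the end points makes the curve start and end exactly on them.
    ctrl = [points[0]] * 2 + list(points) + [points[-1]] * 2
    curve = []
    for i in range(len(ctrl) - 3):
        p0, p1, p2, p3 = ctrl[i:i + 4]
        for s in range(samples_per_span):
            t = s / samples_per_span
            # Uniform cubic B-spline basis weights; they sum to 1 for any t.
            w0 = (1 - t) ** 3 / 6
            w1 = (3 * t**3 - 6 * t**2 + 4) / 6
            w2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6
            w3 = t**3 / 6
            curve.append(tuple(w0 * a + w1 * b + w2 * c + w3 * d
                               for a, b, c, d in zip(p0, p1, p2, p3)))
    curve.append(tuple(points[-1]))
    return curve


if __name__ == "__main__":
    rng = random.Random(0)
    stroke = cubic_bspline(sample_shape(MEAN, MODES, VARIANCES, rng))
    print(len(stroke), stroke[0], stroke[-1])
```

Note that this B-spline approximates the sampled points rather than passing through the interior ones, so it smooths the trajectory; whether the paper's interpolation step behaves this way is not stated in the abstract.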

