A scarce dataset for ancient Arabic handwritten text recognition.

Author information

Najam Rayyan, Faizullah Safiullah

Affiliations

Department of Computer Science, Islamic University, Madinah 42351, Saudi Arabia.

Publication information

Data Brief. 2024 Aug 8;56:110813. doi: 10.1016/j.dib.2024.110813. eCollection 2024 Oct.

DOI:10.1016/j.dib.2024.110813

PMID:39252777

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11381460/

Abstract

Deep learning-based optical character recognition (OCR) is an active area of research in which models based on deep neural networks are trained on data to extract text from images. Although many advances are being made in the field in general, the Arabic OCR domain notably lacks a dataset of ancient manuscripts. Here, we fill this gap by providing both images and textual ground truth for a collection of ancient Arabic manuscripts. This scarce dataset was collected from the central library of the Islamic University of Madinah and encompasses rich text spanning different geographies and centuries. Specifically, the dataset contains forty pages drawn from eight ancient books, each page provided as an image together with a text transcription produced by experts. Because such data is not otherwise publicly available, the dataset is of significant value: researchers and practitioners can use it to develop, augment, validate, and test deep learning models and to assess their generalization, for both Arabic OCR and Arabic text correction.
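As an illustration of how expert transcriptions such as these might be used for validation and testing, the following is a minimal sketch of character error rate (CER), a common OCR metric computed as the edit distance between a model's output and the ground truth, divided by the ground-truth length. The function names and the sample strings are assumptions made for this example; the abstract does not specify an evaluation protocol or metric.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from ref
                curr[j - 1] + 1,         # insert h into ref
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER of an OCR hypothesis against a ground-truth transcription."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example strings, not taken from the dataset.
ground_truth = "كتاب"
ocr_output = "كتب"
cer = character_error_rate(ground_truth, ocr_output)
```

This sketch compares Unicode code points directly. Whether to normalize the text or strip diacritics before scoring is a separate choice that the abstract does not address.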

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8af2/11381460/cbb5af63b529/gr1.jpg
