Najam Rayyan, Faizullah Safiullah
Department of Computer Science, Islamic University, Madinah 42351, Saudi Arabia.
Data Brief. 2024 Aug 8;56:110813. doi: 10.1016/j.dib.2024.110813. eCollection 2024 Oct.
Deep learning-based optical character recognition (OCR) is an active area of research in which deep neural networks are trained on data to extract text from images. Despite many advances in the field in general, the Arabic OCR domain notably lacks a dataset for ancient manuscripts. Here, we fill this gap by providing both the images and the textual ground truth for a collection of ancient Arabic manuscripts. This scarce data was collected from the central library of the Islamic University of Madinah and encompasses rich text spanning different geographies across centuries. Specifically, the dataset contains forty pages drawn from eight ancient books, each provided as an image together with its text as transcribed by experts. Because no comparable data is publicly available, the dataset is of significant value to researchers and practitioners for developing, augmenting, validating, and testing deep learning models and for assessing their generalization, both for Arabic OCR and for Arabic text correction.