Ruldeviyani Yova, Suhartanto Heru, Sotardodo Beltsazar Anugrah, Fahreza Muhammad Hanif, Septiano Andre, Rachmadi Muhammad Febrian
Faculty of Computer Science, Universitas Indonesia, Depok, Jawa Barat, 16424, Indonesia.
Heliyon. 2024 Aug 10;10(16):e35959. doi: 10.1016/j.heliyon.2024.e35959. eCollection 2024 Aug 30.
The Pegon script is an Arabic-based writing system used for the Javanese, Sundanese, Madurese, and Indonesian languages. For various reasons, the script now survives mainly among collectors and in private Islamic boarding schools (pesantren), which creates a need for its preservation. One preservation method is digitization: transcribing manuscripts into machine-encoded text through optical character recognition (OCR). No published literature exists on OCR systems for this specific script. This research explores OCR of typed Pegon manuscripts and introduces novel synthesized and real annotated datasets for the task. The datasets are used to evaluate the proposed OCR methods, especially those adapted from existing Arabic OCR systems. Results show that deep learning techniques outperform conventional ones, which fail to detect Pegon text. The proposed system uses YOLOv5 for line segmentation and a CTC-CRNN architecture for line-level text recognition, achieving an F1-score of 0.94 for segmentation and a character error rate (CER) of 0.03 for recognition.
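The two reported metrics can be illustrated with a minimal sketch. The abstract does not say how they were computed, so this assumes the common definitions: CER as the Levenshtein (edit) distance between the reference and predicted line, divided by the reference length, and F1 as the harmonic mean of precision and recall over matched line detections. The function names, the placeholder strings, and the detection counts are hypothetical and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive, and false-negative detection counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Placeholder Latin strings stand in for a Pegon reference line and an OCR output.
    print(cer("abcd", "abed"))
    # Hypothetical counts of matched, spurious, and missed line boxes.
    print(f1_score(tp=94, fp=6, fn=6))
```

Because the distance is counted per code point, Arabic-script diacritics encoded as separate combining marks each count as one character. A real evaluation would need to fix a normalization convention before comparing CER values.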