Salih Sardar Omar, Jacksi Karwan
Web Technology Dept., Duhok Technical Institute, Duhok Polytechnic University, Duhok, Iraq.
Information Technology Dept., Technical College of Informatics, Akre University of Applied Sciences, Duhok, Iraq.
Data Brief. 2025 May 14;60:111648. doi: 10.1016/j.dib.2025.111648. eCollection 2025 Jun.
Scene Text Recognition (STR) has advanced significantly in recent years, yet languages utilizing Arabic-based scripts, such as Kurdish, remain underrepresented in existing datasets. This paper introduces KSTRV1, the first large-scale dataset designed for Kurdish Scene Text Recognition (KSTR), addressing the lack of resources for non-Latin scripts. The dataset comprises 1,420 natural scene images and 19,872 cropped word samples, covering Kurdish (Sorani and Badini dialects), Arabic, and English. Additionally, 20,000 synthetic text instances have been generated to enhance the dataset's diversity, quantity, and quality by incorporating varied fonts, orientations, distortions, and background complexities. KSTRV1 captures the multilingual landscape of the Kurdistan Region while addressing real-world challenges like occlusion, lighting variations, and script complexity. The dataset includes detailed annotations with bounding boxes, language identification, and text orientation labels, ensuring comprehensive support for training and evaluating STR models. By providing both natural and synthetic data, KSTRV1 enables the development of robust text recognition models, particularly for Central Kurdish, a low-resource language. The KSTRV1 dataset is publicly available at https://doi.org/10.5281/zenodo.15038953 and is expected to significantly contribute to research in multilingual STR, document analysis, and optical character recognition (OCR), facilitating more inclusive and accurate text recognition systems.