KSTRV1: A scene text recognition dataset for central Kurdish in (Arabic-Based) script.

Author information

Salih Sardar Omar, Jacksi Karwan

Affiliations

Web Technology Dept., Duhok Technical Institute, Duhok Polytechnic University, Duhok, Iraq.

Information Technology Dept., Technical College of Informatics, Akre University of Applied Sciences, Duhok, Iraq.

Publication information

Data Brief. 2025 May 14;60:111648. doi: 10.1016/j.dib.2025.111648. eCollection 2025 Jun.

DOI:10.1016/j.dib.2025.111648

PMID:40496736

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12151206/

Abstract

Scene Text Recognition (STR) has advanced significantly in recent years, yet languages utilizing Arabic-based scripts, such as Kurdish, remain underrepresented in existing datasets. This paper introduces KSTRV1, the first large-scale dataset designed for Kurdish Scene Text Recognition (KSTR), addressing the lack of resources for non-Latin scripts. The dataset comprises 1,420 natural scene images and 19,872 cropped word samples, covering Kurdish (Sorani and Badini dialects), Arabic, and English. Additionally, 20,000 synthetic text instances have been generated to enhance the dataset's diversity, quantity, and quality by incorporating varied fonts, orientations, distortions, and background complexities. KSTRV1 captures the multilingual landscape of the Kurdistan Region while addressing real-world challenges like occlusion, lighting variations, and script complexity. The dataset includes detailed annotations with bounding boxes, language identification, and text orientation labels, ensuring comprehensive support for training and evaluating STR models. By providing both natural and synthetic data, KSTRV1 enables the development of robust text recognition models, particularly for Central Kurdish, a low-resource language. The KSTRV1 dataset is publicly available at https://doi.org/10.5281/zenodo.15038953 and is expected to significantly contribute to research in multilingual STR, document analysis, and optical character recognition (OCR), facilitating more inclusive and accurate text recognition systems.
