KSTRV1: A scene text recognition dataset for central Kurdish in (Arabic-Based) script.

Authors

Salih Sardar Omar, Jacksi Karwan

Affiliations

Web Technology Dept., Duhok Technical Institute, Duhok Polytechnic University, Duhok, Iraq.

Information Technology Dept., Technical College of Informatics, Akre University of Applied Sciences, Duhok, Iraq.

Publication

Data Brief. 2025 May 14;60:111648. doi: 10.1016/j.dib.2025.111648. eCollection 2025 Jun.

DOI: 10.1016/j.dib.2025.111648
PMID: 40496736
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12151206/
Abstract

Scene Text Recognition (STR) has advanced significantly in recent years, yet languages utilizing Arabic-based scripts, such as Kurdish, remain underrepresented in existing datasets. This paper introduces KSTRV1, the first large-scale dataset designed for Kurdish Scene Text Recognition (KSTR), addressing the lack of resources for non-Latin scripts. The dataset comprises 1,420 natural scene images and 19,872 cropped word samples, covering Kurdish (Sorani and Badini dialects), Arabic, and English. Additionally, 20,000 synthetic text instances have been generated to enhance the dataset's diversity, quantity, and quality by incorporating varied fonts, orientations, distortions, and background complexities. KSTRV1 captures the multilingual landscape of the Kurdistan Region while addressing real-world challenges like occlusion, lighting variations, and script complexity. The dataset includes detailed annotations with bounding boxes, language identification, and text orientation labels, ensuring comprehensive support for training and evaluating STR models. By providing both natural and synthetic data, KSTRV1 enables the development of robust text recognition models, particularly for Central Kurdish, a low-resource language. The KSTRV1 dataset is publicly available at https://doi.org/10.5281/zenodo.15038953 and is expected to significantly contribute to research in multilingual STR, document analysis, and optical character recognition (OCR), facilitating more inclusive and accurate text recognition systems.
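
The abstract says each annotation carries a bounding box, a language label, and a text orientation label, but it does not give the file format or field names. The following is a minimal sketch of how such word-level records might be represented and summarized. The `WordAnnotation` class, its field names, the label values, and the sample records are all hypothetical and are not taken from the KSTRV1 release.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class WordAnnotation:
    """One annotated word region. All field names here are hypothetical."""
    image_id: str
    bbox: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    language: str                    # assumed labels: "kurdish", "arabic", "english"
    orientation: str                 # assumed labels: "horizontal", "vertical"
    text: str


def clamp_box(ann, image_width, image_height):
    """Clamp an annotation's box to the image bounds before cropping."""
    x0, y0, x1, y1 = ann.bbox
    x0, x1 = max(0, x0), min(image_width, x1)
    y0, y1 = max(0, y0), min(image_height, y1)
    if x0 >= x1 or y0 >= y1:
        raise ValueError(f"empty box after clamping: {ann.bbox}")
    return (x0, y0, x1, y1)


def count_by_language(annotations):
    """Count word samples per language label."""
    return Counter(a.language for a in annotations)


# Made-up records for illustration only; they are not dataset entries.
samples = [
    WordAnnotation("img_0001", (10, 20, 110, 60), "kurdish", "horizontal", "بازاڕ"),
    WordAnnotation("img_0001", (-5, 70, 90, 100), "english", "horizontal", "Market"),
    WordAnnotation("img_0002", (30, 15, 70, 200), "kurdish", "vertical", "نان"),
]

language_counts = count_by_language(samples)
clamped = clamp_box(samples[1], image_width=640, image_height=480)
```

Clamping the box first guards against annotations that extend slightly past the image edge, which would otherwise produce an invalid crop region. The actual annotation schema should be checked against the files in the Zenodo record before adapting this sketch.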

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92e3/12151206/9ae7e0c9d06c/gr1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92e3/12151206/a36004681f4e/gr2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92e3/12151206/9b07c28bc14b/gr3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92e3/12151206/51ffc64e5e1c/gr4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92e3/12151206/dbad3045dbf0/gr5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92e3/12151206/4d3afbdda73a/gr6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92e3/12151206/674c8025eaa4/gr7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92e3/12151206/b7722f9e0547/gr8.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92e3/12151206/bc280ecbaf76/gr9.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92e3/12151206/a5241accbc78/gr10.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92e3/12151206/855404a49e42/gr11.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92e3/12151206/1439934e1f05/gr12.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/92e3/12151206/095d237f66d8/gr13.jpg
