Soomro Mudasar Ahmed, Memon Rafia Naz, Chandio Asghar Ali, Leghari Mehwish, Soomro Muhammad Hanif
Department of Information Technology, Quaid-e-Awam University of Engineering, Science & Technology, Nawabshah, Pakistan.
Department of Software Engineering, Quaid-e-Awam University of Engineering, Science & Technology, Nawabshah, Pakistan.
Data Brief. 2024 Nov 23;57:111170. doi: 10.1016/j.dib.2024.111170. eCollection 2024 Dec.
Roman Urdu text is widespread on many websites. People often prefer to post social comments or product reviews in Roman Urdu, which is regarded as a non-standard language. The main reason is that Roman Urdu has no spelling rules, so people create and post their own spellings; for example, "2mro" is a non-standard spelling of "tomorrow". This paper presents two Roman Urdu datasets. The first contains 5244 Roman Urdu words, with between one and five spelling variations included for each word. The second consists of Roman Urdu reviews collected from seven internet-based sources. Its reviews are labeled with one of five sentiment classes: "very positive," "positive," "neutral," "negative," and "very negative." A total of 28,090 reviews were gathered. The sentiment labels were assigned by domain experts familiar with the Urdu language.
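As an illustration of how a spelling-variation lexicon of this kind might be used alongside labeled reviews, the following minimal Python sketch maps variant spellings to one canonical form before any sentiment processing. It is an assumption-laden example, not the authors' method: the record does not describe the file layout of either dataset, so the in-memory dictionary `VARIANTS`, the helper names `build_lookup`, `normalize`, and `is_valid_label`, and every lexicon entry other than "2mro" are hypothetical.

```python
import re

# Hypothetical lexicon: canonical word -> known spelling variants.
# Only the "2mro" -> "tomorrow" pair comes from the abstract; the
# remaining entries are invented purely for illustration.
VARIANTS = {
    "tomorrow": ["2mro", "tmrw"],
    "acha": ["achha", "accha"],
}

# The five sentiment classes named in the abstract.
LABELS = ("very positive", "positive", "neutral", "negative", "very negative")


def build_lookup(variants):
    """Invert the lexicon so every spelling points to its canonical form."""
    lookup = {}
    for canonical, spellings in variants.items():
        canonical = canonical.lower()
        lookup[canonical] = canonical
        for spelling in spellings:
            lookup[spelling.lower()] = canonical
    return lookup


def normalize(text, lookup):
    """Lowercase, tokenize on word characters, and replace known variants."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(lookup.get(token, token) for token in tokens)


def is_valid_label(label):
    """Check that a review label is one of the five classes."""
    return label.strip().lower() in LABELS


lookup = build_lookup(VARIANTS)
print(normalize("Phone accha hai, 2mro milega", lookup))
```

Tokens that are absent from the lexicon pass through unchanged, so the sketch degrades gracefully on words the lexicon does not cover.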