Wady Shakhawan Hares, Badawi Soran, Kurt Fatih
Department of Business Administration, Charmo University, KRG, Chamchamal, Kurdistan, Iraq.
Language Center, Charmo University, KRG, Chamchamal, Kurdistan, Iraq.
Data Brief. 2024 Sep 28;57:110967. doi: 10.1016/j.dib.2024.110967. eCollection 2024 Dec.
Sentiment analysis is the task of extracting, identifying, characterizing, and classifying textual data to understand and categorize the attitudes and opinions that individuals express. While other languages have extensive datasets in this field, sentiment analysis datasets for the Kurdish language are extremely limited, which highlights the need to build such datasets to advance the field. This paper presents a Twitter dataset comprising 24,668 tweets retained from an initial sample of 30,009 texts. Human annotators labelled the tweets for subjectivity, sentiment, offensiveness, and target. After the initial annotation, an independent reviewer examined all labelled data to ensure a robust dataset. The cleaned dataset includes 8,772 subjective tweets and 15,896 non-subjective tweets. Regarding sentiment, 12,938 tweets were classified as negative, 3,189 as neutral, and 8,541 as positive. Moreover, 22,436 tweets were non-offensive, while 2,232 were offensive. The dataset also distinguishes between targeted and non-targeted tweets, with 22,436 tweets not aimed at specific individuals or entities and 2,232 directed at particular targets. This dataset serves as an essential resource for scholars building state-of-the-art models for the Kurdish language.
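The reported label counts can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary layout and the `shares` helper are assumptions made for this example, not part of the published dataset or its tooling. The counts themselves are copied from the abstract.

```python
# Label counts as reported in the abstract; the structure is hypothetical.
INITIAL_SAMPLE = 30_009
TOTAL = 24_668

counts = {
    "subjectivity": {"subjective": 8_772, "non-subjective": 15_896},
    "sentiment": {"negative": 12_938, "neutral": 3_189, "positive": 8_541},
    "offensiveness": {"non-offensive": 22_436, "offensive": 2_232},
    "target": {"non-targeted": 22_436, "targeted": 2_232},
}


def shares(labels):
    """Return each label's percentage of its dimension, rounded to one decimal."""
    total = sum(labels.values())
    return {name: round(100 * n / total, 1) for name, n in labels.items()}


# Texts dropped between the initial sample and the cleaned dataset.
removed = INITIAL_SAMPLE - TOTAL

for dimension, labels in counts.items():
    # Every annotation dimension should account for all retained tweets.
    assert sum(labels.values()) == TOTAL, dimension
    print(dimension, shares(labels))

print("removed during cleaning:", removed)
```

Each of the four annotation dimensions sums to the 24,668 retained tweets, so every tweet carries exactly one label per dimension under this reading of the abstract.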