Mohan E Syam, Sunitha R
Department of Computer Science, School of Engineering and Technology, Pondicherry University, Puducherry, India.
Data Brief. 2023 Jul 26;50:109452. doi: 10.1016/j.dib.2023.109452. eCollection 2023 Oct.
Regional languages are increasingly used on online platforms as the use of digital technology expands. Understanding user opinions expressed in Indian regional languages on social media, forums, blogs, and other digital platforms has become important because of their role in various applications. However, research on sentiment analysis of Indian regional language texts is hampered by the lack of available regional language datasets. The curated Malayalam Aspect Based Sentiment Analysis (MABSA) dataset is a labeled dataset for Aspect Based Sentiment Analysis (ABSA) in the Indian regional language Malayalam, covering the movie review domain. Malayalam movie reviews, an excellent source of text data for ABSA, were collected through an online survey using Google Forms and by manually gathering reviews from three social media platforms: IMDb, Facebook, and YouTube. Nine target aspects were identified, and three annotators annotated the dataset based on the sentiment polarity of each aspect. A total of 4000 reviews were collected, and 7507 aspects were identified in them. Spearman's correlation and the Fleiss' kappa index were used to analyze agreement among the annotators. The high correlation between the annotators indicates that the MABSA dataset is of gold-standard quality.
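As an illustration of the agreement analysis mentioned in the abstract, the following is a minimal sketch of Fleiss' kappa for three annotators. It is not the authors' code: the function name `fleiss_kappa`, the label set (`pos`, `neg`, `neu`), and the toy ratings are assumptions made for this example, since the abstract does not list the polarity labels or any annotation data.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings    -- list of lists; ratings[i] holds the labels given to item i
    categories -- iterable of all possible labels
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every item must have the same number of ratings")

    # counts[i][cat] = number of raters who assigned label `cat` to item i
    counts = [Counter(row) for row in ratings]

    # Mean observed agreement across items
    p_bar = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_items

    # Agreement expected by chance, from the overall label proportions
    p_e = sum(
        (sum(c[cat] for c in counts) / (n_items * n_raters)) ** 2
        for cat in categories
    )

    # Note: undefined (division by zero) when every rating is the same label
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy data: 4 aspect mentions, 3 annotators each
LABELS = ("pos", "neg", "neu")  # assumed label set, not taken from the paper
toy_ratings = [
    ["pos", "pos", "pos"],
    ["neg", "neg", "pos"],
    ["neu", "neu", "neu"],
    ["neg", "neg", "neg"],
]
kappa = fleiss_kappa(toy_ratings, LABELS)
print(round(kappa, 3))
```

Values close to 1 indicate near-perfect agreement among annotators, while values near 0 indicate agreement no better than chance.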