Chatterjee Ishani, Zhou Mengchu, Abusorrah Abdullah, Sedraoui Khaled, Alabdulwahab Ahmed
Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102, USA.
Department of Electrical and Computer Engineering, Faculty of Engineering, and Center of Research Excellence in Renewable Energy and Power Systems, King Abdulaziz University, Jeddah 21481, Saudi Arabia.
Entropy (Basel). 2021 Dec 7;23(12):1645. doi: 10.3390/e23121645.
People now use the internet to share their assessments, impressions, ideas, and observations about various subjects and products on numerous social networking sites. These sites are a rich source of data for data analytics, sentiment analysis, natural language processing, and related tasks. Conventionally, the true sentiment of a customer review is assumed to match its star rating. There are exceptions, however, in which the star rating of a review is the opposite of its true sentiment. In this work, such reviews are labeled as outliers in a dataset. State-of-the-art anomaly detection methods rely on manual searching, predefined rules, or traditional machine learning techniques to detect such instances. This paper conducts a sentiment analysis and outlier detection case study on Amazon customer reviews and proposes a statistics-based outlier detection and correction method (SODCM), which identifies such reviews and rectifies their star ratings to improve the performance of a sentiment analysis algorithm without any data loss. SODCM is applied to datasets containing customer reviews of various products, which are (a) scraped from Amazon.com and (b) publicly available. The paper also studies these datasets and draws conclusions about the effect of SODCM on the performance of a sentiment analysis algorithm. The results show that SODCM achieves higher accuracy and recall than other state-of-the-art anomaly detection algorithms.
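The abstract does not give the details of SODCM, so the following is only an illustrative sketch of the general idea it describes: flag a review whose sentiment is statistically inconsistent with its star rating, then correct the rating instead of discarding the review. Everything here is an assumption made for illustration and is not the authors' method: the function name `detect_and_correct`, the per-review sentiment score in [-1, 1], the z-score test within each star group, the threshold of 2.0, and the nearest-group-mean correction rule.

```python
from collections import defaultdict
from statistics import mean, pstdev


def detect_and_correct(reviews, z_threshold=2.0):
    """Flag and relabel reviews whose sentiment disagrees with their stars.

    reviews: list of (stars, sentiment) pairs, where sentiment is a
    hypothetical polarity score in [-1, 1] from any sentiment scorer.
    Returns one (stars, corrected_stars, is_outlier) tuple per review,
    so no review is dropped.
    """
    # Group sentiment scores by star rating.
    groups = defaultdict(list)
    for stars, score in reviews:
        groups[stars].append(score)

    # Mean and population standard deviation of sentiment per star group.
    stats = {s: (mean(v), pstdev(v)) for s, v in groups.items()}

    results = []
    for stars, score in reviews:
        mu, sigma = stats[stars]
        # A group with zero spread cannot yield a meaningful z-score.
        z = (score - mu) / sigma if sigma > 0 else 0.0
        is_outlier = abs(z) > z_threshold
        if is_outlier:
            # Assumed correction rule: move the review to the star group
            # whose mean sentiment is closest to the review's own score.
            corrected = min(stats, key=lambda s: abs(stats[s][0] - score))
        else:
            corrected = stars
        results.append((stars, corrected, is_outlier))
    return results


if __name__ == "__main__":
    # Nine consistent 5-star reviews, one 5-star review with strongly
    # negative text, and five consistent 1-star reviews.
    sample = [(5, 0.8)] * 9 + [(5, -0.9)] + [(1, -0.8)] * 5
    for row in detect_and_correct(sample):
        print(row)
```

Because the sketch relabels flagged reviews rather than removing them, the dataset keeps its original size, which mirrors the "without any data loss" property claimed in the abstract. A real pipeline would need a sentiment scorer and a detection rule validated against labeled data.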