Classification of movie reviews using term frequency-inverse document frequency and optimized machine learning algorithms.

Authors

Naeem Muhammad Zaid, Rustam Furqan, Mehmood Arif, Ashraf Imran, Choi Gyu Sang

Affiliations

Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan, Pakistan.

Department of Computer Science & Information Technology, The Islamia University of Bahawalpur, Bahawalpur, Pakistan.

Publication information

PeerJ Comput Sci. 2022 Mar 15;8:e914. doi: 10.7717/peerj-cs.914. eCollection 2022.

Abstract

The Internet Movie Database (IMDb), one of the most popular online databases for movies and personalities, hosts a wide range of movie reviews from millions of users. These reviews form a large and diverse dataset for analyzing users' sentiments about movies and personalities. Although the reviews are a useful source of movie critique, they are too numerous to read in full, so automated tools are required to extract insights about the sentiments they express. This study implements several machine learning models to measure the polarity of the sentiments in user reviews on the IMDb website. The reviews are first preprocessed to remove redundant information and noise; classification models including support vector machines (SVM), the Naïve Bayes classifier, random forest, and gradient boosting classifiers are then used to predict the sentiment of each review. The objective is to find the process and approach that attain the highest accuracy with the best generalization. Feature engineering approaches such as term frequency-inverse document frequency (TF-IDF), bag of words, global vectors for word representation (GloVe), and Word2Vec are applied, together with hyperparameter tuning of the classification models, to improve classification accuracy. Experimental results indicate that the SVM obtains the highest accuracy, 89.55%, when used with TF-IDF features. The classification accuracy of the models is reduced by contradictions between the sentiments users express in the reviews and the assigned labels. To tackle this issue, TextBlob is used to assign a sentiment to each review in the dataset before it is used for training. Experimental results on the TextBlob-assigned sentiments indicate that the proposed model can reach an accuracy of 92%.
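To make the TF-IDF feature step concrete, the following is a minimal sketch in plain Python. The abstract does not state which TF-IDF variant, tokenizer, or library the authors used, so everything here is an assumption for illustration: the helper names `tokenize` and `tfidf` are hypothetical, term frequency is taken as count divided by document length, and inverse document frequency as the natural log of N / df. The three sample reviews are invented and are not drawn from the IMDb dataset.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text, replace punctuation with spaces, split on whitespace."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower()
    )
    return cleaned.split()


def tfidf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant (not taken from the paper):
    tf = count / document length, idf = ln(N / df).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Invented sample reviews, for illustration only.
reviews = [
    "A great movie with a great cast",
    "A dull movie with a weak plot",
    "Great plot, great pacing",
]
vectors = tfidf(reviews)
```

Under this variant, a term that occurs in every document receives a weight of zero, while terms concentrated in a few reviews are weighted up. The resulting per-review weights are the kind of features a classifier such as an SVM would then be trained on.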

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/994e/9044332/c8a74a600c85/peerj-cs-08-914-g001.jpg
