Classification of movie reviews using term frequency-inverse document frequency and optimized machine learning algorithms.

Author information

Naeem Muhammad Zaid, Rustam Furqan, Mehmood Arif, Ashraf Imran, Choi Gyu Sang

Affiliations

Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan, Pakistan.

Department of Computer Science & Information Technology, The Islamia University of Bahawalpur, Bahawalpur, Pakistan.

Publication information

PeerJ Comput Sci. 2022 Mar 15;8:e914. doi: 10.7717/peerj-cs.914. eCollection 2022.

DOI: 10.7717/peerj-cs.914

PMID: 35494818

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9044332/

Abstract

The Internet Movie Database (IMDb), a popular online database for movies and personalities, hosts a wide range of movie reviews from millions of users. These reviews form a large and diverse dataset for analyzing users' sentiments about movies and personalities. Although the reviews are a useful source of critique, they are too numerous to read in full, so automated tools are required to extract insights about the sentiments they express. This study implements several machine learning models to measure the polarity of the sentiments in user reviews on the IMDb website. The reviews are first preprocessed to remove redundant information and noise; classification models including support vector machines (SVM), the Naïve Bayes classifier, random forest, and gradient boosting classifiers are then used to predict the sentiment of each review. The objective is to find the process and approach that attain the highest accuracy with the best generalization. Feature engineering approaches, namely term frequency-inverse document frequency (TF-IDF), bag of words, Global Vectors for Word Representation (GloVe), and Word2Vec, are applied together with hyperparameter tuning of the classification models to improve classification accuracy. Experimental results indicate that the SVM performs best with TF-IDF features, achieving an accuracy of 89.55%. Classification accuracy is reduced by contradictions between the sentiments users express in their reviews and the labels assigned to those reviews. To address this issue, TextBlob is used to assign a sentiment label to each review in the dataset before training. Experimental results on the TextBlob-assigned sentiments indicate that the proposed model can reach an accuracy of 92%.
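The TF-IDF features named in the abstract can be illustrated with a minimal sketch. The abstract does not state which TF-IDF variant or library the authors used, so the weighting below (term count divided by document length, times the natural log of N/df) is an assumption, and the `tokenize` and `tfidf` helpers and the three sample reviews are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens.

    A simplified stand-in for the preprocessing step; the paper's
    actual cleaning rules are not given in the abstract.
    """
    return re.findall(r"[a-z']+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting (one common textbook variant):
      tf  = occurrences of the term in the document / tokens in the document
      idf = ln(N / df), where df is the number of documents containing the term
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


# Hypothetical sample reviews, not taken from the IMDb dataset.
reviews = [
    "A great movie with a great cast",
    "A dull movie with a weak plot",
    "Great acting saves this movie",
]
weights = tfidf(reviews)
```

In this toy corpus "movie" occurs in every review, so its weight is zero, while a term confined to one review, such as "cast", receives a higher weight than the more widespread "great". Vectors of this kind are what a classifier such as an SVM would take as input.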
