Gandy Lisa M, Ivanitskaya Lana V, Bacon Leeza L, Bizri-Baryak Rodina
Department of Computer Science, College of Sciences and Liberal Arts, Kettering University, Flint, MI, United States.
Department of Health Administration, The College of Health Professions, Central Michigan University, Mt Pleasant, MI, United States.
JMIR Form Res. 2025 Jan 8;9:e57395. doi: 10.2196/57395.
Sentiment analysis is one of the most widely used methods for mining and examining text. Social media researchers need guidance on choosing between manual and automated sentiment analysis methods.
Popular sentiment analysis tools based on natural language processing (NLP; VADER [Valence Aware Dictionary for Sentiment Reasoning], TEXT2DATA [T2D], and Linguistic Inquiry and Word Count [LIWC-22]) and a large language model (ChatGPT 4.0) were compared with manually coded sentiment scores in an analysis of YouTube comments on videos discussing the opioid epidemic. The sentiment analysis methods were also compared on ease of programming, monetary cost, and other practical considerations.
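To make the tool comparison concrete, here is a minimal sketch of how VADER, the only cost-free tool in the comparison, is typically applied in Python via the open-source vaderSentiment package. The example comment is invented, and the snippet illustrates the package's standard usage rather than the authors' actual pipeline.

```python
# Minimal VADER usage sketch (standard vaderSentiment API; example text is invented).
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

comment = "This video broke my heart. We are losing a whole generation to opioids."
scores = analyzer.polarity_scores(comment)

# polarity_scores returns the proportions of negative, neutral, and positive
# tokens plus a normalized 'compound' score in [-1, 1].
print(scores)  # e.g. {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```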
Evaluation methods included descriptive statistics, receiver operating characteristic (ROC) curve analysis, confusion matrices, Cohen κ, accuracy, specificity, precision, sensitivity (recall), the F-score harmonic mean, and the Matthews correlation coefficient (MCC). An inductive, iterative approach to content analysis was used to obtain the manual sentiment codes.
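For readers reproducing this style of evaluation, all of the statistics above (except specificity, which is easily derived from the confusion matrix) have direct scikit-learn equivalents. The sketch below shows the relevant calls on made-up labels and scores; it is a template for the evaluation, not the authors' code or data.

```python
# Evaluation-metric template using scikit-learn; labels and scores are made up.
from sklearn.metrics import (accuracy_score, cohen_kappa_score, confusion_matrix,
                             f1_score, matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 1, 1, 1, 1, 1, 0, 0]   # manual codes (1 = negative, 0 = positive)
y_pred = [1, 1, 1, 0, 1, 1, 1, 0]   # binarized tool output
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.95, 0.55, 0.2]  # continuous scores for ROC

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("kappa:      ", cohen_kappa_score(y_true, y_pred))
print("accuracy:   ", accuracy_score(y_true, y_pred))
print("specificity:", tn / (tn + fp))  # derived from the confusion matrix
print("precision:  ", precision_score(y_true, y_pred))
print("recall:     ", recall_score(y_true, y_pred))
print("F1:         ", f1_score(y_true, y_pred))
print("MCC:        ", matthews_corrcoef(y_true, y_pred))
print("ROC AUC:    ", roc_auc_score(y_true, y_score))
```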
A subset of comments was analyzed by a second coder, producing good agreement between the 2 coders' judgments (κ=0.734). YouTube comments about the opioid crisis were far more often negative (4286/4871, 88%) than positive (79/662, 12%), making it possible to evaluate the performance of sentiment analysis models on an unbalanced dataset. The tone summary measure from LIWC-22 performed better than the other tools at estimating the prevalence of negative versus positive sentiment. According to the ROC curve analysis, VADER was best at classifying manually coded negative comments. A comparison of Cohen κ values indicated that the NLP tools (VADER, followed by LIWC's tone and T2D) showed only fair agreement with manual coding. In contrast, ChatGPT 4.0 had poor agreement and failed to generate binary sentiment scores in 2 out of 3 attempts. Variations in accuracy, specificity, precision, sensitivity, F-score, and MCC did not reveal a single superior model. F-score harmonic means were 0.34-0.38 (SD 0.02) for the NLP tools and very low (0.13) for ChatGPT 4.0. None of the MCCs reached a strong correlation level.
Researchers studying negative emotions, public worries, or dissatisfaction on social media face unique challenges in selecting models suitable for unbalanced datasets. We recommend VADER, the only cost-free tool we evaluated, for its excellent discrimination, which improves further when comments are at least 100 characters long. If estimating the prevalence of negative comments in an unbalanced dataset is important, we recommend the tone summary measure from LIWC-22. Researchers using T2D should be aware that it may score only part of the data and that, compared with the other methods, it can be more time-consuming and cost-prohibitive. A general-purpose large language model, ChatGPT 4.0, has yet to surpass the performance of NLP models, at least on unbalanced datasets in which negative comments are highly prevalent (7:1).
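A minimal sketch of the recommended VADER workflow follows, assuming the conventional ±0.05 compound-score cutoffs from VADER's documentation and the 100-character minimum suggested by these findings. The classify_comment helper and its thresholds are illustrative, not the authors' exact procedure.

```python
# Sketch of the recommended workflow: VADER applied to comments >= 100 characters.
# The +/-0.05 compound cutoffs follow VADER's documented convention; the
# 100-character floor reflects this study's finding, not a library default.
from typing import Optional
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

MIN_CHARS = 100
analyzer = SentimentIntensityAnalyzer()

def classify_comment(comment: str) -> Optional[str]:
    """Label a comment 'negative', 'positive', or 'neutral'; skip short ones."""
    if len(comment) < MIN_CHARS:
        return None  # discrimination was found to improve on longer comments
    compound = analyzer.polarity_scores(comment)["compound"]
    if compound <= -0.05:
        return "negative"
    if compound >= 0.05:
        return "positive"
    return "neutral"
```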