Agrawal Renuka, Majumder Mehuli, Yadav Ishita, Taneja Nandini, Hamdare Safa, Hemnani Preeti
Symbiosis Institute of Technology - Pune Campus, Symbiosis International (Deemed University), Pune, India.
Nottingham Trent University - Clifton Campus, Nottingham, UK.
MethodsX. 2025 May 30;14:103407. doi: 10.1016/j.mex.2025.103407. eCollection 2025 Jun.
This study investigates public sentiment toward COVID-19 vaccinations by analyzing Twitter data using advanced machine learning (ML) and natural language processing (NLP) techniques. Recognizing social media as a valuable source for gauging public opinion during health crises, the research aims to inform policies on content moderation and misinformation control.
• Comparative analysis of embedding techniques and ML models: The study evaluates two embedding techniques (TF-IDF and Word2Vec) across five ML models: LinearSVC, Random Forest, Gradient Boosting Machine (GBM), XGBoost, and AdaBoost.
• The models were tested using two training-testing splits (70-30 and 80-20) to assess their performance on noisy, unlabeled, and imbalanced sentiment data.
• Utilization of DistilBERT for pseudo-labeling: To enhance labeling accuracy, DistilBERT was employed for pseudo-labeling, capturing semantic nuances often missed by traditional ML techniques. This approach enabled more effective sentiment classification of tweets.
The findings underscore the effectiveness of automated annotation, hybrid modeling, and embedding strategies in analyzing unstructured social media data. Such approaches provide valuable insights for public health applications, particularly in understanding vaccine hesitancy and shaping communication strategies. The study highlights the potential of integrating advanced NLP techniques to better comprehend and respond to public sentiments during pandemics or similar emergencies.
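To make the TF-IDF representation and the two training-testing splits concrete, the following is a minimal, standard-library-only sketch. It is not the authors' pipeline: the example tweets, the whitespace tokenizer, the smoothed IDF variant, the L2 normalisation, and the helper names `tfidf_vectors` and `split_tweets` are all assumptions made for illustration, since the abstract does not state these details.

```python
import math
import random
from collections import Counter

# Hypothetical example tweets (not taken from the study's dataset).
tweets = [
    "vaccine appointment booked today",
    "worried about vaccine side effects",
    "grateful for the vaccine rollout",
    "vaccine queue was long but fine",
    "not sure the vaccine is safe",
    "second vaccine dose done",
    "vaccine news is confusing",
    "my arm hurts after the vaccine",
    "vaccine centre staff were kind",
    "still waiting for a vaccine slot",
]


def tfidf_vectors(docs):
    """Return one sparse, L2-normalised TF-IDF dict per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, one common variant (an assumption, not stated in the abstract).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vec = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def split_tweets(items, test_fraction, seed=0):
    """Shuffle a copy of items and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


vectors = tfidf_vectors(tweets)

# The two splits named in the abstract: 70-30 and 80-20.
for test_fraction in (0.3, 0.2):
    train, test = split_tweets(tweets, test_fraction)
    print(f"test fraction {test_fraction}: {len(train)} train, {len(test)} test")
```

In this toy corpus the word "vaccine" occurs in every tweet, so it receives the lowest IDF and a smaller weight than rarer words such as "booked". That is the property that lets a downstream classifier focus on the terms that distinguish one tweet from another. A real experiment would typically rely on a library vectorizer and a stratified split to cope with the class imbalance the abstract mentions.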