Bishal Mahathir Mohammad, Chowdory Md Rakibul Hassan, Das Anik, Kabir Muhammad Ashad
Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chattogram, 4349, Bangladesh.
Department of Computer Science, St. Francis Xavier University, Antigonish, B2G 2W5, NS, Canada.
Heliyon. 2024 Jul 8;10(14):e34103. doi: 10.1016/j.heliyon.2024.e34103. eCollection 2024 Jul 30.
The COVID-19 pandemic has sparked widespread health-related discussion on social media platforms such as Twitter (now named 'X'). However, the lack of labeled Twitter data makes theme-based classification and tweet aggregation difficult. To address this gap, we developed a machine learning-based web application that automatically classifies COVID-19 discourse into five categories: health risks, prevention, symptoms, transmission, and treatment. We collected 6,667 COVID-19-related tweets using the Twitter API, labeled them, and applied several feature extraction methods. We then compared the classification performance of seven classical machine learning algorithms (Decision Tree, Random Forest, Stochastic Gradient Descent, AdaBoost, K-Nearest Neighbor, Logistic Regression, and Linear SVC) and four deep learning techniques (LSTM, CNN, RNN, and BERT). The CNN achieved the highest precision (90.41%), recall (90.4%), F1 score (90.4%), and accuracy (90.4%). Among the classical machine learning approaches, Linear SVC had the highest precision (85.71%), recall (86.94%), and F1 score (86.13%). Our study advances health-related data analysis and classification, and offers a publicly accessible web-based tool for public health researchers and practitioners. This tool could help address public health challenges and raise awareness during pandemics. The dataset and application are accessible at https://github.com/Bishal16/COVID19-Health-Related-Data-Classification-Website.
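The four reported metrics can be illustrated with a minimal sketch for a five-class problem. The abstract does not state how per-class scores were averaged, so macro-averaging is an assumption here, and the function name `classification_metrics` and the toy labels are hypothetical rather than taken from the authors' code.

```python
from collections import Counter

# The five categories named in the abstract.
LABELS = ["health risks", "prevention", "symptoms", "transmission", "treatment"]


def classification_metrics(y_true, y_pred, labels=LABELS):
    """Return accuracy plus macro-averaged precision, recall, and F1.

    Macro-averaging is an assumption; the abstract does not say which
    averaging scheme the reported figures use.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy data for illustration only; these are not tweets from the dataset.
y_true = ["symptoms", "symptoms", "prevention", "treatment", "transmission", "health risks"]
y_pred = ["symptoms", "prevention", "prevention", "treatment", "transmission", "health risks"]
metrics = classification_metrics(y_true, y_pred)
```

With one of six toy predictions wrong, accuracy is 5/6, while macro precision and recall each average to 0.9 because a single class is penalized in each.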