Shafiya Soofi, Wani Mudasir Ahmad, Jabin Suraiya, ELAffendi Mohammad
Department of Computer Science, Faculty of Sciences, Jamia Millia Islamia, New Delhi, India.
EIAS Data Science & Blockchain Laboratory, College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia.
Front Artif Intell. 2025 Aug 20;8:1623090. doi: 10.3389/frai.2025.1623090. eCollection 2025.
The unprecedented COVID-19 pandemic exposed critical weaknesses in global health management, particularly in resource allocation and demand forecasting. This study aims to enhance pandemic preparedness by leveraging real-time social media analysis to detect and monitor resource needs.
Using SnScrape, over 27.5 million tweets posted between November 2019 and March 2023 were collected via COVID-19-related hashtags. Tweets from April 2021, a peak pandemic period, were selected to create the CoViNAR dataset. BERTopic enabled context-aware filtering, resulting in a novel dataset of 14,000 annotated tweets categorized as "Need", "Availability", or "Not-relevant". The CoViNAR dataset was used to train various machine learning classifiers, with experiments conducted using three context-aware word embedding techniques.
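The selection step described above (restricting the corpus to one month and checking the three annotation classes) can be illustrated with a minimal sketch. The record layout (`date`, `text`, `label` keys), the helper names, and the sample tweets are assumptions made for illustration; they are not taken from the CoViNAR dataset or the authors' code, and the BERTopic filtering stage is not reproduced here.

```python
from collections import Counter
from datetime import date

# The three annotation classes named in the abstract.
LABELS = {"Need", "Availability", "Not-relevant"}

def select_month(tweets, year, month):
    """Keep tweets whose 'date' falls in the given calendar month."""
    return [t for t in tweets
            if t["date"].year == year and t["date"].month == month]

def label_counts(tweets):
    """Count annotated tweets per class, rejecting unknown labels."""
    for t in tweets:
        if t["label"] not in LABELS:
            raise ValueError(f"unexpected label: {t['label']!r}")
    return Counter(t["label"] for t in tweets)

# Made-up records for illustration only (hypothetical, not CoViNAR data).
sample = [
    {"date": date(2021, 4, 20), "text": "Urgently need an oxygen cylinder", "label": "Need"},
    {"date": date(2021, 4, 22), "text": "Beds available at the city hospital", "label": "Availability"},
    {"date": date(2021, 4, 25), "text": "Stay safe everyone", "label": "Not-relevant"},
    {"date": date(2020, 3, 1), "text": "Need masks", "label": "Need"},
]

april = select_month(sample, 2021, 4)   # drops the March 2020 record
counts = label_counts(april)            # per-class counts for the selected month
```

Checking the per-class counts before training is a cheap way to spot class imbalance or annotation typos in a labeled subset.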
The best classifier, trained with DistilBERT embeddings, achieved 96.42% accuracy, 96.44% precision, 96.42% recall, and a 96.43% F1-score on the test set. Temporal analysis of classified tweets from the US, UK, and India between November 2019 and March 2023 revealed a strong correlation between "Need/Availability" tweet counts and COVID-19 case surges.
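As a rough illustration of how figures of this kind are computed, the sketch below derives accuracy, precision, recall, and F1 for a multi-class classifier, and a Pearson coefficient between two time series. Two assumptions are made here: the abstract does not state the averaging scheme or the correlation measure, so support-weighted averaging and Pearson's r are used as common choices. The function names and toy inputs are hypothetical and do not reproduce the study's data or results.

```python
from math import sqrt

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 (assumed averaging)."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    precision = recall = f1 = 0.0
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = (tp + fn) / n          # class support as a fraction of all samples
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return accuracy, precision, recall, f1

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)

# Toy inputs for illustration only (hypothetical, not the study's data).
y_true = ["Need", "Need", "Availability", "Not-relevant"]
y_pred = ["Need", "Availability", "Availability", "Not-relevant"]
acc, prec, rec, f1 = weighted_metrics(y_true, y_pred)

weekly_tweets = [1, 2, 3]   # e.g. weekly "Need/Availability" tweet counts
weekly_cases = [2, 4, 6]    # e.g. weekly reported cases
r = pearson(weekly_tweets, weekly_cases)
```

With support weighting, recall always equals accuracy, which is consistent with the identical accuracy and recall values reported above, although that alone does not confirm which averaging the authors used.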
The results demonstrate the effectiveness of the proposed approach in capturing real-time indicators of resource shortages and availability. The strong correlation with case surges underscores its potential as a proactive tool for public health authorities, enabling improved resource allocation and early crisis intervention during pandemics.