CoViNAR: a context-aware social media dataset for pandemic severity level prediction and analysis.

Author information

Shafiya Soofi, Wani Mudasir Ahmad, Jabin Suraiya, ELAffendi Mohammad

Affiliations

Department of Computer Science, Faculty of Sciences, Jamia Millia Islamia, New Delhi, India.

EIAS Data Science & Blockchain Laboratory, College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia.

Publication information

Front Artif Intell. 2025 Aug 20;8:1623090. doi: 10.3389/frai.2025.1623090. eCollection 2025.

DOI:10.3389/frai.2025.1623090

PMID:40910116

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12405228/

Abstract

INTRODUCTION

The unprecedented COVID-19 pandemic exposed critical weaknesses in global health management, particularly in resource allocation and demand forecasting. This study aims to enhance pandemic preparedness by leveraging real-time social media analysis to detect and monitor resource needs.

METHODS

Over 27.5 million tweets posted between November 2019 and March 2023 were collected with SnScrape using COVID-19-related hashtags. Tweets from April 2021, a peak pandemic period, were selected to create the CoViNAR dataset. BERTopic was used for context-aware filtering, yielding a novel dataset of 14,000 annotated tweets labeled "Need", "Availability", or "Not-relevant". The CoViNAR dataset was then used to train various machine learning classifiers, with experiments covering three context-aware word embedding techniques.
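The classification step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the 3-dimensional vectors are made-up placeholders standing in for context-aware tweet embeddings, and a nearest-centroid rule stands in for the machine learning classifiers, which the abstract does not name. The function names `fit_centroids`, `cosine`, and `predict` are hypothetical.

```python
from collections import defaultdict
from math import sqrt

# The three annotation labels named in the abstract.
LABELS = ("Need", "Availability", "Not-relevant")

def fit_centroids(vectors, labels):
    """Average the embedding vectors of each class (nearest-centroid training)."""
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(vectors, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def predict(centroids, vec):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda lab: cosine(centroids[lab], vec))

# Made-up toy embeddings (assumption): two examples per class.
train_vectors = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.1],   # "Need"
    [0.1, 0.9, 0.0], [0.2, 0.8, 0.1],   # "Availability"
    [0.0, 0.1, 0.9], [0.1, 0.0, 0.8],   # "Not-relevant"
]
train_labels = ["Need", "Need", "Availability", "Availability",
                "Not-relevant", "Not-relevant"]

centroids = fit_centroids(train_vectors, train_labels)
print(predict(centroids, [0.7, 0.3, 0.1]))
```

In a real setting the placeholder vectors would be replaced by embeddings produced by a pretrained language model, and the centroid rule by whichever classifier performs best on held-out data.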

RESULTS

The best classifier, trained with DistilBERT embeddings, achieved 96.42% accuracy, 96.44% precision, 96.42% recall, and a 96.43% F1-score on the test set. Temporal analysis of classified tweets from the US, UK, and India between November 2019 and March 2023 revealed a strong correlation between "Need/Availability" tweet counts and COVID-19 case surges.
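For readers who want to see how the four reported metrics relate, the following is a minimal sketch of their computation for a three-class problem. The abstract does not state the averaging scheme; support-weighted averaging is assumed here. Under that scheme recall always equals accuracy, which is consistent with (though not proof of) the identical 96.42% figures reported. The function name `weighted_metrics` and the toy labels are illustrative only.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        prec = tp / predicted if predicted else 0.0
        rec = tp / count
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = count / n  # class share of the test set
        precision += weight * prec
        recall += weight * rec
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up toy labels (assumption), with one "Need" tweet misclassified.
y_true = ["Need", "Need", "Availability", "Availability",
          "Not-relevant", "Not-relevant"]
y_pred = ["Need", "Availability", "Availability", "Availability",
          "Not-relevant", "Not-relevant"]

metrics = weighted_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```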

DISCUSSION

The results demonstrate the effectiveness of the proposed approach in capturing real-time indicators of resource shortages and availability. The strong correlation with case surges underscores its potential as a proactive tool for public health authorities, enabling improved resource allocation and early crisis intervention during pandemics.
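The relationship between tweet counts and case surges can be quantified in several ways; the abstract does not say which measure was used. As a hedged illustration, the sketch below computes a Pearson correlation coefficient between two aligned weekly series. The choice of Pearson, the function name `pearson`, and all numbers are assumptions made for the example, not data from the study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Made-up weekly counts (assumption): "Need/Availability" tweets and new cases.
weekly_tweets = [120, 340, 910, 560]
weekly_cases = [1000, 2900, 8100, 4700]

r = pearson(weekly_tweets, weekly_cases)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 indicates that the two series rise and fall together; it does not by itself show that tweet volume leads case counts, which would require a lagged analysis.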
