Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.
J Med Internet Res. 2021 Jan 22;23(1):e25314. doi: 10.2196/25314.
BACKGROUND: In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone.
OBJECTIVE: The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the Centers for Disease Control and Prevention.
METHODS: Beginning January 23, 2020, we collected English tweets from the Twitter Streaming application programming interface that mention keywords related to COVID-19. We applied handwritten regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out "reported speech" (eg, quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on bidirectional encoder representations from transformers (BERT). Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020.
RESULTS: Interannotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen κ). A deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases of COVID-19. Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have US state-level geolocations.
CONCLUSIONS: We have made the 13,714 tweets identified in this study, along with each tweet's time stamp and US state-level geolocation, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
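The rule-based stage described in the methods (regular-expression matching followed by removal of reported speech) can be illustrated with a minimal Python sketch. The abstract does not list the study's handwritten expressions or its reported-speech rules, so the patterns below, and the helper names `matches_exposure`, `is_reported_speech`, `candidate_tweets`, and `f_score`, are hypothetical stand-ins rather than the authors' implementation. The last helper only checks that the reported precision and recall are consistent with the reported F-score, assuming the usual harmonic-mean definition.

```python
import re

# Hypothetical patterns for illustration only; the study's handwritten
# regular expressions are not given in the abstract.
EXPOSURE_PATTERNS = [
    re.compile(
        r"\bi\s+(?:think|believe)\s+i\s+(?:have|had|caught)\s+"
        r"(?:the\s+)?(?:covid|coronavirus)",
        re.IGNORECASE,
    ),
    re.compile(
        r"\bi\s+(?:was|have\s+been|got)\s+exposed\s+to\s+"
        r"(?:the\s+)?(?:covid|coronavirus)",
        re.IGNORECASE,
    ),
]

# Hypothetical cues for "reported speech": quotation marks, retweet
# prefixes, and links (a rough proxy for shared news headlines).
REPORTED_SPEECH = re.compile(r"[\"“”]|^RT\s+@|https?://", re.IGNORECASE)


def matches_exposure(text: str) -> bool:
    """True if any illustrative exposure pattern matches the tweet text."""
    return any(p.search(text) for p in EXPOSURE_PATTERNS)


def is_reported_speech(text: str) -> bool:
    """True if the tweet text contains an illustrative reported-speech cue."""
    return REPORTED_SPEECH.search(text) is not None


def candidate_tweets(tweets: list[str]) -> list[str]:
    """Keep tweets that match a pattern and are not reported speech."""
    return [
        t for t in tweets
        if matches_exposure(t) and not is_reported_speech(t)
    ]


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sample = [
        "I think I have covid, lost my sense of smell",
        'Man says "I think I have coronavirus" after trip',
        "I was exposed to COVID at work last week",
        "Stay safe everyone",
    ]
    for tweet in candidate_tweets(sample):
        print(tweet)
    print(round(f_score(0.76, 0.76), 2))
```

In the study, tweets surviving this kind of filter were then annotated and passed to a BERT-based classifier; that stage is not sketched here because the abstract gives no model configuration.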