Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set.

Affiliation

Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.

Publication information

J Med Internet Res. 2021 Jan 22;23(1):e25314. doi: 10.2196/25314.

DOI:10.2196/25314

PMID:33449904

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7834613/

Abstract

BACKGROUND

In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone.

OBJECTIVE

The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the Centers for Disease Control and Prevention.

METHODS

Beginning January 23, 2020, we collected English tweets from the Twitter Streaming application programming interface that mention keywords related to COVID-19. We applied handwritten regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out "reported speech" (eg, quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on bidirectional encoder representations from transformers (BERT). Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020.
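The first two filtering steps described above (regular-expression matching, then removal of reported speech) can be sketched as follows. This is a minimal illustration only: the abstract does not list the study's handwritten expressions or its reported-speech rules, so the patterns in `EXPOSURE_PATTERNS` and `REPORTED_SPEECH`, and the function names, are hypothetical stand-ins rather than the authors' implementation.

```python
import re

# Hypothetical COVID-19 term; the study's actual keyword list is not given in the abstract.
COVID = r"\b(?:covid(?:[- ]?19)?|coronavirus|corona)\b"

# Hypothetical first-person patterns suggesting the user may have been exposed.
EXPOSURE_PATTERNS = [
    re.compile(
        r"\bi\s+(?:think|believe|might|may|probably)\b.{0,40}"
        r"\b(?:have|had|got|caught)\b.{0,20}" + COVID,
        re.IGNORECASE,
    ),
    re.compile(
        r"\b(?:i|we)\s+(?:was|were|have\s+been)\s+exposed\s+to\b.{0,30}" + COVID,
        re.IGNORECASE,
    ),
]

# Assumed heuristic for "reported speech": retweets, tweets that open with a
# quotation mark, and tweets that contain a link (typical of news headlines).
REPORTED_SPEECH = re.compile(r'^\s*(?:RT\s+@|["\u201c])|https?://', re.IGNORECASE)


def matches_exposure(text: str) -> bool:
    """Return True if any exposure pattern matches the tweet text."""
    return any(pattern.search(text) for pattern in EXPOSURE_PATTERNS)


def is_reported_speech(text: str) -> bool:
    """Return True if the tweet looks like a quotation, retweet, or headline."""
    return REPORTED_SPEECH.search(text) is not None


def candidate_tweets(tweets: list[str]) -> list[str]:
    """Keep tweets that match an exposure pattern and are not reported speech."""
    return [t for t in tweets if matches_exposure(t) and not is_reported_speech(t)]


if __name__ == "__main__":
    # Made-up example tweets, not drawn from the study's data.
    sample = [
        "I think I have covid19, lost my sense of smell",
        '"I think I have COVID-19," says mayor https://example.com/news',
        "COVID-19 cases rise in Pennsylvania",
        "I was exposed to someone with COVID-19 at work",
    ]
    for tweet in candidate_tweets(sample):
        print(tweet)
```

In the pipeline described here, tweets that survive this kind of rule-based filtering are candidates only; the BERT-based classifier makes the final decision on whether a tweet self-reports a potential case.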

RESULTS

Interannotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen κ). A deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases of COVID-19. Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have US state-level geolocations.
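For readers unfamiliar with the two metrics reported above, the sketch below shows how Cohen κ and the F-score are conventionally computed. It uses the standard textbook definitions, not code from the study; the function names and the toy annotator labels are illustrative assumptions.

```python
from collections import Counter


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    classes = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in classes) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def f_score(precision: float, recall: float) -> float:
    """F1-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy labels (1 = self-report, 0 = not); not the study's annotations.
    annotator_1 = [1, 1, 0, 0]
    annotator_2 = [1, 0, 0, 0]
    print(cohen_kappa(annotator_1, annotator_2))
    # With precision and recall both 0.76, the harmonic mean is also 0.76.
    print(f_score(0.76, 0.76))
```

Because the F-score is a harmonic mean, it equals precision and recall whenever the two are equal, which is consistent with the reported values of 0.76 for all three.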

CONCLUSIONS

We have made the 13,714 tweets identified in this study, along with each tweet's time stamp and US state-level geolocation, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
