Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set.

Affiliation

Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.

Publication information

J Med Internet Res. 2021 Jan 22;23(1):e25314. doi: 10.2196/25314.


DOI:10.2196/25314
PMID:33449904
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7834613/
Abstract

BACKGROUND: In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone.

OBJECTIVE: The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the Centers for Disease Control and Prevention.

METHODS: Beginning January 23, 2020, we collected English tweets from the Twitter Streaming application programming interface that mention keywords related to COVID-19. We applied handwritten regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out "reported speech" (eg, quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on bidirectional encoder representations from transformers (BERT). Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020.

RESULTS: Interannotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen κ). A deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases of COVID-19. Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have US state-level geolocations.

CONCLUSIONS: We have made the 13,714 tweets identified in this study, along with each tweet's time stamp and US state-level geolocation, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
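
The first two filtering steps in METHODS (regular-expression matching, then removal of reported speech) might look roughly like the Python sketch below. The abstract does not give the study's handwritten expressions, so the patterns, the reported-speech heuristics, and the names `EXPOSURE_PATTERNS`, `REPORTED_SPEECH`, and `candidate_tweets` are all hypothetical stand-ins rather than the authors' rules.

```python
import re

# Hypothetical exposure patterns; the study's handwritten regular
# expressions are not listed in the abstract.
EXPOSURE_PATTERNS = [
    re.compile(
        r"\bi\s+(?:think\s+i\s+)?(?:have|had|got|caught)\s+(?:the\s+)?"
        r"(?:covid(?:-?19)?|coronavirus)\b",
        re.IGNORECASE,
    ),
    re.compile(
        r"\bi\s+(?:was|got|have\s+been)\s+exposed\s+to\s+(?:the\s+)?"
        r"(?:covid(?:-?19)?|coronavirus)\b",
        re.IGNORECASE,
    ),
]

# Assumed, crude signals of reported speech: retweets, quoted spans, links.
REPORTED_SPEECH = re.compile(r'^RT\s+@|["“”].+["“”]|https?://\S+')


def matches_exposure(text: str) -> bool:
    """True if any exposure pattern occurs in the tweet text."""
    return any(p.search(text) for p in EXPOSURE_PATTERNS)


def is_reported_speech(text: str) -> bool:
    """True if the tweet looks like a quotation, retweet, or shared headline."""
    return REPORTED_SPEECH.search(text) is not None


def candidate_tweets(tweets: list[str]) -> list[str]:
    """Keep tweets that match an exposure pattern and are not reported speech."""
    return [t for t in tweets if matches_exposure(t) and not is_reported_speech(t)]


sample_tweets = [
    "I think I have covid, my fever won't go away",
    'RT @news: "I have coronavirus," says senator',
    "Breaking: cases rise in Ohio https://example.com",
    "I was exposed to COVID-19 at work last week",
]
print(candidate_tweets(sample_tweets))
```

In the study, tweets that survive these filters are then passed to the annotation step and the BERT-based classifier; this sketch does not cover either.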

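The two evaluation figures in RESULTS follow standard definitions: the F-score is the harmonic mean of precision and recall, and Cohen κ corrects observed agreement for the agreement expected by chance. A minimal Python sketch of both is given below. The label lists are invented toy data, not the study's annotations, and the function names are illustrative.

```python
from collections import Counter


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Precision and recall as reported in the abstract.
print(round(f_score(0.76, 0.76), 2))  # 0.76

# Toy annotations (1 = self-report, 0 = not); invented for illustration.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 0, 1]
print(cohen_kappa(annotator_1, annotator_2))  # 0.5
```

When precision and recall are equal, as here, the F-score equals that shared value, which is why all three reported numbers are 0.76.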
Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b5cb/7834613/ac17d234a217/jmir_v23i1e25314_fig1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b5cb/7834613/0c34d15456c3/jmir_v23i1e25314_fig2.jpg

Similar articles

[1]
Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set.

J Med Internet Res. 2021-1-22

[2]
A natural language processing pipeline to advance the use of Twitter data for digital epidemiology of adverse pregnancy outcomes.

J Biomed Inform. 2020

[3]
Automatically Identifying Twitter Users for Interventions to Support Dementia Family Caregivers: Annotated Data Set and Benchmark Classification Models.

JMIR Aging. 2022-9-16

[4]
Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid.

J Med Internet Res. 2021-5-3

[5]
Toward Using Twitter for PrEP-Related Interventions: An Automated Natural Language Processing Pipeline for Identifying Gay or Bisexual Men in the United States.

JMIR Public Health Surveill. 2022-4-25

[6]
ReportAGE: Automatically extracting the exact age of Twitter users based on self-reports in tweets.

PLoS One. 2022

[7]
Social media mining for birth defects research: A rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter.

J Biomed Inform. 2018-10-4

[8]
Temporal and Location Variations, and Link Categories for the Dissemination of COVID-19-Related Information on Twitter During the SARS-CoV-2 Outbreak in Europe: Infoveillance Study.

J Med Internet Res. 2020-8-28

[9]
Identifying Potential Lyme Disease Cases Using Self-Reported Worldwide Tweets: Deep Learning Modeling Approach Enhanced With Sentimental Words Through Emojis.

J Med Internet Res. 2023-10-16

[10]
Uncovering the Reasons Behind COVID-19 Vaccine Hesitancy in Serbia: Sentiment-Based Topic Modeling.

J Med Internet Res. 2022-11-17

Cited by

[1]
Exploring the potential of online social listening for noncommunicable disease monitoring.

PeerJ. 2025-5-20

[2]
Leveraging Large Language Models for Infectious Disease Surveillance-Using a Web Service for Monitoring COVID-19 Patterns From Self-Reporting Tweets: Content Analysis.

J Med Internet Res. 2025-2-20

[3]
When Infodemic Meets Epidemic: Systematic Literature Review.

JMIR Public Health Surveill. 2025-2-3

[4]
Internet-based surveillance to track trends in seasonal allergies across the United States.

PNAS Nexus. 2024-10-29

[5]
A Novel Approach for the Early Detection of Medical Resource Demand Surges During Health Care Emergencies: Infodemiology Study of Tweets.

JMIR Form Res. 2024-1-29

[6]
Development of an early alert model for pandemic situations in Germany.

Sci Rep. 2023-11-27

[7]
Understanding Public Perceptions and Discussions on Opioids Through Twitter: Cross-Sectional Infodemiology Study.

J Med Internet Res. 2023-10-31

[8]
Using Social Media to Help Understand Patient-Reported Health Outcomes of Post-COVID-19 Condition: Natural Language Processing Approach.

J Med Internet Res. 2023-9-19

[9]
Implicit Incentives Among Reddit Users to Prioritize Attention Over Privacy and Reveal Their Faces When Discussing Direct-to-Consumer Genetic Test Results: Topic and Attention Analysis.

JMIR Infodemiology. 2022-8-3

[10]
Dynamics of social media behavior before and after SARS-CoV-2 infection.

Front Public Health. 2022
