Broniatowski David A, Paul Michael J, Dredze Mark
Department of Engineering Management and Systems Engineering, The George Washington University, Washington, District of Columbia, United States of America; Center for Advanced Modeling in the Social, Behavioral, and Health Sciences, Department of Emergency Medicine, School of Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America.
Department of Computer Science and Center for Language and Speech Processing, Johns Hopkins University, Baltimore, Maryland, United States of America.
PLoS One. 2013 Dec 9;8(12):e83672. doi: 10.1371/journal.pone.0083672. eCollection 2013.
Social media have been proposed as a data source for influenza surveillance because they offer potential real-time access to millions of short, geographically localized messages containing information about personal well-being. However, the accuracy of social media surveillance systems declines with media attention, because media attention increases "chatter" (messages that are about influenza but do not pertain to an actual infection), which masks signs of true influenza prevalence. This paper summarizes our recently developed influenza infection detection algorithm, which automatically distinguishes relevant tweets from other chatter, and describes our current influenza surveillance system, which was actively deployed during the full 2012-2013 influenza season. Our objective was to analyze the performance of this system during the 2012-2013 influenza season at multiple levels of geographic granularity, unlike past studies that focused on national or regional surveillance. Our system's influenza prevalence estimates were strongly correlated with surveillance data from the Centers for Disease Control and Prevention for the United States (r = 0.93, p < 0.001) as well as surveillance data from the Department of Health and Mental Hygiene of New York City (r = 0.88, p < 0.001). Our system detected the weekly change in direction (increasing or decreasing) of influenza prevalence with 85% accuracy, a nearly twofold increase over a simpler model, demonstrating the utility of explicitly distinguishing infection tweets from other chatter.
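The two evaluation measures reported in the abstract, the correlation r between the system's weekly estimates and official surveillance data and the accuracy of detecting the weekly direction of change, can be illustrated with a minimal sketch. The function names `pearson_r` and `direction_accuracy`, the treatment of unchanged weeks as their own direction, and the sample numbers are all assumptions made for illustration; they are not taken from the paper and may differ from how the authors computed their results.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def _sign(delta):
    """Return 1 for an increase, -1 for a decrease, 0 for no change."""
    return (delta > 0) - (delta < 0)


def direction_accuracy(estimates, reference):
    """Fraction of week-to-week changes whose direction matches.

    Assumption: a week with no change counts as its own direction (0),
    so it only matches another unchanged week.
    """
    if len(estimates) != len(reference) or len(estimates) < 2:
        raise ValueError("need two series of equal length >= 2")
    matches = 0
    for i in range(1, len(estimates)):
        est_dir = _sign(estimates[i] - estimates[i - 1])
        ref_dir = _sign(reference[i] - reference[i - 1])
        if est_dir == ref_dir:
            matches += 1
    return matches / (len(estimates) - 1)


if __name__ == "__main__":
    # Hypothetical weekly values, invented for this example only.
    system_estimates = [1.0, 2.0, 3.5, 3.0, 2.0, 2.5]
    official_rates = [1.2, 2.1, 3.0, 3.2, 2.2, 2.0]
    print("r =", round(pearson_r(system_estimates, official_rates), 3))
    print("direction accuracy =",
          direction_accuracy(system_estimates, official_rates))
```

The two measures are kept separate because they capture different things: a series can track the overall shape of the season closely (high r) while still missing individual week-to-week turns, which is what the direction measure is meant to expose.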