Denecke K, Krieck M, Otrusina L, Smrz P, Dolog P, Nejdl W, Velasco E
Innovation Center Computer Assisted Surgery, Leipzig, Germany.
Methods Inf Med. 2013;52(4):326-39. doi: 10.3414/ME12-02-0010. Epub 2013 Jul 23.
Detecting hints of public health threats as early as possible is crucial to protecting the population from harm. However, many disease surveillance strategies rely on data whose collection requires explicit reporting (data transmitted from hospitals, laboratories, or physicians). Collecting these reports takes time, which lengthens the reaction time. Moreover, context information on individual cases is often lost in the collection process. This paper describes a system that tries to address these limitations by processing social media to identify information on public health threats. The primary objective is to study the usefulness of this approach for supporting the monitoring of a population's health status.
The developed system works in three main steps. First, data from Twitter, blogs, and forums, as well as from TV and radio channels, are continuously collected and filtered by means of keyword lists. Second, sentences of relevant texts are classified as relevant or irrelevant using a binary classifier based on support vector machines. Third, the relevant sentences are analyzed further with statistical methods known from biosurveillance, and signals are generated automatically when unexpected behavior is detected. From the generated signals, a subset is selected for presentation to a user by matching against user queries or profiles. In a set of evaluation experiments, public health experts assessed the generated signals with respect to correctness and relevance. In particular, they assessed how many relevant and irrelevant signals were generated during a specific time period.
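The filtering and signal-generation steps can be illustrated with a minimal sketch. It is not the system's implementation: the keyword list, the function names, and the detection rule (a count exceeding the mean of a trailing baseline window by a multiple of its standard deviation) are assumptions chosen for illustration, since the abstract does not state which keyword lists or statistical methods are used. The support vector machine classification step is omitted.

```python
from statistics import mean, stdev

# Hypothetical keyword list; the actual lists used by the system are not given.
KEYWORDS = {"measles", "norovirus", "outbreak"}


def keyword_filter(texts, keywords=KEYWORDS):
    """Keep only texts that mention at least one keyword (case-insensitive)."""
    return [t for t in texts if any(k in t.lower() for k in keywords)]


def detect_signals(daily_counts, baseline_days=7, threshold=3.0, min_sigma=0.5):
    """Return indices of days whose count is unexpectedly high.

    A day raises a signal when its count exceeds the mean of the preceding
    `baseline_days` days by more than `threshold` standard deviations.
    This rule is an assumed stand-in for the biosurveillance statistics
    mentioned in the text.
    """
    signals = []
    for day in range(baseline_days, len(daily_counts)):
        baseline = daily_counts[day - baseline_days:day]
        mu = mean(baseline)
        # Floor the deviation so a perfectly flat baseline does not
        # turn every small fluctuation into a signal.
        sigma = max(stdev(baseline), min_sigma)
        if daily_counts[day] > mu + threshold * sigma:
            signals.append(day)
    return signals
```

For example, with daily counts of relevant sentences `[2, 3, 2, 3, 2, 3, 2, 15, 3]`, only the day with 15 mentions stands out against its seven-day baseline.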
The experiments show that the system provides information on health events identified in social media. Signals are mainly generated from Twitter messages posted by news agencies. Personal tweets, i.e. tweets from persons observing some symptoms, play only a minor role in signal generation because the volume of relevant messages is limited. Relevant signals referring to real-world outbreaks were generated by the system and monitored by epidemiologists, for example during the European football championship. However, the share of relevant signals among the generated signals is still very small: across the different experiments, between 5% and 20% of signals were regarded as "relevant" by the users. Vaccination or education campaigns communicated via Twitter, as well as the use of medical terms in contexts other than outbreak reporting, led to the generation of irrelevant signals.
Aggregating information into signals reduces the monitoring effort compared with other existing systems. Contrary to expectations, only a few messages are of a personal nature, reporting on personal symptoms. Instead, media reports are distributed over social media channels. Despite the high percentage of irrelevant signals generated by the system, the users reported that monitoring aggregated information in the form of signals is less demanding than manually monitoring huge social media data streams. Developing strategies for reducing false alarms remains future work.