Pervaiz Fahad, Pervaiz Mansoor, Abdur Rehman Nabeel, Saif Umar
School of Science and Engineering, Computer Science Department, Lahore University of Management Sciences, Lahore, Pakistan.
J Med Internet Res. 2012 Oct 4;14(5):e125. doi: 10.2196/jmir.2102.
The Google Flu Trends service was launched in 2008 to track changes in the volume of online search queries related to flu-like symptoms. Over the last few years, the trend data produced by this service has shown a consistent relationship with the actual number of flu reports collected by the US Centers for Disease Control and Prevention (CDC), often identifying increases in flu cases weeks in advance of CDC records. However, contrary to popular belief, Google Flu Trends is not an early epidemic detection system. Instead, it is designed as a baseline indicator of the trend, or changes, in the number of disease cases.
The objective of this study was to evaluate whether these trends can be used as the basis for an early warning system for epidemics.
We present the first detailed algorithmic analysis of how Google Flu Trends can serve as the basis for a fully automated system that warns of epidemics earlier than the methods used by the CDC. Building on this analysis, we present a novel early epidemic detection system, FluBreaks (dritte.org/flubreaks), which uses Google Flu Trends data. We compared the accuracy and practicality of three types of algorithms: normal distribution algorithms, Poisson distribution algorithms, and negative binomial distribution algorithms. We explored the relative merits of these methods and related our findings to changes in Internet penetration and population size for the regions for which Google Flu Trends provides data.
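To make the idea of a Poisson distribution algorithm concrete, the following is a minimal sketch of one way such a detector could work; it is not the FluBreaks implementation. It assumes integer weekly counts, estimates the Poisson mean from a sliding window of preceding weeks, and raises an alarm when the current count would be improbably high under that model. The function names, the window length `baseline_weeks`, the significance level `alpha`, and the sample data are all illustrative assumptions.

```python
import math
from statistics import mean


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam) by summing the pmf below k."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def poisson_alarms(counts, baseline_weeks=4, alpha=0.01):
    """Flag week t when its count is improbably high under a Poisson model
    whose mean is the average of the preceding `baseline_weeks` counts.

    Hypothetical helper: window length and alpha are illustrative choices.
    """
    alarms = []
    for t in range(baseline_weeks, len(counts)):
        lam = mean(counts[t - baseline_weeks:t])
        if lam > 0 and poisson_sf(counts[t], lam) < alpha:
            alarms.append(t)
    return alarms


# Made-up weekly counts with a sharp rise in weeks 6 and 7.
weekly = [10, 12, 9, 11, 10, 13, 30, 45, 20, 12]
print(poisson_alarms(weekly))  # [6, 7]
```

A negative binomial variant would follow the same pattern but add a dispersion parameter, which allows the variance to exceed the mean when the counts are noisier than a Poisson model permits.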
Across our performance metrics of percentage of true positives (RTP), percentage of false positives (RFP), percentage overlap (OT), and percentage of early alarms (EA), Poisson- and negative binomial-based algorithms performed better than normal distribution-based algorithms on every metric except RFP. Poisson-based algorithms had average values of 99%, 28%, 71%, and 76% for RTP, RFP, OT, and EA, respectively, whereas negative binomial-based algorithms had average values of 97.8%, 17.8%, 60%, and 55%, respectively. EA was also affected by the region's population size. Regions with larger populations (regions 4 and 6) had higher EA values than region 10 (which had the smallest population) for both negative binomial- and Poisson-based algorithms. The difference was on average 12.5% for negative binomial-based algorithms and 13.5% for Poisson-based algorithms.
We present the first detailed comparative analysis of popular early epidemic detection algorithms on Google Flu Trends data. We note that realizing the early warning potential of these data requires moving beyond the normal distribution approaches traditionally employed by the CDC, which are based on the cumulative sum and historical limits methods, to negative binomial- and Poisson-based algorithms that can deal with potentially noisy search query data from regions with varying population sizes and Internet penetration. Based on our work, we have developed FluBreaks, an early warning system for flu epidemics using Google Flu Trends.
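For contrast with the count-based models, the sketch below shows a generic one-sided cumulative sum detector under a normal assumption. It is a textbook-style illustration, not the CDC's exact variant or the code used in this study. Each week is standardized against the mean and standard deviation of a sliding baseline window, and an alarm is raised when the accumulated excess passes a threshold. The names `cusum_alarms`, `baseline_weeks`, `k` (allowance), and `h` (threshold), their default values, and the sample data are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def cusum_alarms(values, baseline_weeks=7, k=0.5, h=2.0):
    """One-sided CUSUM on z-scores computed against a sliding baseline.

    Hypothetical helper: k is the allowance subtracted each week and h is
    the alarm threshold, both in units of baseline standard deviations.
    """
    s = 0.0
    alarms = []
    for t in range(baseline_weeks, len(values)):
        window = values[t - baseline_weeks:t]
        mu, sigma = mean(window), stdev(window)
        if sigma == 0:
            continue  # a flat baseline gives no scale for standardizing
        z = (values[t] - mu) / sigma
        s = max(0.0, s + z - k)  # accumulate only upward deviations
        if s > h:
            alarms.append(t)
            s = 0.0  # restart accumulation after an alarm
    return alarms


# Made-up weekly values with a single spike in week 8.
weekly = [10, 11, 9, 10, 12, 10, 11, 10, 30, 11]
print(cusum_alarms(weekly))  # [8]
```

Because the standardization relies on a sample standard deviation, a detector of this kind can be sensitive to noisy baselines, which is one plausible reason to prefer count-based models for search query data.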