Nagar Ruchit, Yuan Qingyu, Freifeld Clark C, Santillana Mauricio, Nojima Aaron, Chunara Rumi, Brownstein John S
Children's Hospital Informatics Program, Boston Children's Hospital, Boston, MA, United States.
J Med Internet Res. 2014 Oct 20;16(10):e236. doi: 10.2196/jmir.3416.
Twitter has shown some usefulness in predicting influenza cases on a weekly basis in multiple countries and on different geographic scales. Recently, Broniatowski and colleagues suggested Twitter's relevance at the city level for New York City. Here, we examine the case of New York City more closely by analyzing daily Twitter data from temporal and spatiotemporal perspectives. In addition, through manual coding of all tweets, we seek qualitative insights that can help direct future automated searches.
The intent of the study was first to validate the temporal predictive strength of daily Twitter data for influenza-like illness emergency department (ILI-ED) visits during the New York City 2012-2013 influenza season against other available and established datasets (Google search query, or GSQ), and second, to examine the spatial distribution and the spread of geocoded tweets as proxies for potential cases.
From the Twitter Streaming API, 2972 tweets were collected in the New York City region matching the keywords "flu", "influenza", "gripe", and "high fever". The tweets were categorized according to the scheme developed by Lamb et al. A new fourth category was added: an evaluator's estimate of the probability that the subject(s) were sick, to account for the strength of confidence in the validity of the statement. Temporal correlations were computed for tweets against daily ILI-ED visits and daily GSQ volume. The best models were used in linear regression to forecast ILI visits. A weighted, retrospective Poisson model with SaTScan software (n=1484) and a vector map were used for spatiotemporal analysis.
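The temporal step described above (correlating a daily tweet series with daily ILI-ED visits, then fitting a linear regression) can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the function names `pearson_r` and `fit_line` are hypothetical, the counts are invented toy values, and the abstract does not state which software or model specification the authors used.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy daily series (invented for illustration, NOT data from the study).
daily_tweets = [2, 4, 6, 8]
daily_ili_ed_visits = [110, 120, 130, 140]

r = pearson_r(daily_tweets, daily_ili_ed_visits)
slope, intercept = fit_line(daily_tweets, daily_ili_ed_visits)

# Predicted ILI-ED visits for a hypothetical day with 10 matching tweets.
predicted_visits = intercept + slope * 10
```

In practice, the same two functions would be applied once to the infection-related tweet series and once to the GSQ series, so that their correlations with ILI-ED visits can be compared on the same days.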
Infection-related tweets (R=.763) correlated better with ILI-ED visits than the GSQ time series (R=.683) for the same keywords and had a lower mean average percent error (8.4 vs 11.8) for ILI-ED visit prediction in January, the most volatile month of the flu season. SaTScan identified a primary outbreak cluster of high-probability infection tweets with a relative risk of 2.74 compared to medium-probability infection tweets (P=.001) in Northern Brooklyn, within a radius that includes the Barclays Center and the Atlantic Avenue Terminal.
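The percent-error comparison above can be illustrated with a short sketch. It assumes the reported metric follows the conventional definition of mean absolute percent error (the average of |actual - predicted| / actual, times 100); the abstract does not spell out the formula. The function name `mean_abs_percent_error` is hypothetical and the numbers are invented.

```python
def mean_abs_percent_error(actual, predicted):
    """Mean of |actual - predicted| / actual, expressed as a percentage.

    Assumes every actual value is nonzero (daily visit counts usually are).
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    ratios = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Toy values (invented for illustration, NOT data from the study).
actual_visits = [100, 200]
predicted_visits = [110, 180]

error_pct = mean_abs_percent_error(actual_visits, predicted_visits)
```

A lower value means that the forecasts sit closer, on average, to the observed daily visit counts.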
While others have looked at weekly regional tweets, this study is the first to stress-test Twitter for daily city-level data for New York City. Extraction of personal testimonies from infection-related tweets suggests Twitter's strength, both qualitatively and quantitatively, for ILI-ED prediction compared to alternative daily datasets mixed with awareness-based data, such as GSQ. Additionally, granular Twitter data provide important spatiotemporal insights. A tweet vector map may be useful for visualizing city-level spread when local gold standard data are otherwise unavailable.