Nicholas Generous, Geoffrey Fairchild, Alina Deshpande, Sara Y. Del Valle, Reid Priedhorsky
Defense Systems and Analysis Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America.
PLoS Comput Biol. 2014 Nov 13;10(11):e1003892. doi: 10.1371/journal.pcbi.1003892. eCollection 2014 Nov.
Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data, such as social media and search queries, are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness. We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrate that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with r² up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible. Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art.
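The abstract's core idea, a linear model that maps article access counts to official case counts and is evaluated at a forecast horizon, can be illustrated with a minimal sketch. This is not the authors' code: the function name `fit_lagged_linear_model`, the daily time resolution, the two-article setup, and all of the data below are assumptions made purely for illustration, and the paper's article selection procedure is not reproduced here.

```python
import numpy as np


def fit_lagged_linear_model(views, cases, lag_days):
    """Fit cases[t + lag_days] ~ intercept + views[t] @ coef by least squares.

    views:    array (n_days, n_articles) of article request counts (assumed daily)
    cases:    array (n_days,) of official case counts on the same days
    lag_days: non-negative forecast horizon in days (0 means nowcasting)
    Returns (coefficients with the intercept first, r_squared).
    """
    views = np.asarray(views, dtype=float)
    cases = np.asarray(cases, dtype=float)
    n = len(cases) - lag_days                    # number of usable (views, cases) pairs
    X = np.column_stack([np.ones(n), views[:n]])  # intercept column + article counts
    y = cases[lag_days:]                          # cases observed lag_days later
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return coef, 1.0 - ss_res / ss_tot


# Synthetic data only (an assumption, not the paper's data): cases follow the
# first article's traffic from 28 days earlier, plus a little noise.
rng = np.random.default_rng(0)
n_days, lag = 200, 28
views = rng.poisson(lam=500, size=(n_days, 2)).astype(float)
cases = np.empty(n_days)
cases[:lag] = 20.0  # placeholder values; never used as regression targets
cases[lag:] = 5.0 + 0.04 * views[:-lag, 0] + rng.normal(0.0, 0.2, n_days - lag)

coef, r2 = fit_lagged_linear_model(views, cases, lag_days=lag)
print("coefficients (intercept first):", coef)
print("r squared:", r2)
```

Comparing the r² obtained at several values of `lag_days` is one simple way to probe the forecasting value of the traffic signal, in the spirit of the horizons of up to 28 days reported above.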