Collier Nigel, Son Nguyen Truong, Nguyen Ngoc Mai
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan.
J Biomed Semantics. 2011 Oct 6;2 Suppl 5(Suppl 5):S9. doi: 10.1186/2041-1480-2-S5-S9.
Micro-blogging services such as Twitter offer the potential to crowdsource epidemic surveillance in real time. However, Twitter posts ('tweets') are often ambiguous and reactive to media trends. To ground user messages in epidemic response, we focused on tracking reports of self-protective behaviour, such as avoiding public gatherings or increased sanitation, as the basis for further risk analysis.
We created guidelines for tagging self-protective behaviour based on the behaviour response survey of Jones and Salathé (2009). Applying the guidelines to a corpus of 5283 Twitter messages related to influenza-like illness showed a high level of inter-annotator agreement (kappa 0.86). We used unigrams, bigrams and regular expressions as features for two supervised classifiers (SVM and Naive Bayes) to classify tweets into 4 self-reported protective behaviour categories plus a self-reported diagnosis. In addition to classification performance, we report a moderately strong Spearman's rho correlation between classifier output and WHO/NREVSS laboratory data for A(H1N1) in the USA during the 2009-2010 influenza season.
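As a rough illustration of the feature set described above, the following Python sketch turns a tweet into unigram, bigram and regular-expression features. It is a minimal sketch only: the tokenisation rule, the function name `extract_features`, and both regular expressions are hypothetical and are not taken from the study, which does not list its patterns here.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only; the study's actual
# regular expressions are not given in this abstract.
PATTERNS = {
    "re_avoid": re.compile(r"\b(avoid|avoiding|stay(ing)? away from)\b"),
    "re_hygiene": re.compile(r"\b(wash(ing|ed)? (my )?hands|hand sanitizer)\b"),
}


def extract_features(tweet):
    """Return a bag of unigram, bigram and regex-match features for one tweet."""
    # Assumed tokenisation: lowercase runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", tweet.lower())

    # Unigram counts.
    features = Counter(tokens)

    # Bigram counts, joined with "_" (a character the tokeniser never emits,
    # so bigram keys cannot collide with unigram keys).
    features.update(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))

    # Binary flags for each regular expression that matches the normalised text.
    text = " ".join(tokens)
    for name, pattern in PATTERNS.items():
        if pattern.search(text):
            features[name] = 1
    return features
```

Feature dictionaries of this kind could then be vectorised and passed to any SVM or Naive Bayes implementation; that step is omitted here because the abstract does not specify the toolkit or settings used.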
The study adds to evidence supporting a high degree of correlation between pre-diagnostic social media signals and diagnostic influenza case data, pointing the way towards low-cost sensor networks. We believe that the signals we have modelled may be applicable to a wide range of diseases.