Paul Michael J, Dredze Mark
Department of Computer Science and Center for Language and Speech Processing, Johns Hopkins University, Baltimore, Maryland, United States of America.
Department of Computer Science and Center for Language and Speech Processing, Johns Hopkins University, Baltimore, Maryland, United States of America; Human Language Technology Center of Excellence and Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, United States of America.
PLoS One. 2014 Aug 1;9(8):e103408. doi: 10.1371/journal.pone.0103408. eCollection 2014.
By aggregating self-reported health statuses across millions of users, we seek to characterize the variety of health information discussed on Twitter. We describe a topic modeling framework for discovering health topics on Twitter, a social media website. This is an exploratory approach whose goal is to understand which health topics are commonly discussed in social media. This paper describes in detail a statistical topic model created for this purpose, the Ailment Topic Aspect Model (ATAM), as well as our system for filtering general Twitter data using health keywords and supervised classification. We show how ATAM and other topic models can automatically infer health topics in 144 million Twitter messages from 2011 to 2013. ATAM discovered 13 coherent clusters of Twitter messages, some of which correlate with temporal surveillance data for seasonal influenza (r = 0.689) and allergies (r = 0.810), and with geographic survey data on exercise (r = 0.534) and obesity (r = -0.631) in the United States. These results demonstrate that it is possible to automatically discover topics that attain statistically significant correlations with ground truth data while, in contrast to prior work, using minimal human supervision and no historical data to train the model. Additionally, these results demonstrate that a single general-purpose model can identify many different health topics in social media.
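The correlations reported above compare how often a discovered topic appears with an external reference series. A minimal sketch of that kind of comparison follows, assuming the topic's weekly tweet count is first normalized by the total number of health tweets in the same week and then compared with a surveillance rate using Pearson's r. The abstract does not state the exact normalization used, so that step, the function names `topic_rate` and `pearson_r`, and all numbers below are illustrative assumptions rather than the authors' procedure or data.

```python
from math import sqrt


def topic_rate(topic_counts, total_counts):
    """Share of tweets assigned to one topic in each time window.

    Assumption: normalizing by total volume removes growth in overall
    Twitter activity; the paper's actual normalization may differ.
    """
    return [t / n for t, n in zip(topic_counts, total_counts)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy values invented for illustration only; not data from the paper.
weekly_flu_topic_tweets = [120, 150, 310, 420, 380, 200]
weekly_health_tweets = [10000, 10500, 11000, 12000, 11500, 10200]
weekly_surveillance_rate = [1.1, 1.4, 2.9, 4.0, 3.5, 1.8]

rates = topic_rate(weekly_flu_topic_tweets, weekly_health_tweets)
r = pearson_r(rates, weekly_surveillance_rate)
print(f"r = {r:.3f}")
```

The same two functions would apply to the geographic comparisons by replacing weekly windows with per-state values, for example a topic's share of tweets in each state against a state-level survey figure.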