Miller Michele, Banerjee Tanvi, Muppalla Roopteja, Romine William, Sheth Amit
Department of Biological Sciences, Wright State University, Dayton, OH, United States.
Department of Computer Science and Engineering, Wright State University, Dayton, OH, United States.
JMIR Public Health Surveill. 2017 Jun 19;3(2):e38. doi: 10.2196/publichealth.7157.
Harnessing what people are tweeting about Zika requires a computational framework that uses machine learning techniques to recognize relevant Zika tweets and then sort them into disease-specific categories, so that specific societal concerns related to the prevention, transmission, symptoms, and treatment of Zika virus can be addressed.
The purpose of this study was to determine which tweets were relevant to Zika and what people were tweeting about the 4 disease characteristics of Zika: symptoms, transmission, prevention, and treatment.
A combination of natural language processing and machine learning techniques was used to determine what people were tweeting about Zika. Specifically, a two-stage classifier system was built to find relevant tweets about Zika, and then the tweets were categorized into 4 disease categories. Tweets in each disease category were then examined using latent Dirichlet allocation (LDA) to determine the 5 main tweet topics for each disease characteristic.
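The two-stage design described above can be illustrated with a minimal sketch. The abstract does not state which classification algorithm was used, so the multinomial Naive Bayes model here is an assumption chosen for brevity, and the class `NaiveBayes`, the function `classify`, and all example tweets are hypothetical rather than taken from the study. The LDA topic-modeling step is not shown.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes over whitespace tokens (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for w in text.lower().split():
                score += math.log((counts[w] + 1) / denom)  # Laplace smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data, not from the study.
relevance_data = [
    ("zika virus spreads through mosquito bites", "relevant"),
    ("fever and rash are zika symptoms", "relevant"),
    ("use repellent to prevent zika infection", "relevant"),
    ("no vaccine or treatment exists for zika yet", "relevant"),
    ("zika is my favorite band name", "irrelevant"),
    ("going to the beach this weekend", "irrelevant"),
    ("lol that movie was so bad", "irrelevant"),
    ("new phone arrived today", "irrelevant"),
]
category_data = [
    ("zika virus spreads through mosquito bites", "transmission"),
    ("sexual transmission of zika reported", "transmission"),
    ("fever and rash are zika symptoms", "symptoms"),
    ("zika causes joint pain and red eyes", "symptoms"),
    ("use repellent to prevent zika infection", "prevention"),
    ("wear long sleeves to prevent mosquito bites", "prevention"),
    ("no vaccine or treatment exists for zika yet", "treatment"),
    ("researchers test a zika vaccine for treatment", "treatment"),
]

# Stage 1 filters for relevance; stage 2 assigns a disease category.
stage1 = NaiveBayes().fit(*zip(*relevance_data))
stage2 = NaiveBayes().fit(*zip(*category_data))


def classify(tweet):
    """Return a disease category for relevant tweets, or None otherwise."""
    if stage1.predict(tweet) != "relevant":
        return None
    return stage2.predict(tweet)


print(classify("fever and rash are zika symptoms"))
print(classify("going to the beach this weekend"))
```

Separating the relevance filter from the category classifier means the second model only ever sees tweets that the first one accepted, which is the property the two-stage system relies on.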
Over 4 months, 1,234,605 tweets were collected. The number of tweets by males and females was similar (28.47% [351,453/1,234,605] and 23.02% [284,207/1,234,605], respectively). The classifier performed well on the training and test data for relevancy (F1 score=0.87 and 0.99, respectively) and disease characteristics (F1 score=0.79 and 0.90, respectively). Five topics for each category were found and discussed, with a focus on the symptoms category.
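As a quick check of the arithmetic, the gender percentages can be recomputed from the reported counts, and the F1 score can be written out as the harmonic mean of precision and recall. The tweet counts below are the ones reported above; the true-positive, false-positive, and false-negative counts passed to the hypothetical helper `f1_score` are invented for illustration, since the abstract reports only the final F1 values.

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Counts reported in the abstract.
total_tweets = 1_234_605
male_share = 100 * 351_453 / total_tweets
female_share = 100 * 284_207 / total_tweets
print(f"male: {male_share:.2f}%  female: {female_share:.2f}%")

# Hypothetical confusion counts, not from the study.
print(round(f1_score(tp=90, fp=10, fn=10), 2))
```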
We demonstrate how categories of discussion on Twitter about an epidemic can be discovered so that public health officials can understand specific societal concerns within the disease-specific categories. Our two-stage classifier was able to identify relevant tweets to enable more specific analysis, including the specific aspects of Zika that were being discussed as well as misinformation being expressed. Future studies can capture sentiments and opinions on epidemic outbreaks like Zika virus in real time, which will likely inform efforts to educate the public at large.