Kendra Rachel Lynn, Karki Suman, Eickholt Jesse Lee, Gandy Lisa
Department of Computer Science, Central Michigan University, Mount Pleasant, MI, United States.
J Med Internet Res. 2015 Jun 19;17(6):e154. doi: 10.2196/jmir.4220.
User content posted through Twitter has been used for biosurveillance, to characterize public perception of health-related topics, and as a means of distributing information to the general public. Most existing work on Twitter and health care has shown Twitter to be an effective medium for these purposes, but more could be done to provide finer-grained and more efficient access to all pertinent data. Given the diversity of user-generated content, small samples or summary presentations of the data arguably omit a large part of the virtual discussion taking place in the Twittersphere. Still, managing, processing, and querying large amounts of Twitter data is not a trivial task. This work describes tools and techniques capable of handling larger sets of Twitter data and demonstrates their use with the issue of antibiotics.
This work has two principal objectives: (1) to provide an open-source means to efficiently explore all collected tweets and query health-related topics on Twitter, specifically, questions such as what users are saying and how messages are spread, and (2) to characterize the larger discourse taking place on Twitter with respect to antibiotics.
The open-source tools Hadoop, Flume, and Hive were used to collect and query a large number of Twitter posts. To classify tweets by topic, a deep network classifier was trained using a limited number of manually classified tweets. The machine learning approach used also allowed a large number of unclassified tweets to be used to improve performance.
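The kind of query described above can be sketched in plain Python. This is only an illustration of the aggregations involved (distinct users, tweets per day, keyword mentions), not the Hive queries used in the study; the record fields `user`, `day`, and `text`, the sample tweets, and the `summarize` helper are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical sample records; the field names are assumptions,
# not the schema used in the study.
tweets = [
    {"user": "a", "day": "2014-03-01", "text": "Antibiotic resistance is rising"},
    {"user": "b", "day": "2014-03-01", "text": "Finished my antibiotics today"},
    {"user": "a", "day": "2014-03-02", "text": "RT resistance to antibiotics explained"},
    {"user": "c", "day": "2014-03-02", "text": "Do antibiotics help with a cold?"},
    {"user": "d", "day": "2014-03-02", "text": "New report on drug resistance"},
]


def summarize(records, keyword):
    """Return distinct-user count, per-day tweet counts, and keyword mentions."""
    users = {r["user"] for r in records}
    per_day = Counter(r["day"] for r in records)
    # Case-insensitive substring match on the tweet text.
    mentions = sum(1 for r in records if keyword.lower() in r["text"].lower())
    return len(users), dict(per_day), mentions


n_users, per_day, n_resistance = summarize(tweets, "resistance")
```

At the scale of a full tweet collection, the same counts would more plausibly be expressed as grouped aggregate queries in a system such as Hive; the in-memory version here only shows the shape of the questions being asked.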
Query-based analysis of the collected tweets revealed that a large number of users contributed to the online discussion and that resistance was a frequently mentioned topic. Several prominent events related to antibiotics led to spikes in activity, but these were short in duration. The category-based classifier developed was able to correctly classify 70% of manually labeled tweets (using a 10-fold cross-validation procedure and 9 classes). The classifier also performed well when evaluated on a per-category basis.
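The evaluation procedure mentioned above, k-fold cross-validation with overall and per-category accuracy, can be sketched as follows. This is a minimal illustration under stated assumptions: the word-vote classifier is a toy stand-in, not the deep network from the study, and the sample texts, the two labels, and the function names are hypothetical. The study used 10 folds and 9 classes; the example uses 5 folds and 2 classes to stay small.

```python
from collections import Counter, defaultdict


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds of near-equal size (no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]


def train_word_votes(train_x, train_y):
    """Toy stand-in classifier: each word votes for the labels it was seen with."""
    votes = defaultdict(Counter)
    for text, label in zip(train_x, train_y):
        for word in text.lower().split():
            votes[word][label] += 1
    # Fall back to the most common training label when no word is known.
    default = Counter(train_y).most_common(1)[0][0]

    def predict(text):
        tally = Counter()
        for word in text.lower().split():
            tally.update(votes.get(word, {}))
        return tally.most_common(1)[0][0] if tally else default

    return predict


def cross_validate(texts, labels, train, k=10):
    """Return overall accuracy and per-category accuracy over k held-out folds."""
    correct = 0
    cat_correct, cat_total = Counter(), Counter()
    for fold in k_fold_indices(len(texts), k):
        held_out = set(fold)
        train_x = [t for i, t in enumerate(texts) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        predict = train(train_x, train_y)
        for i in fold:
            cat_total[labels[i]] += 1
            if predict(texts[i]) == labels[i]:
                correct += 1
                cat_correct[labels[i]] += 1
    per_category = {c: cat_correct[c] / cat_total[c] for c in cat_total}
    return correct / len(texts), per_category


# Hypothetical labeled examples, alternating between two categories.
texts = [
    "antibiotic resistance is growing",
    "finish the full course as prescribed",
    "superbugs show resistance to drugs",
    "ask your doctor before taking them",
    "resistance threatens modern medicine",
    "do not share your prescribed pills",
    "new study on resistance genes",
    "your doctor can advise on dosage",
    "drug resistance spreads in hospitals",
    "take pills with food as prescribed",
]
labels = ["resistance", "advice"] * 5

accuracy, per_category = cross_validate(texts, labels, train_word_votes, k=5)
```

Each tweet is held out exactly once, so the overall accuracy is the fraction of all labeled tweets predicted correctly by a model that never saw them during training. Reporting accuracy per category as well guards against a classifier that does well overall only because one category dominates.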
Using existing tools such as Hive, Flume, and Hadoop, together with machine learning techniques, it is possible to construct tools and workflows to collect and query large amounts of Twitter data to characterize the larger discussion taking place on Twitter with respect to a particular health-related topic. Furthermore, using newer machine learning techniques and a limited number of manually labeled tweets, an entire body of collected tweets can be classified to indicate what topics are driving the virtual, online discussion. The resulting classifier can also be used to efficiently explore collected tweets by category and search for messages of interest or exemplary content.