Zhou Sicheng, Zhao Yunpeng, Bian Jiang, Haynos Ann F, Zhang Rui
Institute for Health Informatics, University of Minnesota, Minneapolis, MN, United States.
Department of Health Outcomes & Biomedical Informatics, University of Florida, Gainesville, FL, United States.
JMIR Med Inform. 2020 Oct 30;8(10):e18273. doi: 10.2196/18273.
Eating disorders (EDs) are a group of mental illnesses that have an adverse effect on both mental and physical health. As social media platforms (eg, Twitter) have become an important data source for public health research, some studies have qualitatively explored the ways in which EDs are discussed on these platforms. Initial results suggest that such research offers a promising method for further understanding this group of diseases. Nevertheless, an efficient computational method is needed to further identify and analyze tweets relevant to EDs on a larger scale.
This study aims to develop and validate a machine learning-based classifier to identify tweets related to EDs and to explore factors (ie, topics) related to EDs using a topic modeling method.
We collected potential ED-relevant tweets using keywords from previous studies and annotated these tweets into different groups (ie, ED relevant vs irrelevant and then promotional information vs laypeople discussion). Several supervised machine learning methods, such as convolutional neural network (CNN), long short-term memory (LSTM), support vector machine, and naïve Bayes, were developed and evaluated using annotated data. We used the classifier with the best performance to identify ED-relevant tweets and applied a topic modeling method, Correlation Explanation (CorEx), to analyze the content of the identified tweets. To validate these machine learning results, we also collected a cohort of ED-relevant tweets on the basis of manually curated rules.
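The two-stage annotation scheme described above (relevant vs irrelevant, then promotional vs laypeople) can be sketched as a classifier cascade. This is a minimal illustration, not the authors' implementation: it uses a small hand-written multinomial naïve Bayes (one of the baseline methods named above, not the best-performing model), and the training sentences, label names, and the `NaiveBayes` and `classify_tweet` helpers are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def classify_tweet(text, relevance_model, source_model):
    """Step 1 filters out irrelevant tweets; step 2 labels the remaining ones."""
    if relevance_model.predict(text) != "relevant":
        return "irrelevant"
    return source_model.predict(text)


# Invented toy data, for illustration only (not from the study).
relevance_model = NaiveBayes().fit(
    [
        "my anorexia relapse is getting worse",
        "struggling with binge eating again tonight",
        "recovery from bulimia is hard",
        "binge watching my favorite show tonight",
        "the new phone is getting great reviews",
        "great pizza for dinner tonight",
    ],
    ["relevant"] * 3 + ["irrelevant"] * 3,
)
source_model = NaiveBayes().fit(
    [
        "visit our clinic for eating disorder treatment",
        "call now for a free treatment consultation",
        "i am struggling with binge eating again",
        "my recovery from bulimia is hard",
    ],
    ["promotional"] * 2 + ["layperson"] * 2,
)

print(classify_tweet("my bulimia recovery is getting hard", relevance_model, source_model))
```

Running the second model only on tweets that pass the first keeps promotional and laypeople training examples from having to cover off-topic text as well.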
A total of 123,977 tweets were collected during the set period. We annotated a random sample of 2219 tweets for developing the machine learning classifiers. We developed a CNN-LSTM classifier to identify ED-relevant tweets published by laypeople in 2 steps: first relevant versus irrelevant (F score=0.89) and then promotional versus laypeople-published (F score=0.90). A total of 40,790 ED-relevant tweets were identified using the CNN-LSTM classifier. We also identified another set of tweets (ie, 17,632 ED-relevant and 83,557 ED-irrelevant tweets) posted by laypeople using manually specified rules. Using CorEx on all ED-relevant tweets, the topic model identified 162 topics. Overall, the coherence rate for topic modeling was 77.07% (1264/1640), indicating a high quality of the produced topics. The topics were further reviewed and analyzed by a domain expert.
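As a worked check on the reported metrics, the sketch below computes an F score from a confusion matrix and reproduces the coherence rate from the reported fraction. The confusion-matrix counts are hypothetical, chosen only to illustrate the formula (the abstract reports the F scores but not the underlying counts), and the abstract does not state what unit the 1264 and 1640 count, so `coherence_rate` simply evaluates the ratio.

```python
def f_score(tp, fp, fn):
    """F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def coherence_rate(coherent, total):
    """Share of items judged coherent, as a percentage."""
    return 100 * coherent / total


# Hypothetical counts, for illustration only.
example_f = f_score(tp=90, fp=10, fn=10)

# Fraction reported in the abstract.
reported_rate = round(coherence_rate(1264, 1640), 2)

print(round(example_f, 2), reported_rate)
```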
The developed CNN-LSTM classifier could improve the efficiency of identifying ED-relevant tweets compared with the traditional manual-based method. The CorEx topic model was applied separately to the tweets identified by the machine learning-based classifier and to those identified by the traditional manual approach. Highly overlapping topics were observed between the 2 cohorts of tweets. The produced topics were further reviewed by a domain expert. Some of the topics identified from the potential ED tweets may provide new avenues for understanding this serious set of disorders.