Karas Bradley, Qu Sue, Xu Yanji, Zhu Qian
Division of Rare Diseases Research Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, MD, United States.
Division of Pre-Clinical Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Rockville, MD, United States.
Front Artif Intell. 2022 Aug 18;5:948313. doi: 10.3389/frai.2022.948313. eCollection 2022.
Social media has become an important resource for patients with rare diseases and their families to discuss, share, and seek information, given the low prevalence of these diseases and the sparseness of the affected populations. In our previous study, we identified prevalent topics on Reddit for cystic fibrosis (CF) via topic modeling. While we were able to derive the concerns, needs, and questions of patients with CF, we observed challenges with traditional topic modeling techniques, e.g., Latent Dirichlet Allocation (LDA), in the task of topic extraction. Here we extend the previous study with the aim of improving topic modeling performance, by optimizing the LDA model and by examining the Top2Vec model with different embedding models. Because the results showed higher coherence and qualitatively better human readability of the derived topics, we adopted the Top2Vec model with doc2vec as the embedding model as our final model to extract topics from a CF subreddit ("r/CysticFibrosis"). We propose expanding its use to other types of social media data and to other rare diseases, to better assess patients' needs.