Natural language processing of Reddit data to evaluate dermatology patient experiences and therapeutics.

Author information

School of Engineering, University of Pennsylvania, Philadelphia, Pennsylvania.

Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania.

Publication information

J Am Acad Dermatol. 2020 Sep;83(3):803-808. doi: 10.1016/j.jaad.2019.07.014. Epub 2019 Jul 12.

Abstract

BACKGROUND

There is a lack of research studying patient-generated data on Reddit, one of the world's most popular forums with active users interested in dermatology. Techniques within natural language processing, a field of artificial intelligence, can analyze large amounts of text information and extract insights.

OBJECTIVE

To apply natural language processing to Reddit comments about dermatology topics to assess for feasibility and potential for insights and engagement.

METHODS

A software pipeline preprocessed Reddit comments from 2005 to 2017 from 7 popular dermatology-related subforums on Reddit, applied latent Dirichlet allocation, and used spectral clustering to establish cohesive themes and the frequency of word representation and grouped terms within these topics.
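The abstract does not describe the pipeline's implementation, so the following is only a minimal sketch of the first two stages (preprocessing and latent Dirichlet allocation), not the authors' code. It assumes a tiny hand-written stopword list, fits the topic model with collapsed Gibbs sampling (one standard inference method for LDA, chosen here because it needs only the Python standard library), and omits the spectral clustering step. The comments in `TOY_COMMENTS`, the function names, and the hyperparameters `alpha` and `beta` are all hypothetical and are not taken from the study corpus.

```python
import random
import re

# Hypothetical stopword list; a real pipeline would use a much larger one.
STOPWORDS = {"the", "and", "for", "with", "this", "that", "was", "after", "has", "have"}

def tokenize(comment):
    """Lowercase, keep alphabetic tokens, drop stopwords and words under 3 letters."""
    return [w for w in re.findall(r"[a-z]+", comment.lower())
            if len(w) > 2 and w not in STOPWORDS]

def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; returns (doc_topic, topic_word, vocab) counts."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (document, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this token (unnormalized).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta) / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab

def top_words(topic_word, vocab, n=3):
    """Return the n highest-count words for each topic."""
    return [[vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
            for row in topic_word]

# Made-up example comments, for illustration only.
TOY_COMMENTS = [
    "My acne cleared after isotretinoin but my lips were very dry",
    "Isotretinoin dried out my skin, acne is gone though",
    "Eczema flare on my hands, moisturizer and oatmeal baths helped",
    "Oatmeal baths calm my eczema itch, moisturizer every night",
]

docs = [tokenize(c) for c in TOY_COMMENTS]
doc_topic, topic_word, vocab = fit_lda(docs, n_topics=2)
for k, words in enumerate(top_words(topic_word, vocab)):
    print(f"topic {k}: {', '.join(words)}")
```

Because LDA is unsupervised and the sampler is stochastic, the topics recovered on a corpus this small depend on the seed and carry no evaluative weight; the sketch only shows the shape of the computation.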

RESULTS

We created a corpus of 176,000 comments and identified trends in patient engagement in spaces such as eczema and acne, among others, with a focus on homeopathic treatments and isotretinoin.

LIMITATIONS

Latent Dirichlet allocation is an unsupervised model, meaning there is no ground truth to which the model output can be compared. However, because these forums are anonymous, there seems little incentive for patients to be dishonest.

CONCLUSIONS

Reddit data has viability and utility for dermatologic research and engagement with the public, especially for common dermatology topics such as tanning, acne, and psoriasis.
