Babinski Tyler, Karley Sara, Cooper Marita, Shaik Salma, Wang Y Ken
Division of Gastroenterology, Hepatology, and Nutrition, Children's Hospital of Philadelphia, Philadelphia, PA, United States.
Division of Management and Education, University of Pittsburgh at Bradford, Bradford, PA, United States.
J Med Internet Res. 2025 Jul 3;27:e53332. doi: 10.2196/53332.
Inflammatory bowel disease (IBD) is a chronic autoimmune disorder with an increasing prevalence in the general population. Internet-based communities have become vital for communication among patients with IBD, especially throughout the COVID-19 pandemic. However, these internet-based patient-to-patient communications remain largely underexplored.
This study aims to analyze community posts from 3 of the largest IBD support groups on Reddit between March 1, 2020, and December 31, 2022, using a pretrained transformer model, and to validate the classification system's results via comparison to human scoring.
We collected posts (N=53,333) from subreddits r/CrohnsDisease, r/UlcerativeColitis, and r/IBD and classified them using OpenAI's GPT-3.5 Turbo model to determine sentiment, categorize topics, and identify demographic information and mentions of the COVID-19 pandemic. A subset of posts (n=397) was manually scored to measure interrater agreement between human raters and the GPT-3.5 Turbo model.
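The interrater agreement step can be illustrated with a minimal sketch of Fleiss κ, assuming every post in the scored subset received a label from the same number of raters (the human raters plus the model). The function name `fleiss_kappa` and the input layout are illustrative assumptions and are not taken from the study's code.

```python
import math
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss kappa for a list of items, each a list of labels (one per rater).

    Assumes every item was labeled by the same number of raters (>= 2).
    Returns NaN when every rating falls in a single category, because
    chance agreement is then 1 and kappa is undefined.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    pairs = n_raters * (n_raters - 1)
    p_bar = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters) / pairs
        for c in counts
    ) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_e = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    if math.isclose(p_e, 1.0):
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy example (not study data): three raters, two posts, full agreement.
toy = [["male", "male", "male"], ["female", "female", "female"]]
print(fleiss_kappa(toy))
```

Gwet AC1 uses the same observed agreement but a different chance-agreement term, which makes it less sensitive to heavily skewed category distributions.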
Fleiss κ and Gwet AC1 coefficients indicated moderate to almost perfect agreement between raters, with values ranging from 0.53 to 0.91. The raters demonstrated almost perfect agreement on the classification of gender, with a Fleiss κ of 0.91 (P<.001). Medications (14,909/53,333) and symptoms (14,939/53,333) emerged as the most discussed topics, and most posts conveyed a neutral sentiment. While most users did not disclose their age, those who did primarily belonged to the 20-29 years (2392/4828) and 30-39 years (859/4828) age groups. Based on self-reported gender, we identified 1509 men and 1502 women among our IBD Reddit users. When comparing the users on the IBD subreddits to the general IBD population, there was a significant difference in gender distribution (N=3,090,011; χ²=69.53; P<.001; φ<0.001). After an initial spike in posts within the first month, most posts did not reference the COVID-19 pandemic.
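For readers who want to see how a gender-distribution comparison of this kind can be computed, the following is a minimal sketch of a Pearson χ² test on a 2×2 table with the φ effect size. It assumes a 2×2 layout (group by gender) without continuity correction; the abstract does not state which test variant was used, and the general-population counts are not given, so the usage example uses toy numbers rather than study data. The function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g. subreddit users vs. reference population) and
    columns are categories (e.g. men vs. women). No continuity correction.
    Returns (chi2, p_value, phi).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs a nonzero total")

    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    # Phi coefficient: effect size for a 2x2 table.
    phi = math.sqrt(chi2 / n)
    return chi2, p_value, phi


# Toy counts for illustration only (not the study's data).
chi2, p_value, phi = chi_square_2x2(10, 20, 20, 10)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, phi={phi:.3f}")
```

Because φ divides χ² by the total sample size, a very large N can yield a statistically significant result alongside a very small effect size.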
Our study showcases the potential of generative pretrained transformer models in processing and extracting insights from medical social media data. Future research can benefit from further subanalyses of our validated dataset or use OpenAI's model to analyze social media data for other conditions, particularly those for which patient experiences are challenging to collect.