Mining of Textual Health Information from Reddit: Analysis of Chronic Diseases With Extracted Entities and Their Relations.

Authors

Foufi Vasiliki, Timakum Tatsawan, Gaudet-Blavignac Christophe, Lovis Christian, Song Min

Affiliations

Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland.

Faculty of Medicine, University of Geneva, Geneva, Switzerland.

Publication information

J Med Internet Res. 2019 Jun 13;21(6):e12876. doi: 10.2196/12876.

DOI:10.2196/12876

PMID:31199327

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6595941/

Abstract

BACKGROUND

Social media platforms constitute a rich data source for natural language processing tasks such as named entity recognition, relation extraction, and sentiment analysis. In particular, health-related social media platforms offer insight into patients' experiences with diseases and treatments that differs from what is found in the scientific literature.

OBJECTIVE

This paper reports a study of entities related to chronic diseases, and the relations among them, in user-generated text posts. Our research focuses on the biomedical entities found on health social media platforms, the relations among those entities, and the way people with chronic diseases express themselves.

METHODS

We collected a corpus of 17,624 text posts from disease-specific subreddits of the social news and discussion website Reddit. For entity and relation extraction from this corpus, we employed the PKDE4J tool developed by Song et al (2015). PKDE4J is a text mining system that integrates dictionary-based entity extraction and rule-based relation extraction in a highly flexible and extensible framework.
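The combination of dictionary-based entity extraction and rule-based relation extraction described above can be illustrated with a minimal Python sketch. This is not PKDE4J's API or its actual rule set: the toy dictionary, the function names, the sample posts, and the single co-occurrence rule (two entities of different types in the same post form a relation pair) are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy dictionary: surface form -> entity type.
# A real system would load large curated biomedical dictionaries.
ENTITY_DICT = {
    "cancer": "disease",
    "asthma": "disease",
    "lymph": "anatomy",
    "lung": "anatomy",
}


def extract_entities(text):
    """Return (surface form, type) for each dictionary term found in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [(tok, ENTITY_DICT[tok]) for tok in tokens if tok in ENTITY_DICT]


def extract_relations(text):
    """Assumed rule: two entities of different types in one text form a relation pair."""
    entities = extract_entities(text)
    pairs = []
    for (a, type_a), (b, type_b) in combinations(entities, 2):
        if type_a != type_b:
            # Sort by type name so 'anatomy-disease' and 'disease-anatomy'
            # are counted under one label.
            (t1, e1), (t2, e2) = sorted([(type_a, a), (type_b, b)])
            pairs.append((f"{t1}-{t2}", e1, e2))
    return pairs


# Invented example posts, not taken from the Reddit corpus.
posts = [
    "The cancer spread to my lymph nodes.",
    "My asthma got worse this winter.",
]

# Count how often each relation type occurs across all posts.
relation_counts = Counter(
    rel for post in posts for rel, _, _ in extract_relations(post)
)
```

Exact token lookup keeps the sketch short. Multiword terms, spelling variants, and sentence-level rules would need more machinery than is shown here.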

RESULTS

Using PKDE4J, we extracted 2 types of entities and relations: biomedical entities and relations, and subject-predicate-object entity relations. In total, 82,138 entities and 30,341 relation pairs were extracted from the Reddit dataset. The most frequently mentioned entities were those related to oncological disease (2884 occurrences of cancer) and asthma (2180 occurrences). The anatomy-disease relation pair was the most frequent (5550 occurrences), and the most frequent entities in this pair were cancer and lymph. Manual validation of the extracted entities showed very good performance on the entity extraction task: 3682 of 5151 extracted entities (71.48%) were correctly labeled.
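The validation figure can be checked with a few lines of Python. The counts come from the abstract. The per-post averages are derived here for orientation only and are not reported in the source. Variable names are illustrative.

```python
# Counts reported in the abstract.
correct = 3682        # extracted entities judged correctly labeled
validated = 5151      # extracted entities checked manually
entities = 82138      # entities extracted in total
relations = 30341     # relation pairs extracted in total
posts = 17624         # posts in the corpus

# Share of validated entities that were correctly labeled, as a percentage.
precision_pct = round(correct / validated * 100, 2)

# Derived averages (assumption: simple means over all posts, not given in the source).
entities_per_post = entities / posts
relations_per_post = relations / posts

print(f"correctly labeled: {precision_pct}%")
print(f"entities per post: {entities_per_post:.2f}")
print(f"relation pairs per post: {relations_per_post:.2f}")
```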

CONCLUSIONS

This study showed that people are eager to share their personal experience with chronic diseases on social media platforms despite possible privacy and security issues. The results reported in this paper are promising and demonstrate the need for more in-depth studies on the way patients with chronic diseases express themselves on social media platforms.

References

Temporal and Geographic Patterns of Social Media Posts About an Emerging Suicide Game.

J Adolesc Health. 2019 Jul;65(1):94-100. doi: 10.1016/j.jadohealth.2018.12.025. Epub 2019 Feb 26.

Social Media Based Analysis of Opioid Epidemic Using Reddit.

AMIA Annu Symp Proc. 2018 Dec 5;2018:867-876. eCollection 2018.

Tracking Health Related Discussions on Reddit for Public Health Applications.

AMIA Annu Symp Proc. 2018 Apr 16;2017:1362-1371. eCollection 2017.

What Patients Can Tell Us: Topic Analysis for Social Media on Breast Cancer.

JMIR Med Inform. 2017 Jul 31;5(3):e23. doi: 10.2196/medinform.7779.

Motivations and Limitations Associated with Vaping among People with Mental Illness: A Qualitative Analysis of Reddit Discussions.

Int J Environ Res Public Health. 2016 Dec 22;14(1):7. doi: 10.3390/ijerph14010007.

Mining Health Social Media with Sentiment Analysis.

J Med Syst. 2016 Nov;40(11):236. doi: 10.1007/s10916-016-0604-4. Epub 2016 Sep 23.

Analysis of the effect of sentiment analysis on extracting adverse drug reactions from tweets and forum posts.

J Biomed Inform. 2016 Aug;62:148-58. doi: 10.1016/j.jbi.2016.06.007. Epub 2016 Jun 27.

Identifying Liver Cancer and Its Relations with Diseases, Drugs, and Genes: A Literature-Based Approach.

PLoS One. 2016 May 19;11(5):e0156091. doi: 10.1371/journal.pone.0156091. eCollection 2016.

Symptom clusters in women with breast cancer: an analysis of data from social media and a research study.

Qual Life Res. 2016 Mar;25(3):547-57. doi: 10.1007/s11136-015-1156-7. Epub 2015 Oct 17.

PKDE4J: Entity and relation extraction for public knowledge discovery.

J Biomed Inform. 2015 Oct;57:320-32. doi: 10.1016/j.jbi.2015.08.008. Epub 2015 Aug 12.
