Foufi Vasiliki, Timakum Tatsawan, Gaudet-Blavignac Christophe, Lovis Christian, Song Min
Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland.
Faculty of Medicine, University of Geneva, Geneva, Switzerland.
J Med Internet Res. 2019 Jun 13;21(6):e12876. doi: 10.2196/12876.
Social media platforms constitute a rich data source for natural language processing tasks such as named entity recognition, relation extraction, and sentiment analysis. In particular, health-related social media platforms offer insight into patients' experiences with diseases and treatments that differs from what is found in the scientific literature.
This paper aimed to report a study of entities related to chronic diseases and their relations in user-generated text posts. The major focus of our research is the study of biomedical entities found on health social media platforms, the relations among them, and the way people suffering from chronic diseases express themselves.
We collected a corpus of 17,624 text posts from disease-specific subreddits of the social news and discussion website Reddit. For entity and relation extraction from this corpus, we employed the PKDE4J tool developed by Song et al (2015). PKDE4J is a text mining system that integrates dictionary-based entity extraction and rule-based relation extraction in a highly flexible and extensible framework.
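The combination of dictionary-based entity extraction and rule-based relation extraction can be illustrated with a minimal sketch. This is not PKDE4J's code or API: the toy dictionary, the function names, and the single rule used here (pair entities of different types that co-occur in one sentence) are all assumptions made for illustration, and a real system applies much larger dictionaries and richer linguistic rules.

```python
import re
from itertools import combinations

# Hypothetical toy dictionary mapping surface terms to entity types.
# A production dictionary would be far larger.
DICTIONARY = {
    "cancer": "disease",
    "asthma": "disease",
    "lymph": "anatomy",
    "lung": "anatomy",
}


def extract_entities(sentence):
    """Return (term, type) pairs for dictionary terms, in order of appearance."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [(tok, DICTIONARY[tok]) for tok in tokens if tok in DICTIONARY]


def extract_relations(sentence):
    """Assumed rule: relate entities of different types within one sentence."""
    entities = extract_entities(sentence)
    return [(a, b) for a, b in combinations(entities, 2) if a[1] != b[1]]


post = "The cancer spread to my lymph nodes."
print(extract_entities(post))
print(extract_relations(post))
```

For the example post, the sketch finds one disease entity and one anatomy entity and links them as a single anatomy-disease style pair. Sentence-level co-occurrence is a deliberately crude stand-in for rule-based relation extraction; it only shows where the dictionary lookup and the relation rule sit relative to each other.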
Using PKDE4J, we extracted 2 types of entities and relations: biomedical entities and relations, and subject-predicate-object entity relations. In total, 82,138 entities and 30,341 relation pairs were extracted from the Reddit dataset. The most frequently mentioned entities were those related to oncological disease (2884 occurrences of cancer) and asthma (2180 occurrences). The relation pair anatomy-disease was the most frequent (5550 occurrences), the most frequent entities in this pair being cancer and lymph. Manual validation of the extracted entities showed very good performance of the system on the entity extraction task (3682/5151, 71.48% of extracted entities were correctly labeled).
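The validation figure can be checked with a short calculation. The share of extracted entities that were correctly labeled is what is usually called precision; the function name below is an assumption for illustration, and the two counts are the ones reported above.

```python
def precision(correct, extracted):
    """Fraction of extracted entities that were labeled correctly."""
    if extracted <= 0:
        raise ValueError("extracted must be a positive count")
    return correct / extracted


# Counts reported for the manually validated sample.
score = precision(3682, 5151)
print(f"{score:.2%}")  # prints 71.48%
```

Note that this measures only the correctness of what was extracted; it says nothing about entities the system failed to find.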
This study showed that people are eager to share their personal experience with chronic diseases on social media platforms despite possible privacy and security issues. The results reported in this paper are promising and demonstrate the need for more in-depth studies on the way patients with chronic diseases express themselves on social media platforms.