Using Social Media to Help Understand Patient-Reported Health Outcomes of Post-COVID-19 Condition: Natural Language Processing Approach.

Affiliations

Faculty of Health, School of Health Policy and Management, York University, Toronto, ON, Canada.

Vector Institute, Toronto, ON, Canada.

Publication information

J Med Internet Res. 2023 Sep 19;25:e45767. doi: 10.2196/45767.

Abstract

BACKGROUND

While scientific knowledge of post-COVID-19 condition (PCC) is growing, there remains significant uncertainty in the definition of the disease, its expected clinical course, and its impact on daily functioning. Social media platforms can generate valuable insights into patient-reported health outcomes as the content is produced at high resolution by patients and caregivers, representing experiences that may be unavailable to most clinicians.

OBJECTIVE

In this study, we aimed to determine the validity and effectiveness of advanced natural language processing approaches built to derive insight into PCC-related patient-reported health outcomes from the social media platforms Twitter and Reddit. We extracted PCC-related terms, including symptoms and conditions, and measured their occurrence frequency. We compared the outputs with human annotations and clinical outcomes, and we tracked symptom and condition term occurrences over time and across locations to explore the pipeline's potential as a surveillance tool.

METHODS

We used bidirectional encoder representations from transformers (BERT) models to extract and normalize PCC symptom and condition terms from English posts on Twitter and Reddit. We compared 2 named entity recognition models and implemented a 2-step normalization task to map extracted terms to unique concepts in standardized terminology. The normalization steps were done using a semantic search approach with BERT biencoders. We evaluated the effectiveness of BERT models in extracting the terms using a human-annotated corpus and a proximity-based score. We also compared the validity and reliability of the extracted and normalized terms to a web-based survey with more than 3000 participants from several countries.
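As a rough illustration of the normalization idea (mapping an extracted term to its nearest concept by semantic search), the following is a minimal sketch and not the authors' pipeline. It assumes a toy character-trigram embedding in place of a BERT biencoder, performs a single lookup rather than the 2-step procedure, and uses a hypothetical four-entry concept list; the names `embed`, `cosine`, `normalize`, and `CONCEPTS` are invented for this example.

```python
import math
from collections import Counter

def embed(text, n=3):
    """Toy stand-in for a biencoder: a bag of character n-grams (assumption)."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical concept list; a real system would search a standardized terminology.
CONCEPTS = ["fatigue", "headache", "anxiety", "shortness of breath"]

def normalize(term, concepts=CONCEPTS):
    """Return the concept whose embedding is most similar to the extracted term."""
    query = embed(term)
    return max(concepts, key=lambda concept: cosine(query, embed(concept)))

if __name__ == "__main__":
    for term in ["headaches", "fatigued", "short of breath"]:
        print(term, "->", normalize(term))
```

Swapping `embed` for a trained sentence encoder would keep the same nearest-neighbor structure while capturing paraphrases that share no surface characters, which a trigram embedding cannot do.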

RESULTS

UmlsBERT-Clinical had the highest accuracy in predicting entities closest to those extracted by human annotators. The 3 most commonly occurring groups of PCC symptom and condition terms were systemic (such as fatigue), neuropsychiatric (such as anxiety and brain fog), and respiratory (such as shortness of breath). We also found novel symptom and condition terms that had not been categorized in previous studies, such as infection and pain. Fatigue and headaches were among the most frequently co-occurring term pairs across both platforms. In the temporal analysis, neuropsychiatric terms were the most prevalent, followed by the systemic category, on both social media platforms. Our spatial analysis showed that 42% (10,938/26,247) of the analyzed terms included location information, with the majority coming from the United States, United Kingdom, and Canada.
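The co-occurrence result can be pictured as counting, for each post, every unordered pair of distinct normalized terms. The sketch below is one plausible way to do that, under the assumption that each post has already been reduced to a list of normalized terms; the function name `cooccurrence_counts` and the sample posts are made up for illustration and are not data from the study.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(posts):
    """Count how many posts mention each unordered pair of distinct terms."""
    pairs = Counter()
    for terms in posts:
        # Deduplicate within a post and sort so (a, b) and (b, a) are one key.
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs

# Made-up example posts, each already reduced to normalized terms.
sample_posts = [
    ["fatigue", "headache", "brain fog"],
    ["headache", "fatigue"],
    ["anxiety", "fatigue"],
]

if __name__ == "__main__":
    for pair, count in cooccurrence_counts(sample_posts).most_common(3):
        print(pair, count)
```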

CONCLUSIONS

The outcome of our social media-derived pipeline is comparable with the results of peer-reviewed articles relevant to PCC symptoms. Overall, this study provides unique insights into patient-reported health outcomes of PCC and valuable information about the patient's journey that can help health care providers anticipate future needs.

INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): RR2-10.1101/2022.12.14.22283419.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7127/10510753/7c081e84b8b7/jmir_v25i1e45767_fig1.jpg
